Nuttx QEMU x86 Part 2

Wed, Dec 5, 2018

NuttX on x86 with QEMU Part 2ish

Well first off it seems that I failed at actually keeping this up to-date. To be honest I forgot about this entirely until someone asked me about something related to NuttX. I got distracted focusing into GNURadio and a few related projects that I should also get around to writing about. Shortly after the last post I started to dig into to supporting at least 32bit, but quickly ran into odd behavior switching tasks, a rather important feature of the OS. This post is mostly just going through my debugging process, but I tried to add context and useful links and tools that I used. Unfortunately there are a few places where my notes were lacking some information that would have been nice to include. This post does not go into solving the bug, but I will follow up with something closer to a solution in a later post.

Keeping notes and then later trying to write this up is an exercise I would encourage others to take the time to go through. It gave me time to reflect on the process as well as force me to document so much of the context that I try to keep in my head in the moment. For reference my raw notes were only 200 lines, mostly text dumps of tools and nearly no prose, trying to place myself back in the moment with only this and what I still remembered from a month ago was mentally exhausting. My hope is returning to the problem from this document will greatly reduce the cognitive load.

Getting Started

One of the most useful configurations to build Nuttx with when bringing up a new architecture or even just a SoC variant is ostest. You are going to test some basic IO, interrupts and some clocking, all of which are rather important to do anything useful.

After running that build in debug mode we get a GPT fault.

up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 65 task: Idle Task

up_registerdump:  ds:00fe0010 irq:0000000d err:00000064
up_registerdump: edi:00000064 esi:0011f160 ebp:0011f160 esp:0011d5b0
up_registerdump: ebx:00000064 edx:00000000 ecx:00000000 eax:00fea68a
up_registerdump: eip:00000460  cs:00000008 flg:00000082  sp:00000064 ss:00000064

Understanding the Exception

From this we can quickly see the interrupt request (IRQ) is 13, which unfortunately is the General Protection Fault (GPT). This fault can mean a few different things that are outlined in the Intel manual for the architecture, but there is a good summary over here https://wiki.osdev.org/Exceptions#General_Protection_Fault. Usually though it is a bad instruction being executed because the processor has started executing corrupt data or jumped to some place it should not be. This is almost always caused by the stack getting corrupted. We are operating with very small stack sizes so this is not that uncommon, especially with the ostest application.

A quick test is to just increase the default stack size and the stack size for the idle task. Unfortunately this made no difference.

We need to dig into this deeper. We know from the documentation on the GPT, that the instruction pointer that caused the issue is stored in the EIP register 00000460.

Now is probably a good time to take a high level look at how memory is organized to understand why this address does not make sense.

Hello Linker Script

Bellow is a simplified version of the linker script used by this application, I have removed the debug and gnu/gcc specific information.

OUTPUT_ARCH(i386)
ENTRY(__start)
SECTIONS
{
	. = 0x00100000;

	.text : {
		_stext = ABSOLUTE(.);
		*(.text .text.*)
		_etext = ABSOLUTE(.);
	}

	.data ALIGN (0x1000) : {
		_sdata = ABSOLUTE(.);
		*(.data .data.*)
		CONSTRUCTORS
		_edata = ABSOLUTE(.);
	}

	.bss : {
		_sbss = ABSOLUTE(.);
		*(.bss .bss.*)
		*(COMMON)
		_ebss = ABSOLUTE(.);
	}
}

While the details for the SECTIONS part of the file are contained in the ld documentation https://sourceware.org/binutils/docs/ld/SECTIONS.html It can for the most part be read from top to bottom. We see the first section defined at the memory address 0x00100000. Then the .text section is defined after that. Inside of the .text section _stext is assigned to be the start of the .text section. This is now a symbol that will be added to the symbol table with global scope by the linker. The next line *(.text .text.*) defines the input sections, so foo.text and foo.text.bar would both be placed in the .text output section. _etext is very similar to _stext providing a symbol that representing the memory address that closes off the .text output section.

The .data and .bss sections are basically the same. ALIGN for .data is going to shift the start address to make sure it is on a 0x1000 boundary, CONSTRUCTORS is a special area related to CPP, and COMMON is an area for uninitialized data.

It takes some time to get a hang of linker scripts, and they are quite powerful , but the real take away should be this:

.text starts at 0x00100000
.datacomes after .text
.bss comes after .data
There are symbols defined for the start and end of each of these sections

While the section names do not actually matter, the inputs sections that they correspond to do. I am taking some liberties here but simplistically:

.text is the code segment, so most of instructions for the application will be here.
.bss is statically allocated, but uninitialized data
.data is statically allocated, and initialized data, you may have also seen rodata this is data that is read only.

As far as the linker is concerned based on our linker script the memory where it will place the code and data starts at 0x00100000 which is stored in _stext and ends at _ebss. Note this does not mean the the code will not reference code outside of this region.

There is another note I want to make here. There are some times important memory regions for things like the idle thread stack and the heap defined in the entry function for the architecture for x86 on QEMU that is here arch/x86/src/qemu/qemu_head.S The most important thing to keep in mind right now from that file are these three lines:

#define STACKBASE	((_ebss + 0x1f) & 0xffffffe0)
#define IDLE_STACK	(STACKBASE+CONFIG_IDLETHREAD_STACKSIZE)
#define HEAP_BASE	(STACKBASE+CONFIG_IDLETHREAD_STACKSIZE)

The idle thread stack is after the end of the .bss section and then after that stack is the heap which among other allocations will also be used for allocating thread stacks when they are created.

Second Look at Our Exception

Hopefully you can now see why it was concerning that the instruction pointer that triggered the GPF was at 0x00000460, all the code should have been defined in the .text section which starts at 0x00100000. Technically we could have been doing something clever and executing instructions that were stored in variables in .data, .bss or heap but those are also in later areas. What this means is something unexpected happened that caused the processor to jump to that area.

WARNING I’m about to take some serious liberties about how a processor works

Let’s take a detour to a high level discussion of what jumping means for a processor. In normal execution with no interrupts, the processor has an entry point, where it starts reading from performing instructions step by step in order. Now almost all the code we write has loops and if statements in it these are hard or impossible to represent in a linear fashion, so we have conditional and conditional jump instructions that perform logic like these lines:

b = (a < 10 ? 1: 0)
if a < 10: goto foo

We also have instructions that are just that goto statement. For x86 that instruction is jmp which is actually 21 different opp codes depending on what foo is.

This is actually outlined well here, including some pseudo code: https://css.csail.mit.edu/6.858/2014/readings/i386/JMP.htm

If you think about how a function call might work you could define the code the represents the function in some area of memory and then use this jump instruction to go execute it. In that case we know the exact address of the function that we need to go to, so that is not likely to get corrupted. But at the end of the function execution we need to return and that will be different for every call, so before issuing the jump, the context is saved. This context is saved on an area of memory called the stack. If that context is corrupted in the process of executing the function we run the risk of not restoring it properly and then among other risks, not returning to the correct code execution.

For a more detailed look at how this actually works including stack frames take a look at https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames it’s also worth learning a bit about what the ret and call instructions do and how they are like a jmp instruction (they do important things related to the stack) you would not use a jmp in practice to call or return from a function.

One helpful technique for detecting stack corruption is to “colorize” it. The idea is you fill the region of memory with an easy see read pattern like “0xdeadbeef” and then inspect the region of memory looking for two things:

The last bytes to be some different pattern, suggesting that the code may have run out of the allocated stack and then corrupted random other memory past what was allocated (most common)
Areas where the stack has been modified followed by a region unmodified followed but some modified data. This might suggest that this thread was not responsible for corrupting data, but that another thread may have written over its stack. This is not a hard rule, but if you see this pattern you should probably look into what might have caused it.

The logic for colorizing the stack is not actually implemented for the x86 architecture in the upstream code base, only ARM and RISC-V right now.

I have a WIP progress patch to add this: https://github.com/btashton/nuttx/commit/50eaa76131282ec3d0bf6f3e838ed3b974598f39.patch

Using this it was clear that the stack was not getting corrupted in these ways.

Because the instruction that was last executed was in a garbage area of memory it is hard to unwind how we got there. What we can do is leverage QEMU to trace the processor execution.

How Did We Get Here

If you supply QEMU with the debug flag -d in_asm it will dump all input assembly code for each translation block it executes. If the debug symbols are available it will try to match the execution pointer for the block to the symbol. This makes it a lot easier to see what functions are getting executed. Running the ostest again with this flag we can walk back from the assertion handler up_assert and see what the processor was doing up to this point. This wont help if the corruption occurred much earlier, but usually when memory starts to get corrupted things fail shortly after.


----------------
IN: up_block_task
0x00113920:  83 ec 0c                 subl     $0xc, %esp
0x00113923:  57                       pushl    %edi
0x00113924:  e8 a8 05 ff ff           calll    0x103ed1

----------------
IN: up_block_task
0x00113929:  83 c4 10                 addl     $0x10, %esp
0x0011392c:  85 c0                    testl    %eax, %eax
0x0011392e:  75 e1                    jne      0x113911

----------------
IN: up_block_task
0x00113930:  a1 b0 cb 11 00           movl     0x11cbb0, %eax
0x00113935:  83 ec 0c                 subl     $0xc, %esp
0x00113938:  83 c0 78                 addl     $0x78, %eax
0x0011393b:  50                       pushl    %eax
0x0011393c:  e8 c4 05 ff ff           calll    0x103f05

----------------
IN:
0x00000212:  00 f0                    addb     %dh, %al
0x00000214:  53                       pushl    %ebx
0x00000215:  ff 00                    incl     0(%eax)
0x00000217:  f0                       .byte    0xf0
0x00000218:  53                       pushl    %ebx
0x00000219:  ff 00                    incl     0(%eax)
0x0000021b:  f0                       .byte    0xf0
0x0000021c:  53                       pushl    %ebx
0x0000021d:  ff 00                    incl     0(%eax)
0x0000021f:  f0                       .byte    0xf0
0x00000220:  53                       pushl    %ebx
0x00000221:  ff 00                    incl     0(%eax)

<<<<Lots more garbage>>>>>

----------------
IN:
0x00000441:  00 00                    addb     %al, 0(%eax)
0x00000443:  00 00                    addb     %al, 0(%eax)
0x00000445:  00 00                    addb     %al, 0(%eax)
0x00000447:  00 00                    addb     %al, 0(%eax)
0x00000449:  03 50 00                 addl     0(%eax), %edx
0x0000044c:  00 10                    addb     %dl, 0(%eax)
0x0000044e:  00 00                    addb     %al, 0(%eax)
0x00000450:  00 08                    addb     %cl, 0(%eax)
0x00000452:  00 00                    addb     %al, 0(%eax)
0x00000454:  00 00                    addb     %al, 0(%eax)
0x00000456:  00 00                    addb     %al, 0(%eax)
0x00000458:  00 00                    addb     %al, 0(%eax)
0x0000045a:  00 00                    addb     %al, 0(%eax)
0x0000045c:  00 00                    addb     %al, 0(%eax)
0x0000045e:  00 00                    addb     %al, 0(%eax)
0x00000460:  07                       popl     %es

----------------
IN:
0x00103fc4:  fa                       cli
0x00103fc5:  6a 0d                    pushl    $0xd
0x00103fc7:  e9 1f 01 00 00           jmp      0x1040eb

----------------
IN: isr_common
0x001040eb:  60                       pushal
0x001040ec:  66 8c d8                 movw     %ds, %ax
0x001040ef:  50                       pushl    %eax
0x001040f0:  66 b8 10 00              movw     $0x10, %ax
0x001040f4:  8e d8                    movl     %eax, %ds

----------------
IN: isr_common
0x001040f6:  8e c0                    movl     %eax, %es

----------------
IN: isr_common
0x001040f8:  8e e0                    movl     %eax, %fs
0x001040fa:  8e e8                    movl     %eax, %gs
0x001040fc:  89 e0                    movl     %esp, %eax
0x001040fe:  50                       pushl    %eax
0x001040ff:  e8 6c 00 00 00           calll    0x104170

----------------
IN: isr_handler
0x00104170:  83 ec 14                 subl     $0x14, %esp
0x00104173:  8b 44 24 18              movl     0x18(%esp), %eax
0x00104177:  a3 84 cf 11 00           movl     %eax, 0x11cf84
0x0010417c:  50                       pushl    %eax
0x0010417d:  8b 40 24                 movl     0x24(%eax), %eax
0x00104180:  50                       pushl    %eax
0x00104181:  e8 da 85 00 00           calll    0x10c760

----------------
IN: irq_unexpected_isr
0x0010c7b0:  83 ec 14                 subl     $0x14, %esp
0x0010c7b3:  9c                       pushfl
0x0010c7b4:  58                       popl     %eax
0x0010c7b5:  fa                       cli
0x0010c7b6:  6a 41                    pushl    $0x41
0x0010c7b8:  68 4a a8 11 00           pushl    $0x11a84a
0x0010c7bd:  e8 8e 6f 00 00           calll    0x113750

----------------
IN: up_assert
0x00113750:  57                       pushl    %edi
0x00113751:  56                       pushl    %esi
0x00113752:  53                       pushl    %ebx
0x00113753:  8b 1d b0 cb 11 00        movl     0x11cbb0, %ebx
0x00113759:  e8 12 11 00 00           calll    0x114870

This was exactly what we wanted to see we were executing the up_block_task function and then all of a sudden we start running code from 0x00000212 which is unexpected, the data there is valid x86 instructions by chance, until we finally reach 0x00000460 which causes the GPT fault that is handled by the interrupt handler isr_common. This does not mean that up_block_task was responsible for the corruption, but it is a good place to to start.

What is up_block_task doing?

TODO: Include some description about how task switching works. It would make this section a lot easier to follow. For now I point you to http://www.nuttx.org/doku.php?id=wiki:nxinternal:synchronous-switches

We can look at the c code for the function to get some high level overview, and gdb can do a fairly good job of matching the assembly code up with the c code, but in this case I really want to know what is going on with the registers to aline it up with what we see in the QEMU output. For this I love using Radare2, it is a powerful tool for reverse engineering, but I have found it fit nicely in my just regular engineering toolkit. The out of the box experience with it is not great, but once you work though the manual a bit and play around with some tutorials online, it will start to make sense. I have not used IDA Pro, but I have heard it is much easier to use, but you will have to pay for it.

I need to apologize here I will try to recreate this later, but my notes unfortunately mixed two different builds so the addresses do not match quite right, and the notes I have miss the transition from one thread calling up_block_task which then takes us to up_unblock_task. Pretend that the QEMU output before we returned to up_unblock_task looked like this:

IN: up_fullcontextrestore
0x001041b5:  58                       popl     %eax
0x001041b6:  cf                       iretl    

----------------
IN: up_unblock_task
0x00104129:  83                       .byte    0x83

Note that this does not mean that 0x00104129 is the start of up_unblock_task it is in-fact not, it is just a location somewhere inside of that function.

This is a chunk of the Radare2 disassembly at the function:

[0x001040d0 1% 220 nuttx.elf]> pd $r @ sym.up_unblock_task                             
/ (fcn) sym.up_unblock_task 111                                                        
|   sym.up_unblock_task (int arg_1ch);                                                 
|           ; var int local_4h @ esp+0x4                                               
|           ; arg int arg_1ch @ esp+0x1c                                               
|           ; XREFS: CALL 0x0010094c  CALL 0x001011f9  CALL 0x001012a7  CALL 0x0010170e
|           ; XREFS: CALL 0x001018ec  CALL 0x00104124  CALL 0x0010d0af  CALL 0x0010d55e
|           ; XREFS: CALL 0x0010f7f6  CODE 0x00114fc5                                  
|           0x001040d0 b    56             push esi                                    
|           0x001040d1      53             push ebx                                    
|           0x001040d2      83ec10         sub esp, 0x10                               
|           0x001040d5      8b5c241c       mov ebx, dword [arg_1ch]       ; [0x1c:4]=-1
|           0x001040d9      8b35b0d31100   mov esi, dword [obj.g_readytorun]       ; [0
|           0x001040df      53             push ebx                                    
|           0x001040e0      e86bb00000     call sym.sched_removeblocked ;[1]           
|           0x001040e5      891c24         mov dword [esp], ebx                        
|           0x001040e8      e8a3ae0000     call sym.sched_addreadytorun ;[2]           
|           0x001040ed      83c410         add esp, 0x10                               
|           0x001040f0      84c0           test al, al                                 
|       ,=< 0x001040f2      7426           je 0x10411a                 ;[3]            
|       |   0x001040f4      83c678         add esi, 0x78               ; 'x'           
|       |   0x001040f7      8b1584d71100   mov edx, dword [obj.g_current_regs]       ; 
|       |   0x001040fd      85d2           test edx, edx                               
|      ,==< 0x001040ff      741f           je 0x104120                 ;[4]            
|      ||   0x00104101      83ec0c         sub esp, 0xc                                
|      ||   0x00104104 b    56             push esi                                    
|      ||   0x00104105      e846feffff     call sym.up_savestate       ;[5]            
|      ||   0x0010410a      a1b0d31100     mov eax, dword [obj.g_readytorun]       ; [0
|      ||   0x0010410f      83c078         add eax, 0x78               ; 'x'           
|      ||   0x00104112      a384d71100     mov dword [obj.g_current_regs], eax       ; 
|      ||   0x00104117      83c410         add esp, 0x10                               
|      ||   ; CODE XREFS from sym.up_unblock_task (0x1040f2, 0x10412e)                 
|     .-`-> 0x0010411a      58             pop eax                                     
|     :|    0x0010411b      5b             pop ebx                                     
|     :|    0x0010411c      5e             pop esi                                     
|     :|    0x0010411d      c3             ret                                         
      :|    0x0010411e      6690           nop                                         
|     :`--> 0x00104120      83ec0c         sub esp, 0xc                                
|     :     0x00104123      56             push esi                                    
|     :     0x00104124      e818000000     call sym.up_saveusercontext ;[6]            
|     :     0x00104129      83c410         add esp, 0x10                               
|     :     0x0010412c      85c0           test eax, eax                               
|     `===< 0x0010412e      75ea           jne 0x10411a                ;[3]            
|           0x00104130      a1b0d31100     mov eax, dword [obj.g_readytorun]       ; [0

We returned from up_fullcontextrestore which was started execution at 0x00104129 this code is about to take execution back to a thread that was suspended. In our case we take the conditional jump, pop some registers from the stack and then return:

|     .-`-> 0x0010411a      58             pop eax                                     
|     :|    0x0010411b      5b             pop ebx                                     
|     :|    0x0010411c      5e             pop esi                                     
|     :|    0x0010411d      c3             ret

This ret instruction is about to do a lot of nuanced things https://c9x.me/x86/html/file_module_x86_id_280.html

Remember the high level issue that we had was we were running garbage code? Let’s take a look at the address that we are about to go to. From the instruction guide we know that ret “Transfers program control to a return address located on the top of the stack”. Using gdb we can set a break point on address 0x0010411d and then read the esp register:

(gdb) i r esp
esp            0x11dea0	0x11dea0

TODO: Should have shown the values for the symbols defined in the linker script.

0x11dea0 is in the region after .bss that means we are about to start executing code out of the stack or heap, that would be highly unusual.

Something is corrupt in the context we are restoring. By using gdb we can look at the calls to up_unblocktask that we might be returning to. For this code the only likely options is actually the idle task.

and we can have gdb print a nice stack trace for us so we dont have to manually parse the stack frames:

Breakpoint 6, up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:72
72	{
(gdb) bt
#0  up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:72
#1  0x00100951 in task_activate (tcb=0x11f460) at task/task_activate.c:92
#2  0x0010048c in thread_create (name=0x1152c2 "init", ttype=ttype@entry=0 '\000', 
    priority=100, stack_size=2048, entry=0x104b10 <ostest_main>, argv=0x0)
    at task/task_create.c:169
#3  0x00100519 in nxtask_create (name=<optimized out>, priority=<optimized out>, 
    stack_size=<optimized out>, entry=0x104b10 <ostest_main>, argv=0x0)
    at task/task_create.c:233
#4  0x00100351 in os_do_appstart () at init/os_bringup.c:266
#5  os_start_application () at init/os_bringup.c:379
#6  os_bringup () at init/os_bringup.c:453
#7  0x001002d2 in os_start () at init/os_start.c:827
#8  0x00100034 in __start () at chip/qemu_head.S:134

This should be basically the same backtrace that we should see after the call to up_unblock_task returns.

With that same break point we placed above at 0x0010411d we can issue the backtrace command to gdb and see what is surely going to be garbage.

(gdb) bt
#0  0x0010411d in up_unblock_task (tcb=0x115087 <nxsig_timedwait+183>)
    at common/up_unblocktask.c:122
#1  0x0011df10 in ?? ()
#2  0x00000000 in ?? ()

This might indicate the stack is corrupt, but remember that it did not look like anything had obviously trampled on the stack when colorization was turned on? What if we are just not restoring (or storing) the context correctly?

Since we are trying to restore the idle task, we know that exact location of its stack that was defined in qemu_head.S. Let’s just dump that memory region:

(gdb) x/512wx &_ebss 
<snip to the interesting bit>
0x11de00:	0x0011fa23	0x001034a8	0x0000000a	0x00000010
0x11de10:	0x00000031	0x001034a8	0x0000000a	0x0011de30
0x11de20:	0x0011fa23	0x00110e0e	0x0000000a	0x00000000
0x11de30:	0x00000020	0x00110e0e	0x0000000a	0x00000037
0x11de40:	0x0011f7a4	0x001034a8	0x0000000a	0x00000037
0x11de50:	0x0011f7a4	0x001103f4	0x0011c120	0x0011d628
0x11de60:	0x00120320	0x0011093a	0x00120320	0x0011d3cc
0x11de70:	0x00000000	0x00000001	0x00104129	0x00000008
0x11de80:	0x00000016	0x00114241	0x0011c0b8	0x00000006
0x11de90:	0x1dcd6500	0x0011507c	0x00120320	0x00000000
0x11dea0:	0x0011df10	0x00115087	0x00120320	0x00000006
0x11deb0:	0x00120320	0x00115023	0x00000012	0x0011c120
0x11dec0:	0x00000000	0x00000000	0x0000000a	0x0011f7a4
0x11ded0:	0x0011fa20	0x00000000	0x0000000a	0x00000000
0x11dee0:	0x0011dfbc	0x0011df80	0x0011df10	0x00000293
0x11def0:	0x0000000e	0x00114d32	0x0011df10	0x00000000
0x11df00:	0x0011df80	0x00000293	0x0011dfbc	0x00000030
0x11df10:	0x00000000	0x00102645	0x0000000a	0x00000000
0x11df20:	0x0011546e	0x00000000	0x00000000	0x00000000
0x11df30:	0x0012a99c	0x00114e62	0x0011df80	0x00000000
0x11df40:	0x0011546f	0x00112817	0x0011dfbc	0x0000000a
0x11df50:	0x00000005	0x00000000	0x00000000	0x00000000
0x11df60:	0x00000000	0x001137a7	0x00000000	0x00000000
0x11df70:	0x0011df80	0x00000000	0x0011d29c	0x00112e70
0x11df80:	0x00000000	0x1dcd6500	0x00000020	0x0011dfcc
0x11df90:	0x00000000	0x001047c5	0x0007a120	0x00000000
0x11dfa0:	0x0011f460	0x0010443a	0x00000020	0x0011dfcc
0x11dfb0:	0x0011e018	0x0010442a	0x0011fac0	0x0011f460
0x11dfc0:	0x00000000	0x0010438c	0x0011dfcc	0x00000010
0x11dfd0:	0x00000000	0x00000000	0x00000000	0x00000000
0x11dfe0:	0x00000000	0x001009a6	0x00000005	0x0012a99c
0x11dff0:	0x00000020	0x00000000	0x00100970	0x00000008
0x11e000:	0x00000202	0x00103612	0x00120398	0x0011545c
0x11e010:	0x00115470	0x0011f460	0x0011f7a8	0x00000000
0x11e020:	0x00000000	0x001009e0	0x00000000	0x00000000
0x11e030:	0x00000000	0x00000000	0x00000000	0x0011e050
0x11e040:	0x00000000	0x001009ae	0x00000000	0x001202b0
0x11e050:	0x00000020	0x00000000	0x00100970	0x00000008
0x11e060:	0x00000202	0x00104141	0x0011f4d8	0x0011f518
0x11e070:	0x0000001f	0x00000064	0x0011f460	0x00000212
0x11e080:	0x00000000	0x00100951	0x0011f460	0x00000000
0x11e090:	0x00000064	0x00000000	0x00000000	0x00000064
0x11e0a0:	0x00000001	0x0010048c	0x0011f460	0x001152c2
0x11e0b0:	0x00000000	0x00000000	0x00000003	0x0011c528
0x11e0c0:	0x0011e14c	0x001152c2	0x00000006	0x001152a8
0x11e0d0:	0x0011e0e4	0x0011e14c	0x00000000	0x00129000
0x11e0e0:	0x00000000	0x00100351	0x00000800	0x00104b10
0x11e0f0:	0x00000000	0x00104b10	0x00000000	0x001152a8
0x11e100:	0x001152c8	0x0010364b	0x00000001	0x0011e14c
0x11e110:	0x0011e14c	0x001002d2	0x00000002	0x00000006
0x11e120:	0x00000000	0x0010027e	0xdeadbeef	0xdeadbeef
0x11e130:	0x0011e150	0x000e1eb0	0xdeadbeef	0xdeadbeef
0x11e140:	0xdeadbeef	0x00100034	0x0011e14c	0x0011e14c

We know that the context restore is determing a stack point at 0x11dea0 and the top of the stack holds the instruction we are returning to 0x0011df10 which we also know is not correct. We did capture the backtrace from the idle task prior to the task switch so we know what the return address should be 0x00100951. Looking at the stack that value is located at 0x11e084.

gdb will happily set that register for us (gdb) set $sp = 0x11e084

And then we can check the back trace to see if the frames look correct.

(gdb) bt

#0  0x0010411d in up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:122
#1  0x00100951 in task_activate (tcb=0x11f460) at task/task_activate.c:92
#2  0x0010048c in thread_create (name=0x1152c2 "init", ttype=ttype@entry=0 '\000', 
    priority=100, stack_size=2048, entry=0x104b10 <ostest_main>, argv=0x0)
    at task/task_create.c:169
#3  0x00100519 in nxtask_create (name=<optimized out>, priority=<optimized out>, 
    stack_size=<optimized out>, entry=0x104b10 <ostest_main>, argv=0x0)
    at task/task_create.c:233
#4  0x00100351 in os_do_appstart () at init/os_bringup.c:266
#5  os_start_application () at init/os_bringup.c:379
#6  os_bringup () at init/os_bringup.c:453
#7  0x001002d2 in os_start () at init/os_start.c:827
#8  0x00100034 in __start () at chip/qemu_head.S:134

Beautiful! If we tell gdb to continue the application will run for a bit and then throw another exception. This is not that unexpected we manually edited a register and did not actually fix the root issue.

What’s Next

For now that is as far as I have gotten, but hopefully some time in the next week I will be able to dig into the actual bug in the context save/restore logic. That should be documented in part 3.