Nuttx QEMU x86 Part 2
NuttX on x86 with QEMU Part 2ish
Well first off it seems that I failed at actually keeping this up to-date. To be honest I forgot about this entirely until someone asked me about something related to NuttX. I got distracted focusing into GNURadio and a few related projects that I should also get around to writing about. Shortly after the last post I started to dig into to supporting at least 32bit, but quickly ran into odd behavior switching tasks, a rather important feature of the OS. This post is mostly just going through my debugging process, but I tried to add context and useful links and tools that I used. Unfortunately there are a few places where my notes were lacking some information that would have been nice to include. This post does not go into solving the bug, but I will follow up with something closer to a solution in a later post.
Keeping notes and then later trying to write this up is an exercise I would encourage others to take the time to go through. It gave me time to reflect on the process as well as force me to document so much of the context that I try to keep in my head in the moment. For reference my raw notes were only 200 lines, mostly text dumps of tools and nearly no prose, trying to place myself back in the moment with only this and what I still remembered from a month ago was mentally exhausting. My hope is returning to the problem from this document will greatly reduce the cognitive load.
Getting Started
One of the most useful configurations to build Nuttx with when bringing up a new architecture or even just a SoC variant is ostest. You are going to test some basic IO, interrupts and some clocking, all of which are rather important to do anything useful.
After running that build in debug mode we get a GPT fault.
up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 65 task: Idle Task
up_registerdump: ds:00fe0010 irq:0000000d err:00000064
up_registerdump: edi:00000064 esi:0011f160 ebp:0011f160 esp:0011d5b0
up_registerdump: ebx:00000064 edx:00000000 ecx:00000000 eax:00fea68a
up_registerdump: eip:00000460 cs:00000008 flg:00000082 sp:00000064 ss:00000064
Understanding the Exception
From this we can quickly see the interrupt request (IRQ) is 13, which unfortunately is the General Protection Fault (GPT). This fault can mean a few different things that are outlined in the Intel manual for the architecture, but there is a good summary over here https://wiki.osdev.org/Exceptions#General_Protection_Fault. Usually though it is a bad instruction being executed because the processor has started executing corrupt data or jumped to some place it should not be. This is almost always caused by the stack getting corrupted. We are operating with very small stack sizes so this is not that uncommon, especially with the ostest application.
A quick test is to just increase the default stack size and the stack size for the idle task. Unfortunately this made no difference.
We need to dig into this deeper. We know from the documentation on the GPT,
that the instruction pointer that caused the issue is stored in the EIP
register 00000460
.
Now is probably a good time to take a high level look at how memory is organized to understand why this address does not make sense.
Hello Linker Script
Bellow is a simplified version of the linker script used by this application, I have removed the debug and gnu/gcc specific information.
OUTPUT_ARCH(i386)
ENTRY(__start)
SECTIONS
{
. = 0x00100000;
.text : {
_stext = ABSOLUTE(.);
*(.text .text.*)
_etext = ABSOLUTE(.);
}
.data ALIGN (0x1000) : {
_sdata = ABSOLUTE(.);
*(.data .data.*)
CONSTRUCTORS
_edata = ABSOLUTE(.);
}
.bss : {
_sbss = ABSOLUTE(.);
*(.bss .bss.*)
*(COMMON)
_ebss = ABSOLUTE(.);
}
}
While the details for the SECTIONS part of the file are contained in the ld
documentation https://sourceware.org/binutils/docs/ld/SECTIONS.html
It can for the most part be read from top to bottom. We see the first section
defined at the memory address 0x00100000
. Then the .text
section is
defined after that. Inside of the .text
section _stext
is assigned to be
the start of the .text
section. This is now a symbol that will be added to
the symbol table with global scope by the linker. The next line
*(.text .text.*)
defines the input sections, so foo.text
and foo.text.bar
would both be placed in the .text
output section. _etext
is very similar
to _stext
providing a symbol that representing the memory address that
closes off the .text
output section.
The .data
and .bss
sections are basically the same. ALIGN
for .data
is going to shift the start address to make sure it is on a 0x1000
boundary,
CONSTRUCTORS
is a special area related to CPP, and COMMON
is an area for
uninitialized data.
It takes some time to get a hang of linker scripts, and they are quite powerful , but the real take away should be this:
.text
starts at0x00100000
.data
comes after.text
.bss
comes after.data
- There are symbols defined for the start and end of each of these sections
While the section names do not actually matter, the inputs sections that they correspond to do. I am taking some liberties here but simplistically:
.text
is the code segment, so most of instructions for the application will be here..bss
is statically allocated, but uninitialized data.data
is statically allocated, and initialized data, you may have also seenrodata
this is data that is read only.
As far as the linker is concerned based on our linker script the memory where it
will place the code and data starts at 0x00100000
which is stored in _stext
and ends at _ebss
. Note this does not mean the the code will not reference
code outside of this region.
There is another note I want to make here. There are some times important
memory regions for things like the idle thread stack and the heap defined
in the entry function for the architecture for x86 on QEMU that is here
arch/x86/src/qemu/qemu_head.S
The most important thing to keep in mind right now from that file are these
three lines:
#define STACKBASE ((_ebss + 0x1f) & 0xffffffe0)
#define IDLE_STACK (STACKBASE+CONFIG_IDLETHREAD_STACKSIZE)
#define HEAP_BASE (STACKBASE+CONFIG_IDLETHREAD_STACKSIZE)
The idle thread stack is after the end of the .bss
section and then after
that stack is the heap which among other allocations will also be used for
allocating thread stacks when they are created.
Second Look at Our Exception
Hopefully you can now see why it was concerning that the instruction pointer
that triggered the GPF was at 0x00000460
, all the code should have been
defined in the .text
section which starts at 0x00100000
. Technically we
could have been doing something clever and executing instructions that were
stored in variables in .data
, .bss
or heap but those are also in later
areas. What this means is something unexpected happened that caused the
processor to jump to that area.
WARNING I’m about to take some serious liberties about how a processor works
Let’s take a detour to a high level discussion of what jumping means for a processor. In normal execution with no interrupts, the processor has an entry point, where it starts reading from performing instructions step by step in order. Now almost all the code we write has loops and if statements in it these are hard or impossible to represent in a linear fashion, so we have conditional and conditional jump instructions that perform logic like these lines:
b = (a < 10 ? 1: 0)
if a < 10: goto foo
We also have instructions that are just that goto
statement. For x86 that
instruction is jmp
which is actually 21 different opp codes depending on what
foo
is.
This is actually outlined well here, including some pseudo code: https://css.csail.mit.edu/6.858/2014/readings/i386/JMP.htm
If you think about how a function call might work you could define the code the represents the function in some area of memory and then use this jump instruction to go execute it. In that case we know the exact address of the function that we need to go to, so that is not likely to get corrupted. But at the end of the function execution we need to return and that will be different for every call, so before issuing the jump, the context is saved. This context is saved on an area of memory called the stack. If that context is corrupted in the process of executing the function we run the risk of not restoring it properly and then among other risks, not returning to the correct code execution.
For a more detailed look at how this actually works including stack frames take
a look at
https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames it’s
also worth learning a bit about what the ret
and call
instructions do and
how they are like a jmp
instruction (they do important things related to
the stack) you would not use a jmp
in practice to call or return from
a function.
One helpful technique for detecting stack corruption is to “colorize” it. The idea is you fill the region of memory with an easy see read pattern like “0xdeadbeef” and then inspect the region of memory looking for two things:
- The last bytes to be some different pattern, suggesting that the code may have run out of the allocated stack and then corrupted random other memory past what was allocated (most common)
- Areas where the stack has been modified followed by a region unmodified followed but some modified data. This might suggest that this thread was not responsible for corrupting data, but that another thread may have written over its stack. This is not a hard rule, but if you see this pattern you should probably look into what might have caused it.
The logic for colorizing the stack is not actually implemented for the x86 architecture in the upstream code base, only ARM and RISC-V right now.
I have a WIP progress patch to add this: https://github.com/btashton/nuttx/commit/50eaa76131282ec3d0bf6f3e838ed3b974598f39.patch
Using this it was clear that the stack was not getting corrupted in these ways.
Because the instruction that was last executed was in a garbage area of memory it is hard to unwind how we got there. What we can do is leverage QEMU to trace the processor execution.
How Did We Get Here
If you supply QEMU with the debug flag -d in_asm
it will dump all input
assembly code for each translation block it executes. If the debug
symbols are available it will try to match the execution pointer for the block
to the symbol. This makes it a lot easier to see what functions are getting
executed. Running the ostest again with this flag we can walk back from the
assertion handler up_assert
and see what the processor was doing up to this
point. This wont help if the corruption occurred much earlier, but usually
when memory starts to get corrupted things fail shortly after.
----------------
IN: up_block_task
0x00113920: 83 ec 0c subl $0xc, %esp
0x00113923: 57 pushl %edi
0x00113924: e8 a8 05 ff ff calll 0x103ed1
----------------
IN: up_block_task
0x00113929: 83 c4 10 addl $0x10, %esp
0x0011392c: 85 c0 testl %eax, %eax
0x0011392e: 75 e1 jne 0x113911
----------------
IN: up_block_task
0x00113930: a1 b0 cb 11 00 movl 0x11cbb0, %eax
0x00113935: 83 ec 0c subl $0xc, %esp
0x00113938: 83 c0 78 addl $0x78, %eax
0x0011393b: 50 pushl %eax
0x0011393c: e8 c4 05 ff ff calll 0x103f05
----------------
IN:
0x00000212: 00 f0 addb %dh, %al
0x00000214: 53 pushl %ebx
0x00000215: ff 00 incl 0(%eax)
0x00000217: f0 .byte 0xf0
0x00000218: 53 pushl %ebx
0x00000219: ff 00 incl 0(%eax)
0x0000021b: f0 .byte 0xf0
0x0000021c: 53 pushl %ebx
0x0000021d: ff 00 incl 0(%eax)
0x0000021f: f0 .byte 0xf0
0x00000220: 53 pushl %ebx
0x00000221: ff 00 incl 0(%eax)
<<<<Lots more garbage>>>>>
----------------
IN:
0x00000441: 00 00 addb %al, 0(%eax)
0x00000443: 00 00 addb %al, 0(%eax)
0x00000445: 00 00 addb %al, 0(%eax)
0x00000447: 00 00 addb %al, 0(%eax)
0x00000449: 03 50 00 addl 0(%eax), %edx
0x0000044c: 00 10 addb %dl, 0(%eax)
0x0000044e: 00 00 addb %al, 0(%eax)
0x00000450: 00 08 addb %cl, 0(%eax)
0x00000452: 00 00 addb %al, 0(%eax)
0x00000454: 00 00 addb %al, 0(%eax)
0x00000456: 00 00 addb %al, 0(%eax)
0x00000458: 00 00 addb %al, 0(%eax)
0x0000045a: 00 00 addb %al, 0(%eax)
0x0000045c: 00 00 addb %al, 0(%eax)
0x0000045e: 00 00 addb %al, 0(%eax)
0x00000460: 07 popl %es
----------------
IN:
0x00103fc4: fa cli
0x00103fc5: 6a 0d pushl $0xd
0x00103fc7: e9 1f 01 00 00 jmp 0x1040eb
----------------
IN: isr_common
0x001040eb: 60 pushal
0x001040ec: 66 8c d8 movw %ds, %ax
0x001040ef: 50 pushl %eax
0x001040f0: 66 b8 10 00 movw $0x10, %ax
0x001040f4: 8e d8 movl %eax, %ds
----------------
IN: isr_common
0x001040f6: 8e c0 movl %eax, %es
----------------
IN: isr_common
0x001040f8: 8e e0 movl %eax, %fs
0x001040fa: 8e e8 movl %eax, %gs
0x001040fc: 89 e0 movl %esp, %eax
0x001040fe: 50 pushl %eax
0x001040ff: e8 6c 00 00 00 calll 0x104170
----------------
IN: isr_handler
0x00104170: 83 ec 14 subl $0x14, %esp
0x00104173: 8b 44 24 18 movl 0x18(%esp), %eax
0x00104177: a3 84 cf 11 00 movl %eax, 0x11cf84
0x0010417c: 50 pushl %eax
0x0010417d: 8b 40 24 movl 0x24(%eax), %eax
0x00104180: 50 pushl %eax
0x00104181: e8 da 85 00 00 calll 0x10c760
----------------
IN: irq_unexpected_isr
0x0010c7b0: 83 ec 14 subl $0x14, %esp
0x0010c7b3: 9c pushfl
0x0010c7b4: 58 popl %eax
0x0010c7b5: fa cli
0x0010c7b6: 6a 41 pushl $0x41
0x0010c7b8: 68 4a a8 11 00 pushl $0x11a84a
0x0010c7bd: e8 8e 6f 00 00 calll 0x113750
----------------
IN: up_assert
0x00113750: 57 pushl %edi
0x00113751: 56 pushl %esi
0x00113752: 53 pushl %ebx
0x00113753: 8b 1d b0 cb 11 00 movl 0x11cbb0, %ebx
0x00113759: e8 12 11 00 00 calll 0x114870
This was exactly what we wanted to see we were executing the up_block_task
function and then all of a sudden we start running code from 0x00000212
which is unexpected, the data there is valid x86 instructions by chance,
until we finally reach 0x00000460
which causes the GPT fault that is
handled by the interrupt handler isr_common
. This does not mean that
up_block_task
was responsible for the corruption, but it is a good place to
to start.
What is up_block_task doing?
TODO: Include some description about how task switching works. It would make this section a lot easier to follow. For now I point you to http://www.nuttx.org/doku.php?id=wiki:nxinternal:synchronous-switches
We can look at the c code for the function to get some high level overview, and gdb can do a fairly good job of matching the assembly code up with the c code, but in this case I really want to know what is going on with the registers to aline it up with what we see in the QEMU output. For this I love using Radare2, it is a powerful tool for reverse engineering, but I have found it fit nicely in my just regular engineering toolkit. The out of the box experience with it is not great, but once you work though the manual a bit and play around with some tutorials online, it will start to make sense. I have not used IDA Pro, but I have heard it is much easier to use, but you will have to pay for it.
I need to apologize here I will try to recreate this later, but my notes
unfortunately mixed two different builds so the addresses do not match quite
right, and the notes I have miss the transition from one thread calling
up_block_task
which then takes us to up_unblock_task
. Pretend that the QEMU
output before we returned to up_unblock_task
looked like this:
IN: up_fullcontextrestore
0x001041b5: 58 popl %eax
0x001041b6: cf iretl
----------------
IN: up_unblock_task
0x00104129: 83 .byte 0x83
Note that this does not mean that 0x00104129
is the start of up_unblock_task
it is in-fact not, it is just a location somewhere inside of that function.
This is a chunk of the Radare2 disassembly at the function:
[0x001040d0 1% 220 nuttx.elf]> pd $r @ sym.up_unblock_task
/ (fcn) sym.up_unblock_task 111
| sym.up_unblock_task (int arg_1ch);
| ; var int local_4h @ esp+0x4
| ; arg int arg_1ch @ esp+0x1c
| ; XREFS: CALL 0x0010094c CALL 0x001011f9 CALL 0x001012a7 CALL 0x0010170e
| ; XREFS: CALL 0x001018ec CALL 0x00104124 CALL 0x0010d0af CALL 0x0010d55e
| ; XREFS: CALL 0x0010f7f6 CODE 0x00114fc5
| 0x001040d0 b 56 push esi
| 0x001040d1 53 push ebx
| 0x001040d2 83ec10 sub esp, 0x10
| 0x001040d5 8b5c241c mov ebx, dword [arg_1ch] ; [0x1c:4]=-1
| 0x001040d9 8b35b0d31100 mov esi, dword [obj.g_readytorun] ; [0
| 0x001040df 53 push ebx
| 0x001040e0 e86bb00000 call sym.sched_removeblocked ;[1]
| 0x001040e5 891c24 mov dword [esp], ebx
| 0x001040e8 e8a3ae0000 call sym.sched_addreadytorun ;[2]
| 0x001040ed 83c410 add esp, 0x10
| 0x001040f0 84c0 test al, al
| ,=< 0x001040f2 7426 je 0x10411a ;[3]
| | 0x001040f4 83c678 add esi, 0x78 ; 'x'
| | 0x001040f7 8b1584d71100 mov edx, dword [obj.g_current_regs] ;
| | 0x001040fd 85d2 test edx, edx
| ,==< 0x001040ff 741f je 0x104120 ;[4]
| || 0x00104101 83ec0c sub esp, 0xc
| || 0x00104104 b 56 push esi
| || 0x00104105 e846feffff call sym.up_savestate ;[5]
| || 0x0010410a a1b0d31100 mov eax, dword [obj.g_readytorun] ; [0
| || 0x0010410f 83c078 add eax, 0x78 ; 'x'
| || 0x00104112 a384d71100 mov dword [obj.g_current_regs], eax ;
| || 0x00104117 83c410 add esp, 0x10
| || ; CODE XREFS from sym.up_unblock_task (0x1040f2, 0x10412e)
| .-`-> 0x0010411a 58 pop eax
| :| 0x0010411b 5b pop ebx
| :| 0x0010411c 5e pop esi
| :| 0x0010411d c3 ret
:| 0x0010411e 6690 nop
| :`--> 0x00104120 83ec0c sub esp, 0xc
| : 0x00104123 56 push esi
| : 0x00104124 e818000000 call sym.up_saveusercontext ;[6]
| : 0x00104129 83c410 add esp, 0x10
| : 0x0010412c 85c0 test eax, eax
| `===< 0x0010412e 75ea jne 0x10411a ;[3]
| 0x00104130 a1b0d31100 mov eax, dword [obj.g_readytorun] ; [0
We returned from up_fullcontextrestore
which was started execution at
0x00104129
this code is about to take execution back to a thread that was
suspended. In our case we take the conditional jump, pop some registers from
the stack and then return:
| .-`-> 0x0010411a 58 pop eax
| :| 0x0010411b 5b pop ebx
| :| 0x0010411c 5e pop esi
| :| 0x0010411d c3 ret
This ret
instruction is about to do a lot of nuanced things
https://c9x.me/x86/html/file_module_x86_id_280.html
Remember the high level issue that we had was we were running garbage code?
Let’s take a look at the address that we are about to go to. From the
instruction guide we know that ret
“Transfers program control to a return
address located on the top of the stack”. Using gdb we can set a break point
on address 0x0010411d
and then read the esp
register:
(gdb) i r esp
esp 0x11dea0 0x11dea0
TODO: Should have shown the values for the symbols defined in the linker script.
0x11dea0
is in the region after .bss
that means we are about to start
executing code out of the stack or heap, that would be highly unusual.
Something is corrupt in the context we are restoring. By using gdb we can look at the calls to up_unblocktask that we might be returning to. For this code the only likely options is actually the idle task.
and we can have gdb print a nice stack trace for us so we dont have to manually parse the stack frames:
Breakpoint 6, up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:72
72 {
(gdb) bt
#0 up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:72
#1 0x00100951 in task_activate (tcb=0x11f460) at task/task_activate.c:92
#2 0x0010048c in thread_create (name=0x1152c2 "init", ttype=ttype@entry=0 '\000',
priority=100, stack_size=2048, entry=0x104b10 <ostest_main>, argv=0x0)
at task/task_create.c:169
#3 0x00100519 in nxtask_create (name=<optimized out>, priority=<optimized out>,
stack_size=<optimized out>, entry=0x104b10 <ostest_main>, argv=0x0)
at task/task_create.c:233
#4 0x00100351 in os_do_appstart () at init/os_bringup.c:266
#5 os_start_application () at init/os_bringup.c:379
#6 os_bringup () at init/os_bringup.c:453
#7 0x001002d2 in os_start () at init/os_start.c:827
#8 0x00100034 in __start () at chip/qemu_head.S:134
This should be basically the same backtrace that we should see after the call
to up_unblock_task
returns.
With that same break point we placed above at 0x0010411d
we can issue the
backtrace command to gdb and see what is surely going to be garbage.
(gdb) bt
#0 0x0010411d in up_unblock_task (tcb=0x115087 <nxsig_timedwait+183>)
at common/up_unblocktask.c:122
#1 0x0011df10 in ?? ()
#2 0x00000000 in ?? ()
This might indicate the stack is corrupt, but remember that it did not look like anything had obviously trampled on the stack when colorization was turned on? What if we are just not restoring (or storing) the context correctly?
Since we are trying to restore the idle task, we know that exact location
of its stack that was defined in qemu_head.S
. Let’s just dump that memory
region:
(gdb) x/512wx &_ebss
<snip to the interesting bit>
0x11de00: 0x0011fa23 0x001034a8 0x0000000a 0x00000010
0x11de10: 0x00000031 0x001034a8 0x0000000a 0x0011de30
0x11de20: 0x0011fa23 0x00110e0e 0x0000000a 0x00000000
0x11de30: 0x00000020 0x00110e0e 0x0000000a 0x00000037
0x11de40: 0x0011f7a4 0x001034a8 0x0000000a 0x00000037
0x11de50: 0x0011f7a4 0x001103f4 0x0011c120 0x0011d628
0x11de60: 0x00120320 0x0011093a 0x00120320 0x0011d3cc
0x11de70: 0x00000000 0x00000001 0x00104129 0x00000008
0x11de80: 0x00000016 0x00114241 0x0011c0b8 0x00000006
0x11de90: 0x1dcd6500 0x0011507c 0x00120320 0x00000000
0x11dea0: 0x0011df10 0x00115087 0x00120320 0x00000006
0x11deb0: 0x00120320 0x00115023 0x00000012 0x0011c120
0x11dec0: 0x00000000 0x00000000 0x0000000a 0x0011f7a4
0x11ded0: 0x0011fa20 0x00000000 0x0000000a 0x00000000
0x11dee0: 0x0011dfbc 0x0011df80 0x0011df10 0x00000293
0x11def0: 0x0000000e 0x00114d32 0x0011df10 0x00000000
0x11df00: 0x0011df80 0x00000293 0x0011dfbc 0x00000030
0x11df10: 0x00000000 0x00102645 0x0000000a 0x00000000
0x11df20: 0x0011546e 0x00000000 0x00000000 0x00000000
0x11df30: 0x0012a99c 0x00114e62 0x0011df80 0x00000000
0x11df40: 0x0011546f 0x00112817 0x0011dfbc 0x0000000a
0x11df50: 0x00000005 0x00000000 0x00000000 0x00000000
0x11df60: 0x00000000 0x001137a7 0x00000000 0x00000000
0x11df70: 0x0011df80 0x00000000 0x0011d29c 0x00112e70
0x11df80: 0x00000000 0x1dcd6500 0x00000020 0x0011dfcc
0x11df90: 0x00000000 0x001047c5 0x0007a120 0x00000000
0x11dfa0: 0x0011f460 0x0010443a 0x00000020 0x0011dfcc
0x11dfb0: 0x0011e018 0x0010442a 0x0011fac0 0x0011f460
0x11dfc0: 0x00000000 0x0010438c 0x0011dfcc 0x00000010
0x11dfd0: 0x00000000 0x00000000 0x00000000 0x00000000
0x11dfe0: 0x00000000 0x001009a6 0x00000005 0x0012a99c
0x11dff0: 0x00000020 0x00000000 0x00100970 0x00000008
0x11e000: 0x00000202 0x00103612 0x00120398 0x0011545c
0x11e010: 0x00115470 0x0011f460 0x0011f7a8 0x00000000
0x11e020: 0x00000000 0x001009e0 0x00000000 0x00000000
0x11e030: 0x00000000 0x00000000 0x00000000 0x0011e050
0x11e040: 0x00000000 0x001009ae 0x00000000 0x001202b0
0x11e050: 0x00000020 0x00000000 0x00100970 0x00000008
0x11e060: 0x00000202 0x00104141 0x0011f4d8 0x0011f518
0x11e070: 0x0000001f 0x00000064 0x0011f460 0x00000212
0x11e080: 0x00000000 0x00100951 0x0011f460 0x00000000
0x11e090: 0x00000064 0x00000000 0x00000000 0x00000064
0x11e0a0: 0x00000001 0x0010048c 0x0011f460 0x001152c2
0x11e0b0: 0x00000000 0x00000000 0x00000003 0x0011c528
0x11e0c0: 0x0011e14c 0x001152c2 0x00000006 0x001152a8
0x11e0d0: 0x0011e0e4 0x0011e14c 0x00000000 0x00129000
0x11e0e0: 0x00000000 0x00100351 0x00000800 0x00104b10
0x11e0f0: 0x00000000 0x00104b10 0x00000000 0x001152a8
0x11e100: 0x001152c8 0x0010364b 0x00000001 0x0011e14c
0x11e110: 0x0011e14c 0x001002d2 0x00000002 0x00000006
0x11e120: 0x00000000 0x0010027e 0xdeadbeef 0xdeadbeef
0x11e130: 0x0011e150 0x000e1eb0 0xdeadbeef 0xdeadbeef
0x11e140: 0xdeadbeef 0x00100034 0x0011e14c 0x0011e14c
We know that the context restore is determing a stack point at 0x11dea0
and the top of the stack holds the instruction we are returning to
0x0011df10
which we also know is not correct. We did capture the backtrace
from the idle task prior to the task switch so we know what the return address
should be 0x00100951
. Looking at the stack that value is located at
0x11e084
.
gdb will happily set that register for us (gdb) set $sp = 0x11e084
And then we can check the back trace to see if the frames look correct.
(gdb) bt
#0 0x0010411d in up_unblock_task (tcb=0x11f460) at common/up_unblocktask.c:122
#1 0x00100951 in task_activate (tcb=0x11f460) at task/task_activate.c:92
#2 0x0010048c in thread_create (name=0x1152c2 "init", ttype=ttype@entry=0 '\000',
priority=100, stack_size=2048, entry=0x104b10 <ostest_main>, argv=0x0)
at task/task_create.c:169
#3 0x00100519 in nxtask_create (name=<optimized out>, priority=<optimized out>,
stack_size=<optimized out>, entry=0x104b10 <ostest_main>, argv=0x0)
at task/task_create.c:233
#4 0x00100351 in os_do_appstart () at init/os_bringup.c:266
#5 os_start_application () at init/os_bringup.c:379
#6 os_bringup () at init/os_bringup.c:453
#7 0x001002d2 in os_start () at init/os_start.c:827
#8 0x00100034 in __start () at chip/qemu_head.S:134
Beautiful! If we tell gdb to continue the application will run for a bit and then throw another exception. This is not that unexpected we manually edited a register and did not actually fix the root issue.
What’s Next
For now that is as far as I have gotten, but hopefully some time in the next week I will be able to dig into the actual bug in the context save/restore logic. That should be documented in part 3.