Kernel Memory Allocator 0xdeadbeef Checking
Jason Banham of Nexenta asks me if I can go through an example of using kmem_flags
for debugging kernel memory allocator problems. kmem_flags
is defined in kmem_impl.h. Documentation exists in Debugging With the Kernel Memory Allocator, (in Modular Debugger Guide), and in Chapter 11.4 of "Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)", McDougall and Mauro, ISBN-13: 978-0131482098.
We'll start by looking at a panic dump that does not havekmem_flags
set. Note that I wrote a little driver thatcauses the panic "on purpose" in order to have an example that I canuse in this blog. I do this because, of course, SmartOS itselfdoesn't have any bugs...Code is available here. I'm doingthis in the global zone on SmartOS.
# cd /var/crash/volatile# lsbounds METRICS.csv vmdump.1# savecore -f vmdump.1savecore: System dump time: Wed Jul 10 23:36:02 2013savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.1Constructing namelist /var/crash/volatile/unix.1Constructing corefile /var/crash/volatile/vmcore.1 0:03 100% done: 88882 of 88882 pages saved## mdb 1Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]> ::statusdebugging crash dump vmcore.1 (64-bit) from 00-0c-29-3d-b0-99operating system: 5.11 joyent_20130529T211010Z (i86pc)image uuid: (not set)panic message: BAD TRAP: type=d (#gp General protection) rp=ffffff00055689f0 addr=ffffff0005568b18dump content: kernel pages only>
So, the panic was caused by a general protection fault. If this was auser level core file, a general protection fault would either be asegmentation violation or a bus error. Let's take alook at the location where the panic occurred. We'll do this byexamining the instruction pointer register.
>
This instruction is moving the contents of the rdx
register into the location pointed to by the rax
register. Let's see where rax
points.
>
That does not look like an address.
>
So, the list_remove()
function is trying to write to aninvalid address. Note that list_remove()
is a documentedfunction (see list_remove(9f)). While it is possible that the function has abug, it is highly unlikely. More likely is that the caller to thefunction is passing in an invalid argument. In fact, the bad addresslooks like ascii.
> 3f3f3f3f3f3f3f3f=s????????>
The "address" is actually a set of question marks!?Looking at the call stack shows:
> $clist_remove+0x1b(ffffff014f753d08, ffffff014ee6f840)zil_itxg_clean+0x97(ffffff01515f4a10)taskq_thread+0x2d0(ffffff014a887e80)thread_start+8()>
The zil_itxg_clean()
function is part of the ZFS code.It's defined here. Let's take a closer look at the list_remove()
function.
> list_remove::dislist_remove: pushq %rbplist_remove+1: movq %rsp,%rbplist_remove+4: subq $0x10,%rsplist_remove+8: movq %rsi,-0x10(%rbp)list_remove+0xc: addq 0x8(%rdi),%rsilist_remove+0x10: movq %rdi,-0x8(%rbp)list_remove+0x14: movq 0x8(%rsi),%raxlist_remove+0x18: movq (%rsi),%rdxlist_remove+0x1b: movq %rdx,(%rax)list_remove+0x1e: movq (%rsi),%raxlist_remove+0x21: movq 0x8(%rsi),%rdxlist_remove+0x25: movq %rdx,0x8(%rax)list_remove+0x29: movq $0x0,0x8(%rsi)list_remove+0x31: movq $0x0,(%rsi)list_remove+0x38: addq $0x10,%rsplist_remove+0x3c: leave list_remove+0x3d: ret >
For those unfamiliar with x86 64-bit assembler code, the first 6arguments to functions are placed in registers. Thelist_remove()
function takes 2 arguments. The firstargument to functions are put by the caller into the rdi
register, the second argument is put into the rsi
register. (Additional arguments, if any, are placed into otherregisters. If there are morethan 6 arguments, the additional arguments after the sixth are pushedon the stack. A complete description of this is beyond the scope ofthis blog post.)
Recall that the panic occurs at list_remove+0x1b
whenmoving data into the location specified by the rax
register. From examining the above code, the rax
register is initialized at list_remove+0x14
from 8 bytesoffset from the location specified in the rsi
register.The rsi
register is added with the contents of therdi
register, offset by 8 bytes atlist_remove+0xc
. The rdi
register containsthe first argument to list_remove()
, which should be alist_t *
. Let's look at the list_t
and thersi
register.
>
So, the list_remove()
routine adds thelist_offset
(which is the location in each member of thelist of the pointer to the next element in the list, and in this case,the "next" pointer is 0, so it is at the beginning of each element) tothe member of the list, and uses it to walk the members looking for amatch on the element being removed. Let's take a look at the elementpointed to by rsi
. This should be an itx_t*
(from looking at the code in zil_itxg_clean()
).
>
So, it looks like the element on the list (an itx_t *
) ispointing at data that contains a lot of question marks (0x3f is theascii code for a question mark). In other words, the list has beencorrupted.
Rather than spend a lot of time trying to figure out how this couldhave happened, we'll reboot the system and set kmem_flags
to 0xf. This will turn on some debugging flags for the kernel memoryallocator. Specifically, this turns on "deadbeef" checking, "redzone"checking, and "auditing". Note that some documentation says to setkmem_flags
to 0x1f. Doing that on SmartOS gives awarning message that the value is not supported. Also, I found theresulting dumps to be much more difficult to analyze. (In fact, theresulting panics did not seem to have anything to do with the normalchecking that deadbeef and redzone is meant to handle).
- Deadbeef checking will cause the kernel memory allocator to fill in
kmem_free(9f)
'd data with the hex pattern0xdeadbeef
. When data is allocated from a free list, the memory allocator checks to see if it is all 0xdeadbeef. If not, it causes the system to panic. This is useful for finding memory that is being used after it is freed. - Redzone checking tells the memory allocator to add the hex pattern
0xfeedface
for 64 bytes at the end of any dynamically allocated memory. When the allocated memory is freed, the memory allocator checks to see if the 0xfeedface pattern is still there. If not, it causes the system to panic. This is useful in finding situations where code has written past the end of the dynamically allocated memory. - Auditing will save a stack trace for each kernel memory allocation and free. We'll see how that can be used shortly.
There are a few ways to turn on the kmem_flags
variable.We could set it in /etc/system
, but since SmartOS uses aram disk for the root file system, it will not be persistent acrossboots. So, we'll do it at boot time. One of the choices from grub atboot time is:
Live 64-bit (text) +kmdb
When the system is coming up, we'll choose that option for booting.On the console, you'll see:
Loading kmdb...Welcome to kmdbLoaded modules: [ unix krtld genunix ][0]>
Here, we'll enter:
[0]> kmem_flags/W 0x1fkmem_flags: 0 = 0x1f[0]> :c
The system will print a warning about overhead using kmem_flags, butshould come up (unless it runs into a problem with dynamicallyallocated memory on the way up).
Once the system is up, we'll try to reproduce the problem. Afterabout 30 seconds (in this case), the system panics. (Without thekmem_flags
setting, the system could run for a long timewithout hitting the problem).When the system panics, because kmdb
has been loaded, itwill drop into kmdb(1). We'll just have it continue which will give us a crash dump. On the console:
kmdb: target stopped at:kmdb_enter+0xb: movq %rax,%rdi[0]> :csyncing file systems...
And the system should create a dump and reboot. (Note that I had to dothis twice, as the first time, the system paniced again when creatingthe dump).
# cd /var/crash/volatile# lsbounds METRICS.csv unix.1 vmcore.1 vmdump.1 vmdump.2## savecore -f vmdump.2savecore: System dump time: Thu Jul 11 22:01:09 2013savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.2Constructing namelist /var/crash/volatile/unix.2Constructing corefile /var/crash/volatile/vmcore.2 0:02 100% done: 164627 of 164627 pages saved## mdb 2Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]>
Let's see how it paniced this time.
> ::statusdebugging crash dump vmcore.2 (64-bit) from 00-0c-29-3d-b0-99operating system: 5.11 joyent_20130529T211010Z (i86pc)image uuid: (not set)panic message: kernel heap corruption detecteddump content: kernel pages only>
The panic message: kernel heap corruption detected
istypically the result of deadbeef checking. When the kernel panicsfrom kmem_flags settings, information is sent to the console. Let'sget that information.
> ::msgbufMESSAGE ... <- output omittedkernel memory allocator: buffer modified after being freedmodification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0x3f3f3f3f3f3f3f3f)buffer=ffffff01513301c0 bufctl=ffffff01517780f0 cache: kmem_alloc_256previous transaction on buffer ffffff01513301c0:thread=ffffff0005c76c40 time=T-21.196117909 slab=ffffff0151351710 cache: kmem_alloc_256kmem_cache_free_debug+10fkmem_cache_free+123kmem_free+4ezil_itxg_clean+abtaskq_thread+2d0thread_start+8panic[cpu0]/thread=ffffff0004cc1c40: kernel heap corruption detectedffffff0004cc0e00 fffffffffba28064 ()ffffff0004cc0e90 genunix:kmem_cache_alloc_debug+1f7 ()ffffff0004cc0ef0 genunix:kmem_cache_alloc+d4 ()ffffff0004cc0f30 genunix:kmem_alloc+4b ()ffffff0004cc0f60 genunix:getrbuf+18 ()ffffff0004cc1030 scsi:scsi_alloc_consistent_buf+191 ()... <-- output omittedffffff0004cc1990 zfs:zio_execute+88 ()ffffff0004cc19f0 zfs:zio_notify_parent+db ()ffffff0004cc1a30 zfs:zio_ready+eb ()ffffff0004cc1a70 zfs:zio_execute+88 ()ffffff0004cc1b30 genunix:taskq_thread+2d0 ()ffffff0004cc1b40 unix:thread_start+8 ()syncing file systems... donedumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel>
At this point, the buffer we are interested in,buffer=ffffff01513301c0
is being allocated bygetrbuf
called indirectly from ZFS code, and it tells usthe most recent transaction on the buffer, which waszil_itxg_clean()
freeing the buffer.Let's use the auditing flag we turned on to see what allocates andfrees this buffer. Note that this gave me ~8000 lines of output, soit has been mostly omitted.
> ffffff01513301c0::kmalogT-21.196116669 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_free_debug+0x10f kmem_cache_free+0x123 kmem_free+0x4e zil_itxg_clean+0xab taskq_thread+0x2d0 thread_start+8T-23.383043939 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_alloc_debug+0x2e8 kmem_cache_alloc+0xd4 kmem_alloc+0x4b zil_itx_create+0x33 zfs_log_write+0x109 zfs_write+0xb0e fop_write+0x8b write+0x250 write32+0x1eT-31.201836987 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_free_debug+0x10f kmem_cache_free+0x123 kmem_free+0x4e zil_itxg_clean+0xab taskq_thread+0x2d0 thread_start+8T-34.183272209 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_alloc_debug+0x2e8 kmem_cache_alloc+0xd4 kmem_alloc+0x4b zil_itx_create+0x33 zfs_log_write+0x109 zfs_write+0xb0e fop_write+0x8b write+0x250 write32+0x1e...T-78.879496922 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_free_debug+0x10f kmem_cache_free+0x123 kmem_free+0x4e kbad_bad+0x4b kbad_ioctl+0x2b cdev_ioctl+0x39 spec_ioctl+0x60 fop_ioctl+0x55 ioctl+0x9bT-78.879497144 addr=ffffff01513301c0 kmem_alloc_256 kmem_cache_alloc_debug+0x2e8 kmem_cache_alloc+0xd4 kmem_alloc+0x4b kbad_bad+0x17 kbad_ioctl+0x2b cdev_ioctl+0x39 spec_ioctl+0x60 fop_ioctl+0x55 ioctl+0x9b...
The above gives lots of output. Let's try looking atbufctl=ffffff01517780f0
(from the ::msgbuf
output).
We'll get all transactions for the specific buffer and bufctl, andjust get unique entries.
> ffffff01517780f0::bufctl -h -a ffffff01513301c0 ! cut -d " " -f 7 | sort -udoor_args+0x261door_call+0x1b5dv_clone_mknod+0x118dv_clone_mknod+0x9ffreerbuf+0x25getrbuf+0x18kbad_bad+0x17kbad_bad+0x4bpageio_done+0x5bpageio_setup+0x5dresolve_pathname+0x29aresolve_pathname+0xd4segvn_free+0x64segvn_vpage+0x5fvdev_disk_io_intr+0x47vdev_disk_io_start+0x53zil_itx_create+0x33zil_itxg_clean+0xab>
If we look at all of the output (i.e., leave off the sort -u
), most of these alloc/free sequencesoccur more than once. We'll start with ones that occur only one timeand examine the code to see if we can find where the pointer to theallocated data is used after it is freed. So,door_args+0x261
and door_call+0x1b5
occursmultiple times so we'll skip it. Butdv_clone_mknod+0x118
and <dv_clone_mknod+0x9fonly occurs once, so we could take a look at that code to see if theaddress returned from the allocation done atdv_clone_mknod+0x9f
is used after it is freed.If we get through all of the alloc/free sequences that occur onlyonce, and still have not found the bug, we could then look at the onesthat occur more than one time. A "defensive" (and probably easier) tactic wouldbe to go through the above routines that are freeing the buffer andset the pointer to NULL after the free. Something like:
kmem_free(addr, size);addr = 0;
This will cause a panic when the pointer (addr
) is used.
Rather than sift through all of the above examples, we'll look at onethat seems obvious: kbad_bad
. (Note that this is partof code I wrote specifically to cause the bug. Remember, there are nobugs in SmartOS :) ).
voidkbad_bad(void){ char *p; int nticks; p = kmem_alloc(KBADSIZE, KM_SLEEP); if (!p) { cmn_err(CE_WARN, "kbad_bad: out of memory\n"); return; } kmem_free(p, KBADSIZE); (void) timeout(kbad_timeout_handler, p, DELAYTIME); return;}
So, the routine allocates memory, frees it, and sets a timer (see timeout(9f)). Whenthe timer goes off, kbad_timeout_handler()
gets calledwith a pointer to the space that was just freed up as an argument.
static voidkbad_timeout_handler(void *arg){ int i; char *p = (char *)arg; for (i = 0; i < KBADSIZE; i++) *p++ = '?';}
And there is the bug. The kbad
module uses the pointer after it is freed.
There are several things to note about this example:
- Most of the time, the data will not be overwritten with question marks, but instead will be something that looks like a data structure or pointer that points into who knows what. This may or may not make the problem easier.
- There are equivalent flags for the libumem(3lib) library. See umem_debug(3MALLOC) for details. You can use
LD_PRELOAD=libumem cmd args...
to cause thelibumem
versions ofmalloc/free
to be used instead of the versions with which the code was built. (But not always. Some versions ofmalloc()
take different arguments). - It will usually not be this easy.
Post written by Mr. Max Bruning