Fun with stack corruption

Today, we were seeing a very odd crash in some xen code. The core dump wasn't of great use, since both %eip and %ebp were zeroed, which means no backtrace. Instead I attached mdb to the process and started stepping through to see what was happening. It soon transpired that we were crashing after successfully executing a C function called xspy_introduce_domain(), but before we got back to the Python code that calls into it. After a little bit of head-scratching, I looked closer at the assembly for this function:`xspy_introduce_domain:            pushl  %ebp`xspy_introduce_domain+1:          movl   %esp,%ebp
...`xspy_introduce_domain+0x57:       subl   $0xc,%esp`xspy_introduce_domain+0x5a:       leal   -0xc(%ebp),%eax`xspy_introduce_domain+0x5d:       pushl  %eax`xspy_introduce_domain+0x5e:       leal   -0x8(%ebp),%eax`xspy_introduce_domain+0x61:       pushl  %eax`xspy_introduce_domain+0x62:       leal   -0x2(%ebp),%eax`xspy_introduce_domain+0x65:       pushl  %eax`xspy_introduce_domain+0x66:       pushl  $0xc4b13114`xspy_introduce_domain+0x6b:       pushl  0xc(%ebp)`xspy_introduce_domain+0x6e:       call   +0x43595220      
...`xspy_introduce_domain+0x11f:      leave`xspy_introduce_domain+0x120:      ret

Seems OK - we're pushing three pointers onto the stack (+0x5a-0x65) and two other arguments. Let's look at the sources:

static PyObject *xspy_introduce_domain(XsHandle *self, PyObject *args)
    domid_t dom;
    unsigned long page;
    unsigned int port;

    struct xs_handle *xh = xshandle(self);
    bool result = 0;

    if (!xh)
        return NULL;
    if (!PyArg_ParseTuple(args, "ili", &dom, &page, &port))
        return NULL;

Looking up the definition of PyArg_ParseTuple() gave me the clue as to the problem. The format string specifies that we're giving the addresses of an int, long, and int. Yet in the assembly, the offsets of the leal instructions indicate we're pushing addresses to two 32-bit storage slots, and one 16-bit slot. So when PyArg_ParseTuple() writes its 32-bit quantity, it's going to overwrite two more bytes than it should.

As it happens, we're at the very top of the local stack storage space (-0x2(%ebp)). So those two bytes actually end up over-writing the bottom two bytes of the old %ebp we pushed at the start of the function. Then we pop that corrupted value back into the %ebp register via the leave. This has no effect until our caller calls leave itself. We move %ebp into %esp, then attempt to pop from the top of this stack into %ebp again. As it happens, the memory pointed to by the corrupt %ebp is zeroed; thus, we end up setting %ebp to zero. Finally, our caller does a ret, which pops another zero, but this time into %eip. Naturally this isn't a happy state of affairs, and we find ourselves with a core dump as described earlier.

Presumably this bug happened in the first place because someone didn't notice that domid_t was a 16-bit quantity. What's amazing is that nobody else has been hitting this problem!