6616864 amd64 syscall handler needs fixing for xen 3.1.1
Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain would segfault immediately. After much debugging and head-scratching, I eventually found the problem. On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen, this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.
On real hardware, %rcx and %r11 have specific meanings: the syscall instruction saves the user return address (%rip) in %rcx and the user %rflags in %r11. Prior to 3.1.1, Xen happened to maintain these values correctly, although the layout of the stack is very different from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each process was corrupted, and the process segfaulted almost immediately. We fixed the bug in Solaris, so that we would still work with 3.1.1. This was also fixed (restoring the original semantics) in Xen itself in time for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1) where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of the hypervisor ABI would have helped, I think.
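To make the register convention concrete, here is a toy model in C of what the hardware does with %rcx and %r11 around a system call. It is only a sketch: the struct and function names are made up for illustration, the flag masking that real hardware applies on entry and return is left out, and it says nothing about the layout of the frame Xen hands to the guest kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYSCALL_INSN_LEN 2  /* the syscall opcode is the two bytes 0F 05 */

/* Toy register file: only what the syscall/sysret pair touches. */
struct regs {
    uint64_t rip;
    uint64_t rflags;
    uint64_t rcx;
    uint64_t r11;
};

/* syscall: the return %rip goes into %rcx, the user %rflags into %r11. */
static void model_syscall(struct regs *r, uint64_t kernel_entry)
{
    assert(r != NULL);
    r->rcx = r->rip + SYSCALL_INSN_LEN;  /* next user instruction */
    r->r11 = r->rflags;                  /* user flags at trap time */
    r->rip = kernel_entry;
}

/* sysret: %rip is reloaded from %rcx and %rflags from %r11. */
static void model_sysret(struct regs *r)
{
    assert(r != NULL);
    r->rip = r->rcx;
    r->rflags = r->r11;
}
```

In this model, anything that runs between the two steps and fails to preserve %r11 sends the process back to userspace with whatever happened to be in that register as its flags, which is consistent with the corruption described above.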
6618391 64-bit xVM lets processes fiddle with kernelspace, but Xen bug saves us
Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE entries on 64-bit. This had some nasty implications, but first, some background.
On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64MB, and refuses to let any of the domains load a segment selector that allows read or write access to that part of the address space. Each domain kernel runs in ring 1, so it can't get around this. On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions via the VM system. Since page table entries only have a single bit ("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot", the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could use the kernel page tables to write to kernel address-space.
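The single-bit limitation is easy to see in a small model of the page-level check. This is a sketch, not kernel code: page_readable() is a hypothetical helper, and the PT_VALID and PT_USER values are the architectural x86 bit positions, named after the constants used in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_VALID 0x001ULL  /* "present" bit */
#define PT_USER  0x004ULL  /* the single user/supervisor bit */

/* Returns true if code running in the given ring (0-3) may read the page. */
static bool page_readable(uint64_t pte, int ring)
{
    assert(ring >= 0 && ring <= 3);
    if (!(pte & PT_VALID))
        return false;
    /* Rings 0, 1 and 2 all count as supervisor; only ring 3 is "user". */
    bool supervisor = (ring < 3);
    return supervisor || (pte & PT_USER) != 0;
}
```

A supervisor-only page is readable from ring 1 just as it is from ring 0, so paging alone cannot hide the hypervisor from a ring 1 kernel. The only privilege level the bit can exclude is ring 3.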
Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost. To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel (running in ring 3, remember) could access its address space, it had to set PT_USER in its kernel page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that. Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace mapping. Thus, it also set PT_GLOBAL, a hardware feature - if such a bit is set, then a corresponding TLB entry is not flushed. This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace were no longer flushed.
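The rule introduced in 3.0.3 can be sketched as a small C function, together with a toy TLB flush that honours the global bit. Both adjust_guest_pte() and flush_tlb_non_global() are hypothetical names for a model of the behaviour described here, not Xen's actual code. The bit values are the architectural x86 ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PT_VALID  0x001ULL
#define PT_USER   0x004ULL
#define PT_GLOBAL 0x100ULL

/* Model of what the hypervisor does to a PTE supplied by the guest. */
static uint64_t adjust_guest_pte(uint64_t pte)
{
    if (!(pte & PT_VALID))
        return pte;              /* not-present entries are left alone */
    if (pte & PT_USER)
        return pte | PT_GLOBAL;  /* assumed userspace: survives the flush */
    return pte | PT_USER;        /* kernel mapping: PT_USER added for us */
}

/* Model a page table switch: drop every entry that is not global.
 * Returns the number of entries that survive. */
static size_t flush_tlb_non_global(uint64_t *tlb, size_t n)
{
    size_t kept = 0;
    assert(tlb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tlb[i] & PT_GLOBAL)
            kept++;
        else
            tlb[i] = 0;
    }
    return kept;
}
```

Note that the model decides purely on PT_USER: it has no other way of knowing whether a mapping belongs to the kernel or to a process.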
Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above, we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught things just right, a kernel mapping would still be present in the TLB when a user-space program was running, allowing userspace to read from the kernel! And indeed, some simple testing showed this:
dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
    xargs -n 1 ~johnlev/bin/i386/readkern | while read ln; do echo $ln::whatis | mdb -k ; done
With the above use of DTrace, MDB, and a little program that attempts to read addresses, we can see output such as:
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
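The source of readkern isn't shown here. A hypothetical stand-in for its core could look like the sketch below: it tries to read a single byte at an arbitrary address and reports whether the read faulted. The name try_read() and the signal-handling approach are assumptions for illustration, not the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

static sigjmp_buf fault_jmp;

/* On a fault, unwind back into try_read() rather than dying. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/* Returns 1 if one byte at addr could be read, 0 if the read faulted. */
static int try_read(uintptr_t addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int ok = 0;

    sa.sa_handler = on_fault;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(fault_jmp, 1) == 0) {
        /* The volatile access forces the compiler to emit the load. */
        unsigned char byte = *(volatile const unsigned char *)addr;
        (void)byte;
        ok = 1;
    }

    /* Put the previous handlers back before returning. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}
```

A wrapper that parses its argument as a hex address and prints it back only when try_read() returns 1 would fit the pipeline above, since each line it emits is handed to ::whatis.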
Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought. When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL set after all (big thanks to MDB's ::vatopfn, which made this easy to see).
Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations. Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken: it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again, the actual fix was simple.
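The shape of that bug can be sketched in C. In this model, emulate_pte_cmpxchg() stands in for the hypervisor's emulation of a trapped compare-and-swap on a page table slot, and the mangle flag selects between the broken behaviour (store the guest's value untouched) and the fixed one. All names are hypothetical and the mangling rule is the simplified one described in this post. Real Xen code is considerably more involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PT_VALID  0x001ULL
#define PT_USER   0x004ULL
#define PT_GLOBAL 0x100ULL

/* Simplified rule: userspace mappings gain PT_GLOBAL, kernel ones PT_USER. */
static uint64_t mangle_pte(uint64_t pte)
{
    if (!(pte & PT_VALID))
        return pte;
    return (pte & PT_USER) ? (pte | PT_GLOBAL) : (pte | PT_USER);
}

/* Emulate a guest compare-and-swap on a PTE slot.
 * mangle == false models the untested path that skipped the adjustment. */
static bool emulate_pte_cmpxchg(_Atomic uint64_t *slot, uint64_t expected,
                                uint64_t desired, bool mangle)
{
    assert(slot != NULL);
    if (mangle)
        desired = mangle_pte(desired);
    /* Succeeds only if the slot still holds the value the guest expected. */
    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```

With mangle set to false, a userspace PTE lands in the page table without PT_GLOBAL. That is the symptom ::vatopfn showed.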
By the way, that same putback also had the implementation of:
6612324 ::threadlist could identify taskq threads
I'd been doing an awful lot of paging through ::threadlist output recently, and always having to jump through all the (usually irrelevant) taskq threads was driving me insane. So now you can just specify ::threadlist -t and get a much, much shorter list.