When Things Explode
Mar 18, 2008
When Things Explode is a new site
about Manchester music, although it’s currently only a blog.
Looks like it’ll be good.
I’ve moved my blog to blogspot.com, so I can add pictures and the like without having to bugger about too much as I did on Advogato.
Just back from the US. Somehow I spent 3 weeks there last time without noticing my hotel was practically next to Rasputin Music. Anyway, I finally did this time on Saturday morning and thankfully had time to do some shopping. At the exchange rate, the 2-disc 12" of Underworld’s Jumbo was a particularly good buy, though I was most excited over this:
That’s the 12" of Trash. Yes, I went all the way to the US to get a record by a local Manchester band. I missed it first time around, and they’re all bought up over here, so I think that’s why I didn’t own it yet. But it’s a great tune.
The daemon xenstored runs in dom0 userspace, and implements a simple 'store' of configuration information. This store is used for storing parameters used by running guest domains, and interacts with dom0, guest domains, qemu, xend, and others. These interactions can easily get pretty complicated as a result, and visualizing how requests and responses are connected can be non-obvious.
The existing community solution was a 'trace' option to xenstored: you could restart the daemon and it would record every operation performed. This worked reasonably well, but was very awkward: restarting xenstored means a reboot of dom0 at this point in time. By the time you've set up tracing, you might not be able to reproduce whatever you're looking at any more. Besides, it's extremely inconvenient.
It was clear that we needed to make this dynamic, and DTrace USDT (User-level Statically Defined Tracing) was the obvious choice. The patch adds a couple of simple probes for tracking requests and responses; as usual, they're activated dynamically, so they have (next to) zero impact when they're not in use. On top of these probes I wrote a simple script called xenstore-snoop. Here are a couple of extracts of the output I get when I start a guest domain:
    # /usr/lib/xen/bin/xenstore-snoop
    DOM  PID     TX   OP
    0    100313  0    XS_GET_DOMAIN_PATH: 6 -> /local/domain/6
    0    100313  0    XS_TRANSACTION_START: -> 930
    0    100313  930  XS_RM: /local/domain/6 -> OK
    0    100313  930  XS_MKDIR: /local/domain/6 -> OK
    ...
    6    0       0    XS_READ: /local/domain/0/backend/vbd/6/0/state -> 4
    6    0       0    XS_READ: device/vbd/0/state -> 3
    0    0       -    XS_WATCH_EVENT: /local/domain/6/device/vbd/0/state FFFFFF0177B8F048
    6    0       -    XS_WATCH_EVENT: device/vbd/0/state FFFFFF00C8A3A550
    6    0       0    XS_WRITE: device/vbd/0/state 4 -> OK
    0    0       0    XS_READ: /local/domain/6/device/vbd/0/state -> 4
    6    0       0    XS_READ: /local/domain/0/backend/vbd/6/0/feature-barrier -> 1
    6    0       0    XS_READ: /local/domain/0/backend/vbd/6/0/sectors -> 16777216
    6    0       0    XS_READ: /local/domain/0/backend/vbd/6/0/info -> 0
    6    0       0    XS_READ: device/vbd/0/device-type -> disk
    6    0       0    XS_WATCH: cpu FFFFFFFFFBC2BE80 -> OK
    6    0       -    XS_WATCH_EVENT: cpu FFFFFFFFFBC2BE80
    6    0       0    XS_READ: device/vif/0/state -> 1
    6    0       0    [ERROR] XS_READ: device/vif/0/type -> ENOENT
    ...
This makes the interactions immediately obvious. We can observe the Xen domain that's doing the request, the PID of the process (this only applies to dom0 control tools), the transaction ID, and the actual operations performed. This has already proven of use in several investigations.
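If you want to post-process this output, for instance to collect the operations belonging to one transaction, a few lines of Python will do. This is only a rough sketch: it assumes the four whitespace-separated columns shown above (DOM, PID, TX, then the operation text), and the names `SnoopRecord`, `parse_snoop_line` and `group_by_transaction` are made up for illustration; they are not part of xenstore-snoop.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class SnoopRecord(NamedTuple):
    dom: int
    pid: int
    tx: Optional[int]   # None when the TX column is "-" (watch events)
    error: bool         # True when the line carries an [ERROR] marker
    op: str             # e.g. "XS_READ"
    detail: str         # everything after "OP:"


def parse_snoop_line(line: str) -> Optional[SnoopRecord]:
    """Parse one output line; return None for the header, '...' and the prompt."""
    fields = line.split(None, 3)
    if len(fields) < 4 or not fields[0].isdigit():
        return None
    dom, pid, tx, rest = fields
    error = rest.startswith("[ERROR]")
    if error:
        rest = rest[len("[ERROR]"):].lstrip()
    op, _, detail = rest.partition(":")
    return SnoopRecord(int(dom), int(pid), None if tx == "-" else int(tx),
                       error, op, detail.strip())


def group_by_transaction(lines):
    """Group records by transaction ID.

    Assumption: TX 0 means "not inside a transaction", so those records
    (and watch events, which have no TX) are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        rec = parse_snoop_line(line)
        if rec is not None and rec.tx:
            groups[rec.tx].append(rec)
    return dict(groups)
```

Feeding it the first extract above would yield a single group, keyed by 930, holding the XS_RM and XS_MKDIR records.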
Of course this being DTrace, this is only part of the story. We can use these probes to correlate system behaviour: for example, xenstored transactions are currently rather heavyweight, as they involve copying a large file; these probes can help demonstrate this. Using Python's DTrace support, we can look at which stack traces in xend correspond to which requests to the store; and so on.
This feature, whilst relatively minor, is part of an ongoing plan to improve the observability and RAS of Xen and the solutions Sun are building on top of it. It's very important to us to bring Solaris's excellent observability features to the virtualization space: you've seen the work with zones in this area, and you can expect a lot more improvements for the Xen case too.
I meant to say: after my previous post, I resurrected #opensolaris-dev: if you'd like to talk about OpenSolaris development in a non-hostile environment, please join!
Recently I'm sad to say #opensolaris has become a really hostile, unpleasant place. I've seen new people arrive and be bullied by a small number of poisonous people until they went away (nice own goal, people!). So if anyone's looking for me for xVM stuff or whatever, I'll be in #onnv-scm or #solaris-xen as usual. And if you do so, please try to keep a civil tongue in your head - it's not hard.
Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain would segfault immediately. After much debugging and head-scratching, I eventually found the problem. On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen, this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.
On real hardware, %rcx and %r11 have specific meanings: the syscall instruction saves the return %rip in %rcx and the caller's %rflags in %r11. Prior to 3.1.1, Xen happened to maintain these values correctly, although the layout of the stack is very different from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each process was corrupted, and the process segfaulted almost immediately. We fixed the bug in Solaris, so that we would still work with 3.1.1. This was also fixed (restoring the original semantics) in Xen itself in time for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1) where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of the hypervisor ABI would have helped, I think.
Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE entries on 64-bit. This had some nasty implications, but first, some background.
On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64MB, and refuses to let any of the domains load a segment selector that allows read or write access to that part of the address space. Each domain kernel runs in ring 1 so can't get around this. On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions via the VM system. Since page table entries only have a single bit ("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot", the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could use the kernel page tables to write to kernel address-space.
Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost. To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel (running in ring 3, remember) could access its address space, it had to set PT_USER in its kernel page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that. Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace mapping. Thus, it also set PT_GLOBAL, a hardware feature - if such a bit is set, then a corresponding TLB entry is not flushed. This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace were no longer flushed.
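The rule is easier to see written down. Here's a toy model of the 3.0.3-and-later behaviour as described above; it is not Xen's code, and `adjust_guest_pte` is a name invented for this sketch. The bit positions are the architectural x86 ones (present is bit 0, user/supervisor is bit 2, global is bit 8).

```python
PT_VALID = 0x001     # bit 0: entry is present
PT_WRITABLE = 0x002  # bit 1: writable
PT_USER = 0x004      # bit 2: user/supervisor
PT_GLOBAL = 0x100    # bit 8: TLB entry is kept across a page table switch


def adjust_guest_pte(pte: int) -> int:
    """Toy model: how the hypervisor rewrites a 64-bit PV guest's PTE."""
    if not pte & PT_VALID:
        return pte
    if pte & PT_USER:
        # The guest says this is a userspace mapping, so let it stay in
        # the TLB when switching between the OS kernel and userspace.
        return pte | PT_GLOBAL
    # A kernel mapping: the guest kernel runs in ring 3, so the hardware
    # needs PT_USER set for the kernel to reach its own memory.
    return pte | PT_USER


# A kernel mapping written correctly (no PT_USER) is never made global:
kernel_pte = adjust_guest_pte(PT_VALID | PT_WRITABLE)
# A kernel mapping that wrongly carries PT_USER is treated as userspace
# and picks up PT_GLOBAL:
leaky_pte = adjust_guest_pte(PT_VALID | PT_WRITABLE | PT_USER)
```

In this model `kernel_pte` ends up with PT_USER but not PT_GLOBAL, whereas `leaky_pte` ends up with both, so its TLB entry would survive the switch to userspace.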
Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above, we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught things just right, a kernel mapping would still be present in the TLB when a user-space program was running, allowing userspace to read from the kernel! And indeed, some simple testing showed this:
    dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
        xargs -n 1 ~johnlev/bin/i386/readkern | \
        while read ln; do echo $ln::whatis | mdb -k ; done
With the above use of DTrace, MDB, and a little program that attempts to read addresses, we can see output such as:
    ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
    ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
    ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
    ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
    ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought. When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL set after all (big thanks to MDB's ::vatopfn, which made this easy to see).
Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations. Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken: it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again, the actual fix was simple.
By the way, that same putback also had the implementation of:
I'd been doing an awful lot of paging through ::threadlist output recently, and always having to jump through all the (usually irrelevant) taskq threads was driving me insane. So now you can just specify ::threadlist -t and get a much, much, shorter list.
You might be wondering if your machine is capable of running Windows or other operating systems under HVM. Joe Bonasera has a simple program you can run that will tell you. Alternatively, if you're already running with our bits, running 'virt-install' will tell you - if it asks you about creating a fully-virtualized domain, then it should work, and you can end up with a desktop like Russell Blaine's.
Nils, meanwhile, describes how we've improved the RAS of the hypervisor by integrating it with Solaris crash dumps here. This feature has saved our lives numerous times during development as those of us who've done the "hex dump" debugging thing know very well.
Of course, we're not done yet - we have bugs to fix and rough edges to smooth out, and we have significant features to implement. One of the major items we're working on in the near future is the upgrade to Xen 3.1.1 (or possibly 3.1.2, depending on timelines!). This will give us the ability to do live migration of HVM domains, along with a host of other features and improvements.
Bernd refers to an older method of auto-starting Xen domains used on Linux. In fact, this method has been replaced with the configuration parameters on_xend_start and on_xend_stop. Setting these can ensure that a Xen domain is cleanly shut down when the host (dom0) is shut down, and started automatically as needed. For somewhat obvious reasons, we'd like to have the same semantics as used with zones, if not quite the same implementation (yet, at least).
When I started looking at this, I realised that the community solution had some problems:
It seems obvious that by default I'd like my operating systems to shut down cleanly. Only in unusual circumstances would I be happy with an OS being unceremoniously destroyed. We modified our Xen gate to default to on_xend_stop=shutdown.
It is possible to specify on_xend_stop=suspend; this will save the running state to an image file and then destroy the domain (like xm save). However, there is no corresponding on_xend_start setting, nor any logic to ensure that the values match. This is both apparently useless and even dangerous, since starting a new domain but with old file-system state from a suspended domain could be problematic. We've disabled this functionality.
This was the biggest problem for us: as modelled, if somebody stops xend, then all the domains would be shut down. Similarly, if xend restarts for whatever reason (say, a hardware error), it would start domains again. We've modified this on Solaris. Instead of xend operating on these values, we introduce a new SMF service, system/xctl/domains, that auto-starts/stops domains as necessary. This service is pretty similar to system/zones. We've set up the dependencies such that a restart of the Xen daemons won't cause any running domains to be restarted. For this to work properly within the SMF framework, we also had to modify xend to wait for all domains to finish their state transitions.
You can find our changes here. And yes, we still need to take system/xctl/domains to PSARC.
You might be wondering how the dom0 even asks the guest domains to shut down cleanly. This is done via a xenstore entry, control/shutdown. The control tools write a string into this entry, which is being "watched" by the domain. The kernel then reads the value and responds appropriately (xen_shutdown()), triggering a user-space script via the sysevent framework. If nothing happens for a while, it's possible that the script couldn't run for whatever reason. In that case, we time-out and force a "dirty" shutdown from within the kernel.
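To make the sequence concrete, here's a toy simulation of that handshake. Everything in it is hypothetical: `ToyStore` and `ToyGuest` are stand-ins invented for this sketch, time is counted in abstract ticks rather than seconds, and the real kernel path (xen_shutdown() plus sysevents) is reduced to a callback that reports whether the user-space script managed to run. The request string "poweroff" is one of the values the control tools write.

```python
class ToyStore:
    """A tiny in-memory stand-in for xenstore with watches."""

    def __init__(self):
        self._data = {}
        self._watches = {}

    def watch(self, path, callback):
        self._watches.setdefault(path, []).append(callback)

    def read(self, path):
        return self._data.get(path)

    def write(self, path, value):
        self._data[path] = value
        for callback in self._watches.get(path, []):
            callback(path)  # fire the watch event


class ToyGuest:
    """A guest that watches control/shutdown, with a forced fallback."""

    def __init__(self, store, run_script, timeout_ticks=3):
        self.store = store
        self.run_script = run_script      # returns True if the script ran
        self.timeout_ticks = timeout_ticks
        self.state = "running"
        self._pending = None
        self._ticks_left = 0
        store.watch("control/shutdown", self._on_watch)

    def _on_watch(self, path):
        request = self.store.read(path)
        if not request:
            return
        if self.run_script(request):
            self.state = "clean-" + request
        else:
            # Nothing happened; start the timer for a "dirty" shutdown.
            self._pending = request
            self._ticks_left = self.timeout_ticks

    def tick(self):
        """Advance the simulated clock by one tick."""
        if self._pending is None:
            return
        self._ticks_left -= 1
        if self._ticks_left <= 0:
            self.state = "forced-" + self._pending
            self._pending = None
```

With a script that runs, `store.write("control/shutdown", "poweroff")` takes the guest straight to a clean shutdown; with one that never runs, the guest stays "running" until enough ticks have passed and then goes down forcibly.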
As you might expect, there's been a massive amount of change since the last OpenSolaris release. This time round, we are based on Xen 3.0.4 and build 66 of Nevada. As always, we'd love to hear about your experiences if you try it out, either on the mailing list or the IRC channel.
In many ways, the most significant change is the huge effort we've put in to stabilize our codebase; a significant number of potential hangs, crashes, and core dumps have been resolved, and we hope we're converging on a good-quality release. We've started looking seriously at performance issues, and filling in the implementation gaps. Since the last drop, notable improvements include:
We're still working hard on finishing things up for our phase 2 putback into Nevada (where "phase 1" was the separate dboot putback). As well as finishing this work, we're starting to look at further enhancements, in particular some features that are available in other vendors' implementations, such as a hypervisor-copy based networking device, blktap support, para-virtualized drivers for HVM domains (a huge performance fix), and more.
Here I'll talk a little about one of the more minor new features that has nonetheless proven very useful. The xm dump-core command generates an image file of a running domain. This file is a dump of all memory owned by the running domain, so it's somewhat similar to the standard Solaris crash dump files. However, dump-core does not require any interaction with the domain itself, so we can grab such dumps even if the domain is unable to create a crash dump via the normal method (typically, it hangs and can't be interacted with), or something else prevents use of the standard Solaris kernel debugging facilities such as kmdb (an in-kernel debugger isn't very useful if the console is broken).
However, this also means that we have no control over the format used by the image file. With Xen 3.0.4, it's rather basic and difficult to work with. This is much improved in Xen 3.1, but I haven't yet written the support for the new format.
To add support for debugging such image files of a Solaris domain, I modified mdb(1) to understand the format of the image file (the alternative, providing a conversion step, seemed unnecessarily awkward, and would have had to throw away information!). As you can see if you look around usr/src/cmd/mdb in the source drop, mdb(1) loads a module called mdb_kb when debugging such image files. This provides simple methods for reading data from the image file. For example, to read a particular virtual address, we need to use the contents of the domain's page tables in the image file to resolve it to a physical page, then look up the location of that page in the file. This differs considerably from how libkvm works with Solaris crash dumps: there, we have a big array of address translations, which is used directly, instead of the page table contents.
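As an illustration of the first half of that lookup, here's a rough sketch of a four-level amd64 page table walk. It is not the mdb_kb code: `va_to_ma`, `read_page` and `top_mfn` are names invented here, and `read_page(mfn)` is assumed to return the 4096 bytes of a given frame, however the image file happens to store them. The index widths and PTE bit positions are the architectural ones.

```python
import struct

PAGE_SHIFT = 12
PT_VALID = 0x1                    # bit 0: entry is present
PT_PAGESIZE = 0x80                # bit 7: large page (2MB or 1GB)
FRAME_MASK = 0x000FFFFFFFFFF000   # bits 12..51 hold the frame address
LEVEL_SHIFTS = (39, 30, 21, 12)   # top level first, 9 index bits per level


def va_to_ma(read_page, top_mfn, va):
    """Resolve a virtual address via the page tables rooted at top_mfn.

    Returns the resulting address, or None if the address is not mapped.
    """
    table_mfn = top_mfn
    for shift in LEVEL_SHIFTS:
        index = (va >> shift) & 0x1FF
        (entry,) = struct.unpack_from("<Q", read_page(table_mfn), index * 8)
        if not entry & PT_VALID:
            return None
        base = entry & FRAME_MASK
        is_large = bool(entry & PT_PAGESIZE) and shift in (30, 21)
        if shift == PAGE_SHIFT or is_large:
            # Leaf entry: combine the page base with the offset bits.
            offset_mask = (1 << shift) - 1
            return (base & ~offset_mask) | (va & offset_mask)
        table_mfn = base >> PAGE_SHIFT
    return None
```

The second half, finding where that page lives in the image file, depends entirely on the file format, so it's left out here.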
In most other respects, debugging a kernel domain image is much the same as a crash dump:
    # xm dump-core solaris-domu core.domu
    # mdb core.domu
    mdb: warning: dump is from SunOS 5.11 onnv-johnlev; dcmds and macros may not match kernel implementation
    Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs ... sppp ptm crypto md fcip logindmux nfs ]
    > ::status
    debugging domain crash dump core.domu (64-bit) from sxc16
    operating system: 5.11 onnv-johnlev (i86pc)
    > ::cpuinfo
     ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH    THREAD           PROC
      0 fffffffffbc4b7f0  1b   40    9 169  yes   yes t-1408926 ffffff00010bfc80 sched
    > ::evtchns
    Type        Evtchn IRQ IPL CPU ISR(s)
    evtchn      1      257 1   0   xenbus_intr
    evtchn      2      260 9   0   xenconsintr
    virq:debug  3      256 15  0   xen_debug_handler
    virq:timer  4      258 14  0   cbe_fire
    evtchn      5      259 5   0   xdf_intr
    evtchn      6      261 6   0   xnf_intr
    evtchn      7      262 6   0   xnf_intr
    > ::cpustack -c 0
    cbe_fire+0x5c()
    av_dispatch_autovect+0x8c(102)
    dispatch_hilevel+0x1f(102, 0)
    switch_sp_and_call+0x13()
    do_interrupt+0x11d(ffffff00010bfaf0, fffffffffbc86f98)
    xen_callback_handler+0x42b(ffffff00010bfaf0, fffffffffbc86f98)
    xen_callback+0x194()
    av_dispatch_softvect+0x79(a)
    dispatch_softint+0x38(9, 0)
    switch_sp_and_call+0x13()
    dosoftint+0x59(ffffff0001593520)
    do_interrupt+0x140(ffffff0001593520, fffffffffbc86048)
    xen_callback_handler+0x42b(ffffff0001593520, fffffffffbc86048)
    xen_callback+0x194()
    sti+0x86()
    _sys_rtt_ints_disabled+8()
    intr_restore+0xf1()
    disp_lock_exit+0x78(fffffffffbd1b358)
    turnstile_wakeup+0x16e(fffffffec33a64d8, 0, 1, 0)
    mutex_vector_exit+0x6a(fffffffec13b7ad0)
    xenconswput+0x64(fffffffec42cb658, fffffffecd6935a0)
    putnext+0x2f1(fffffffec42cb3b0, fffffffecd6935a0)
    ldtermrmsg+0x235(fffffffec42cb2b8, fffffffec3480300)
    ldtermrput+0x43c(fffffffec42cb2b8, fffffffec3480300)
    putnext+0x2f1(fffffffec42cb560, fffffffec3480300)
    xenconsrsrv+0x32(fffffffec42cb560)
    runservice+0x59(fffffffec42cb560)
    queue_service+0x57(fffffffec42cb560)
    stream_service+0xdc(fffffffec42d87b0)
    taskq_d_thread+0xc6(fffffffec46ac8d0)
    thread_start+8()
Note that both ::cpustack and ::cpuregs are capable of using the actual register set at the time of the dump (since the hypervisor needs to store this for scheduling purposes). You can also see the ::evtchns dcmd in action here; this is invaluable for debugging interrupt problems (and we've fixed a lot of those over the past year or so!).
Currently, mdb_kb only has support for image files of para-virtualized Solaris domains. However, that's not the only interesting target: in particular, we could support mdb in live crash dump mode against a running Solaris domain, which opens up all sorts of interesting debugging possibilities. With a small tweak to Solaris, we can support debugging of fully-virtualized Solaris instances. It's not even impossible to imagine adding Linux kernel support to mdb(1), though it's hard to imagine there would be a large audience for such a feature...
As is the case with the other providers people have worked on such as Ruby and Perl, there are two simple probes for function entry and function exit. arg0 contains the filename, arg1 the function name, and arg2 has the line number. So given this simple script to trace the functions called by a particular function invocation, restricted to a given module name:
    #!/usr/sbin/dtrace -ZCs

    #pragma D option quiet

    python$target:::function-entry
    /copyinstr(arg1) == $2 && strstr(copyinstr(arg0), $1) != NULL/
    {
            self->trace = 1;
    }

    python$target:::function-return
    /copyinstr(arg1) == $2 && strstr(copyinstr(arg0), $1) != NULL/
    {
            self->trace = 0;
    }

    python$target:::function-entry,python$target:::function-return
    /self->trace && strstr(copyinstr(arg0), $3) != NULL/
    {
            printf("%s %s (%s:%d)\n", probename == "function-entry" ? "->" : "<-",
                copyinstr(arg1), copyinstr(arg0), arg2);
    }
We can run it as follows and get some useful results:
    # ./pytrace.d \"hg.py\" \"clone\" \"mercurial\" -c 'hg clone /tmp/test.hg'
    -> clone (build/proto/lib/python/mercurial/hg.py:65)
    -> repository (build/proto/lib/python/mercurial/hg.py:54)
    -> _lookup (build/proto/lib/python/mercurial/hg.py:31)
    -> _local (build/proto/lib/python/mercurial/hg.py:16)
    -> __getattribute__ (build/proto/lib/python/mercurial/demandload.py:56)
    -> module (build/proto/lib/python/mercurial/demandload.py:53)
    ...
Of course, this being DTrace, we can tie all of this into general system activity as usual. I also added "ustack helper" support. This is significantly more tricky to implement, but enormously useful for following the path of Python code. For example, imagine we want to look at what's causing write()s in the clone operation above. As usual:
    #!/usr/sbin/dtrace -Zs

    syscall::write:entry
    /pid == $target/
    {
            @[jstack(20)] = count();
    }

    END
    {
            trunc(@, 2);
    }
Note that we're using jstack() to make sure we have enough space allocated for the stack strings reported. Now as well as the C stack, we can see what Python functions are involved in the user stack trace:
    # ./writes.d -c 'hg clone /tmp/test.hg'
    ...
                  libc.so.1`_write+0x15
                  libc.so.1`_fflush_u+0x36
                  libc.so.1`fflush+0x43
                  libpython2.4.so.1.0`file_flush+0x2a
                  libpython2.4.so.1.0`call_function+0x32a
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/transaction.py:49 (add) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/revlog.py:1137 (addgroup) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/localrepo.py:1849 (addchangegroup) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/localrepo.py:1345 (pull) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  148

                  libc.so.1`_write+0x15
                  libc.so.1`_fflush_u+0x36
                  libc.so.1`fclose+0x6e
                  libpython2.4.so.1.0`file_dealloc+0x36
                  libpython2.4.so.1.0`frame_dealloc+0x65
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x75c
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/localrepo.py:1849 (addchangegroup) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/localrepo.py:1345 (pull) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  libpython2.4.so.1.0`PyEval_EvalFrame+0xbdf
                    [ build/proto/lib/python/mercurial/localrepo.py:1957 (clone) ]
                  libpython2.4.so.1.0`PyEval_EvalCodeEx+0x732
                  libpython2.4.so.1.0`fast_function+0x112
                  libpython2.4.so.1.0`call_function+0xda
                  148
As anyone who's come across the Java dtrace helper source will know, creating a ustack helper is rather a black art.
When a ustack helper is present, it is called in-kernel for each entry in a stack when the ustack() action occurs (source). The D instructions in the helper action are executed such that the final string value is taken as the result of the helper. Typically for Java, there is no associated C function symbol for the PC value at that point in the stack, so the result of the helper is used directly in the stack trace. However, this is not true for Python, so that's why you see a different format above: the normal stack entry, plus the result of the helper in annotated form where it returned a result (in square brackets).
The helper is given two arguments: arg0 is the PC value of the stack entry, and arg1 is the frame pointer. The helper is expected to construct a meaningful string from just those values. In Python, the PyEval_EvalFrame function always has a PyFrameObject * as one of its arguments. By having the helper look at this pointer value and dig around the structures, we can find pointers to the strings containing the file name and function, as well as the line number. We can copy these strings in, and, using alloca() to give ourselves some scratch space, build up the annotation string you see above.
Debugging a helper isn't particularly easy, since it lives and runs in probe context. You can use mdb's DTrace debugging facilities to find out what happened, and some careful mapping between the failing D instructions and the helper source can pinpoint the problem. Using this method it was relatively easy to get a working helper for x86 32-bit. Both SPARC and x86 64-bit proved more troublesome though. The problems were both related to the need to find the PyFrameObject * given the frame pointer. On amd64, the function we needed to trace was passing the arguments in registers, as defined architecturally, so the argument wasn't accessible on the stack via the frame pointer. On SPARC, the pointer we need was stored in a register that was subsequently re-used as a scratch register. Both problems were solved, rather cheesily, by modifying the way the function was called.
For dom0 to be able to boot a para-virtualised domU, it needs to be able to bootstrap it. In particular, it needs to be able to read the kernel file and its associated ramdisk so it can hand off control to the kernel's entry point when the domain is created. And we must somehow make these files accessible in the dom0. Previously, you had to somehow copy out the files from the domU filesystem into dom0. This was often difficult (consider getting files off an ext2 filesystem in a Solaris dom0), and was obviously prone to errors such as forgetting to update the copies when upgrading the kernel.
For a while now Xen has had support for a bootloader. This runs in userspace and is responsible for copying out the files (those specified by kernel and ramdisk in the domain's config file) to a temporary directory in dom0; the files are then passed on to the domain builder. Xen has shipped with a bootloader called pygrub. Whilst somewhat confusingly named, it essentially emulated the grub menu. It had backends for a couple of Linux filesystems written in Python and worked by searching for a grub.conf file, then presenting a lookalike grub menu for the user to interact with. When an entry was selected, the specified files would be read off the filesystem and passed back to the builder.
This worked reasonably well for Linux, but we felt there were a number of problems. First, the interactive menu only worked for first boot; subsequent reboots would automatically choose an entry without allowing user interaction (though this is now fixed in xen-unstable). Its interactive nature seemed quite a stumbling block for things like remote domain management; you really don't want to babysit domain creation. Also, the implementation of the filesystem backends wasn't ideal; there was only limited Linux filesystem support, and it didn't work very well.
We've adapted pygrub to help with some of these issues. First, we replaced the filesystem code with a C library called libfsimage. The intention here is to provide a stable API for accessing filesystem images from userspace. Thus it provides a simple interface for reading files from a filesystem image and a plugin architecture to provide the filesystem support. This plugin API is also stable, allowing filesystems past, present and future to be transparently supported. Currently there are plugins for ext2, reiserfs, ufs and iso9660, and we expect to have a zfs plugin soon. We borrowed the grub code for all of these plugins to simplify the implementation, but the API allows for any implementation.
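To give a flavour of the plugin idea, here's a toy sketch of the "try each plugin until one recognises the image" step. It is not libfsimage's API: `PLUGINS`, `pick_plugin` and the probe functions are invented for this example, and only detection is shown, not reading files. The magic numbers are the standard on-disk ones: ext2 keeps 0xEF53 (little-endian) at byte 1080, and iso9660 keeps the string "CD001" at byte 32769.

```python
import struct


def _probe_ext2(image: bytes) -> bool:
    # The ext2 superblock starts at byte 1024; its magic is at offset 56.
    return len(image) >= 1082 and struct.unpack_from("<H", image, 1080)[0] == 0xEF53


def _probe_iso9660(image: bytes) -> bool:
    # The primary volume descriptor sits in sector 16 (2048-byte sectors);
    # the identifier "CD001" follows its one-byte type field.
    return image[32769:32774] == b"CD001"


# Hypothetical registry: each plugin is a (name, probe function) pair.
PLUGINS = [
    ("ext2", _probe_ext2),
    ("iso9660", _probe_iso9660),
]


def pick_plugin(image: bytes):
    """Return the name of the first plugin that recognises the image."""
    for name, probe in PLUGINS:
        if probe(image):
            return name
    return None
```

Supporting another filesystem then means adding another entry to the registry, without touching the callers, which is the point of keeping the plugin interface stable.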
Some people were suggesting solutions involving loopback mounts. This was problematic for us for two main reasons. First, filesystem support in the different dom0 OS's is far from complete; for example, Solaris has no ext2 support, and Linux has no (real) ZFS support. Second, and more seriously, it exposes a significant gap in terms of isolation: the dom0 kernel FS code must be entirely resilient against a corrupt domU filesystem image. If we are to consider domU's as untrusted, it doesn't make sense to leave this open as an attack vector.
Another simple change we made was to allow operation without a grub.conf at all. You can specify a kernel and ramdisk and make pygrub automatically load them from the domU filesystem. Even easier, you can leave out all configuration altogether, and a Solaris domU will automatically boot the correct kernel and ramdisk. This makes setting up your config for a domU much easier.
pygrub understands both fdisk partitions and Solaris slices, so simply specifying the disk will cause the bootloader to look for the root slice and grab the right files to boot.
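For the curious, the fdisk half of that is simple enough to sketch. The following is an illustration only, not pygrub's code: `Partition`, `parse_fdisk` and `find_solaris_partition` are names invented here, and locating the root slice inside the Solaris partition (via its VTOC) is left out. The layout is the standard one: four 16-byte entries starting at byte 446 of the first sector, with the 0x55 0xAA signature at byte 510. Partition type 0xBF is Solaris2; older installs use 0x82, which is also Linux swap.

```python
import struct
from typing import List, NamedTuple, Optional

SECTOR_SIZE = 512
SOLARIS_TYPES = (0xBF, 0x82)  # Solaris2, and the older (ambiguous) 0x82


class Partition(NamedTuple):
    index: int
    active: bool
    ptype: int
    lba_start: int
    sectors: int

    @property
    def start_byte(self) -> int:
        return self.lba_start * SECTOR_SIZE


def parse_fdisk(sector0: bytes) -> List[Partition]:
    """Parse the four primary partition entries from the first sector."""
    if len(sector0) < SECTOR_SIZE or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no fdisk signature found")
    parts = []
    for i in range(4):
        entry = sector0[446 + 16 * i:446 + 16 * (i + 1)]
        # status, (CHS start skipped), type, (CHS end skipped), LBA, size
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:  # type 0 marks an unused entry
            parts.append(Partition(i, status == 0x80, ptype, lba_start, sectors))
    return parts


def find_solaris_partition(parts: List[Partition]) -> Optional[Partition]:
    """Return the first partition whose type looks like Solaris."""
    for part in parts:
        if part.ptype in SOLARIS_TYPES:
            return part
    return None
```

Given the first 512 bytes of a disk image, `find_solaris_partition(parse_fdisk(data))` gives the partition whose `start_byte` is where the search for slices would begin.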
There's more work we can do yet, of course.