Solaris Xen update


After an undesirably long time, I'm happy to say that another drop of Solaris on Xen is available here. Sources and other sundry parts are here. Documentation can be found at our community site, and you can read Chris Beal describe how to get started with the new bits.

As you might expect, there's been a massive amount of change since the last OpenSolaris release. This time round, we are based on Xen 3.0.4 and build 66 of Nevada. As always, we'd love to hear about your experiences if you try it out, either on the mailing list or the IRC channel.

In many ways, the most significant change is the huge effort we've put in to stabilize our codebase; a significant number of potential hangs, crashes, and core dumps have been resolved, and we hope we're converging on a good-quality release. We've started looking seriously at performance issues, and filling in the implementation gaps. Since the last drop, notable improvements include:

PAE support
By default, we now use PAE mode on 32-bit, aiding compatibility with other domain 0 implementations; we can also boot under either PAE or non-PAE, if the Xen version has 'bi-modal' support. This has probably been the most-requested feature missing from our last release.
HVM support
If you have the right CPU, you can now run fully-virtualized domains such as Windows using a Solaris dom0! Whilst more work is needed here, this does seem to work pretty well already. Mark Johnson has some useful tips on using HVM domains.
New management tools
We have integrated the virt- suite of management tools. virt-manager provides a simple GUI for controlling guest domains on a single host. virt-install and virsh are simple CLIs for installing and managing guest domains respectively. Note that parts of these tools are pre-alpha, and we still have a significant amount of work to do on them. Nonetheless, we appreciate any comments...
PV framebuffer
Solaris dom0 now supports the SDL-based paravirt framebuffer backend, which can be used with domUs that have PV framebuffer support.
Virtual NIC support
The Ethernet bridge used in the previous release has been replaced with virtual NICs from the Crossbow project. This enables future work around smart NICs, resource controls, and more.
Simplified Solaris guest domain install
It's now easy to install a new Solaris guest domain using the DVD ISO. The temporary tool in the last release, vbdcfg, has disappeared now as a result. William Kucharski has a walk-through.
Better SMF usage
Several of the xend configuration properties are now controlled using the SMF framework.
Managed domain support
We now support xend-managed domain configurations instead of using .py configuration files. Certain parts of this don't work too well yet (unfortunately all versions of Xen have similar problems), but we are plugging the gaps here one by one.
Memory ballooning support
Otherwise known as support for dynamic xm mem-set, this allows much greater flexibility in partitioning the physical memory on a host amongst the guest domains. Ryan Scott has more details.
Vastly improved debugging support
Crash dump analysis and debugging tools have always been a critical feature for Solaris developers. With this release, we can use Solaris tools to debug both hypervisor crashes and problems with guest domains. I talk a little bit about the latter feature below.
xvbdb has been renamed
To simply be xdb. This was a very exciting change for certain members of our team.

We're still working hard on finishing things up for our phase 2 putback into Nevada (where "phase 1" was the separate dboot putback). As well as finishing this work, we're starting to look at further enhancements, in particular some features that are available in other vendors' implementations, such as a hypervisor-copy based networking device, blktap support, para-virtualized drivers for HVM domains (a huge performance fix), and more.

Debugging guest domains

Here I'll talk a little about one of the more minor new features that has nonetheless proven very useful. The xm dump-core command generates an image file of a running domain. This file is a dump of all memory owned by the running domain, so it's somewhat similar to the standard Solaris crash dump files. However, dump-core does not require any interaction with the domain itself, so we can grab such dumps even if the domain is unable to create a crash dump via the normal method (typically, it hangs and can't be interacted with), or something else prevents use of the standard Solaris kernel debugging facilities such as kmdb (an in-kernel debugger isn't very useful if the console is broken).

However, this also means that we have no control over the format used by the image file. With Xen 3.0.4, it's rather basic and difficult to work with. This is much improved in Xen 3.1, but I haven't yet written the support for the new format.

To add support for debugging such image files of a Solaris domain, I modified mdb(1) to understand the format of the image file (the alternative, providing a conversion step, seemed unnecessarily awkward, and would have had to throw away information!). As you can see if you look around usr/src/cmd/mdb in the source drop, mdb(1) loads a module called mdb_kb when debugging such image files. This provides simple methods for reading data from the image file. For example, to read a particular virtual address, we need to use the contents of the domain's page tables in the image file to resolve it to a physical page, then look up the location of that page in the file. This differs considerably from how libkvm works with Solaris crash dumps: there, we have a big array of address translations, which is used directly, instead of the page table contents.
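To make the address resolution described above a little more concrete, here is a minimal Python sketch of the idea. It is not the mdb_kb code, and it does not parse the real Xen image format: ToyImage, its frame_to_offset index, build_demo, and the frame numbers are all hypothetical, invented for illustration. It assumes 64-bit, 4-level x86 page tables with 4K pages only (no large pages), and it does not model the distinction a para-virtualized domain makes between pseudo-physical and machine frames.

```python
import struct

PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_BITS = 9                       # 512 entries per table
LEVELS = 4                           # PML4, PDPT, PD, PT
PTE_PRESENT = 0x1
PTE_FRAME_MASK = 0x000FFFFFFFFFF000  # bits 12..51 of an entry hold the frame address


class ToyImage:
    """Hypothetical stand-in for a domain image: raw pages plus a frame index."""

    def __init__(self):
        self.data = bytearray()
        self.frame_to_offset = {}    # frame number -> byte offset into self.data

    def add_page(self, frame, contents=b""):
        self.frame_to_offset[frame] = len(self.data)
        self.data += contents.ljust(PAGE_SIZE, b"\0")

    def write_pte(self, table_frame, index, target_frame):
        offset = self.frame_to_offset[table_frame] + index * 8
        entry = (target_frame << PAGE_SHIFT) | PTE_PRESENT
        struct.pack_into("<Q", self.data, offset, entry)

    def read_phys(self, paddr, size):
        """Read bytes at a physical address (must not cross a page boundary)."""
        frame, off = paddr >> PAGE_SHIFT, paddr & (PAGE_SIZE - 1)
        base = self.frame_to_offset[frame]   # KeyError if the page is not in the image
        return bytes(self.data[base + off:base + off + size])


def virt_to_phys(image, root_frame, vaddr):
    """Walk the page tables stored in the image to resolve a virtual address."""
    table = root_frame
    for level in range(LEVELS - 1, -1, -1):
        index = (vaddr >> (PAGE_SHIFT + INDEX_BITS * level)) & ((1 << INDEX_BITS) - 1)
        (pte,) = struct.unpack("<Q", image.read_phys((table << PAGE_SHIFT) + index * 8, 8))
        if not pte & PTE_PRESENT:
            raise LookupError("no mapping for %#x at level %d" % (vaddr, level))
        table = (pte & PTE_FRAME_MASK) >> PAGE_SHIFT
    return (table << PAGE_SHIFT) | (vaddr & (PAGE_SIZE - 1))


def build_demo():
    """Build a tiny image: four page-table pages and one data page (frame 42)."""
    image = ToyImage()
    for frame in (10, 11, 12, 13):
        image.add_page(frame)
    image.add_page(42, b"\0" * 0x10 + b"hello")
    image.write_pte(10, 1, 11)   # PML4[1] -> PDPT in frame 11
    image.write_pte(11, 2, 12)   # PDPT[2] -> PD in frame 12
    image.write_pte(12, 3, 13)   # PD[3]   -> PT in frame 13
    image.write_pte(13, 4, 42)   # PT[4]   -> data page in frame 42
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x10
    return image, vaddr


image, vaddr = build_demo()
paddr = virt_to_phys(image, 10, vaddr)
print(hex(paddr), image.read_phys(paddr, 5))   # 0x2a010 b'hello'
```

Note that every step of the walk goes through read_phys, so each page-table page is itself located via the frame index before its entry can be read. That is the essential difference from the libkvm case, where a ready-made translation array can be consulted directly.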

In most other respects, debugging a kernel domain image is much the same as a crash dump:

# xm dump-core solaris-domu core.domu
# mdb core.domu
mdb: warning: dump is from SunOS 5.11 onnv-johnlev; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs ... sppp ptm crypto md fcip logindmux nfs ]
> ::status
debugging domain crash dump core.domu (64-bit) from sxc16
operating system: 5.11 onnv-johnlev (i86pc)
> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc4b7f0  1b   40    9 169  yes   yes t-1408926 ffffff00010bfc80 sched
> ::evtchns
Type          Evtchn IRQ IPL CPU ISR(s)
evtchn        1      257 1   0   xenbus_intr
evtchn        2      260 9   0   xenconsintr
virq:debug    3      256 15  0   xen_debug_handler
virq:timer    4      258 14  0   cbe_fire
evtchn        5      259 5   0   xdf_intr
evtchn        6      261 6   0   xnf_intr
evtchn        7      262 6   0   xnf_intr
> ::cpustack -c 0
cbe_fire+0x5c()
av_dispatch_autovect+0x8c(102)
dispatch_hilevel+0x1f(102, 0)
switch_sp_and_call+0x13()
do_interrupt+0x11d(ffffff00010bfaf0, fffffffffbc86f98)
xen_callback_handler+0x42b(ffffff00010bfaf0, fffffffffbc86f98)
xen_callback+0x194()
av_dispatch_softvect+0x79(a)
dispatch_softint+0x38(9, 0)
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff0001593520)
do_interrupt+0x140(ffffff0001593520, fffffffffbc86048)
xen_callback_handler+0x42b(ffffff0001593520, fffffffffbc86048)
xen_callback+0x194()
sti+0x86()
_sys_rtt_ints_disabled+8()
intr_restore+0xf1()
disp_lock_exit+0x78(fffffffffbd1b358)
turnstile_wakeup+0x16e(fffffffec33a64d8, 0, 1, 0)
mutex_vector_exit+0x6a(fffffffec13b7ad0)
xenconswput+0x64(fffffffec42cb658, fffffffecd6935a0)
putnext+0x2f1(fffffffec42cb3b0, fffffffecd6935a0)
ldtermrmsg+0x235(fffffffec42cb2b8, fffffffec3480300)
ldtermrput+0x43c(fffffffec42cb2b8, fffffffec3480300)
putnext+0x2f1(fffffffec42cb560, fffffffec3480300)
xenconsrsrv+0x32(fffffffec42cb560)
runservice+0x59(fffffffec42cb560)
queue_service+0x57(fffffffec42cb560)
stream_service+0xdc(fffffffec42d87b0)
taskq_d_thread+0xc6(fffffffec46ac8d0)
thread_start+8()

Note that both ::cpustack and ::cpuregs are capable of using the actual register set at the time of the dump (since the hypervisor needs to store this for scheduling purposes). You can also see the ::evtchns dcmd in action here; this is invaluable for debugging interrupt problems (and we've fixed a lot of those over the past year or so!).

Currently, mdb_kb only has support for image files of para-virtualized Solaris domains. However, that's not the only interesting target: in particular, we could support mdb in live crash dump mode against a running Solaris domain, which opens up all sorts of interesting debugging possibilities. With a small tweak to Solaris, we can support debugging of fully-virtualized Solaris instances. It's not even impossible to imagine adding Linux kernel support to mdb(1), though it's hard to imagine there would be a large audience for such a feature...
