A brief tour of i86xen

Feb 14, 2006
In this post, I'm going to give a quick walk through the major changes we've made so far in doing our port of Solaris to the Xen "platform". As we've only supplied a tarball of the source tree so far, I can't hyperlink to the relevant bits - sorry about that. As our code is still under heavy development, you can expect some of this code organisation to change significantly; nonetheless I thought this might be useful for those interested in peeking into the internals of what we've done so far.

As you might expect, the vast majority of the changes we've made reside in the kernel. To support booting Solaris under Xen (both domU and dom0, though as we've said the latter is still in the very early stages of development), we've introduced a new platform based on i86pc called i86xen. Wherever possible, we've tried to share common code by using i86pc's sources. There's still some cleanup we can do in this area.

Within usr/src/uts/i86xen, there are a number of Xen-specific source files:

io/psm/
Contains the PSM ("Platform-Specific Module") for Xen. This mirrors the PSM provided by i86pc, but deals with the hypervisor-provided features such as the clock timer and the events system.
io/xendev/
This contains the virtual root nexus driver "xendev". All of the virtual frontend drivers are connected to this.
io/xvbd/
The virtual block driver. It's currently non-functional with the version of Xen we're working with; we're working hard on getting it functional.
os/
The guts of the kernel/hypervisor code. Amongst other things, it provides interfaces for dealing with events in evtchn.c and hypervisor_machdep.c (the hypervisor version of virtual interrupts, which hook into Solaris's standard interrupt system), the grant table in gnttab.c (used for providing access/transfer of pages between frontend and backend), suspend/resume in xen_machdep.c, and support routines for the debugger and the MMU code (mach_kdi.c and xen_mmu.c respectively).

As mentioned we use the i86pc code where possible, occasionally using #ifdefs where minor differences are found. In particular we re-use the i86pc HAT (MMU management) code found in i86pc/vm. You can also find code for the new boot method described by Joe Bonasera in i86pc/dboot and i86pc/boot.

A number of drivers that are needed by Xen but aren't i86xen specific live under usr/src/uts/common:

common/io/xenbus_*.c common/io/xenbus/
"xenbus" is a simple transport for configuration data provided by domain0; for example, it provides a node control/shutdown which will notify the domainU that the user has requested the domain to be shutdown (or suspended) from domain0's management tools. This code provides this support.
common/io/xencons/
The virtual console frontend driver.
common/io/xennetf/
The virtual net device frontend driver.

As you might expect, the userspace changes we've needed to make so far have been reasonably minimal. Despite supporting the new i86xen platform definition, the only significant changes have been to usr/src/cmd/mdb/, where we've added some changes to better support debugging of the Xen-style x86 MMU.


Live migration of Solaris instances

Feb 14, 2006
Today we released our current source tree for our Solaris Xen port; for more details and the downloads see the Xen community on OpenSolaris.

One of the most useful features of Xen is its ability to package up a running OS instance (in Xen terminology, a "domainU", where "U" stands for "unprivileged"), plus all of its state, and take it offline, to be resumed at a later time. Recently we performed the first successful live migration of a running Solaris instance between two machines. In this blog I'll cover the various ways you can do this.

Para-virtualisation of the MMU

Typical "full virtualisation" uses a method known as "shadow page tables", whereby two sets of pagetables are maintained: the guest domain's set, which aren't visible to the hardware via cr3, and page tables visible to the hardware which are maintained by the hypervisor. As only the hypervisor can control the page tables the hardware uses to resolve TLB misses, it can maintain the virtualisation of the address space by copying and validating any changes the guest domain makes to its copies into the "real" page tables.

All these duplicate pages come at a cost, of course. A para-virtualisation approach (that is, one where the guest domain is aware of the virtualisation and complicit in operating within the hypervisor) can take a different tack. In Xen, the guest domain is made aware of a two-level address system. The domain is presented with a linear set of "pseudo-physical" addresses comprising the physical memory allocated to the domain, as well as the "machine" addresses for each corresponding page. The machine address for a page is what's used in the page tables (that is, it's the real hardware address). Two tables are used to map between pseudo-physical and machine addresses. Allowing the guest domain to see the real machine address for a page provides a number of benefits, but slightly complicates things, as we'll see.
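To make the two-table idea concrete, here is a minimal sketch in C. The table sizes, the frame numbers and the function names are all invented for illustration; the real tables are set up between the hypervisor and the domain and are far larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES        4            /* pages owned by this toy domain */
#define MACHINE_PAGES 16           /* size of the toy machine */
#define INVALID_FRAME UINT32_MAX

/* pseudo-physical frame -> machine frame: one entry per domain page */
static uint32_t p2m[NPAGES] = { 9, 3, 12, 7 };
/* machine frame -> pseudo-physical frame: one entry per machine page */
static uint32_t m2p[MACHINE_PAGES];

/* Derive the reverse table from the forward one. */
void build_m2p(void)
{
    for (size_t m = 0; m < MACHINE_PAGES; m++)
        m2p[m] = INVALID_FRAME;
    for (uint32_t p = 0; p < NPAGES; p++) {
        assert(p2m[p] < MACHINE_PAGES);
        m2p[p2m[p]] = p;
    }
}

uint32_t pfn_to_mfn(uint32_t pfn)
{
    return (pfn < NPAGES) ? p2m[pfn] : INVALID_FRAME;
}

uint32_t mfn_to_pfn(uint32_t mfn)
{
    return (mfn < MACHINE_PAGES) ? m2p[mfn] : INVALID_FRAME;
}
```

After build_m2p(), pfn_to_mfn(2) gives machine frame 12 and mfn_to_pfn(12) gives back pseudo-physical frame 2; a machine frame the domain doesn't own maps to INVALID_FRAME.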

Save/Restore

The simplest form of "packaging" a domain is suspending it to a file in the controlling domain (a privileged OS instance known as "domain 0"). A running domain can be taken offline via an xm save command, then restored at a later time with xm restore, without having to go through a reboot cycle - the domain state is fully restored.

xm save xen-7 /tmp/domain.img

An xm save notifies the domain to suspend itself. This arrives via the xenbus watch system on the node control/shutdown, and is handled via xen_suspend_domain(). This is actually remarkably simple. First we leverage Solaris's existing suspend/resume subsystem, CPR, to iterate through the devices attached to the domain's device nexus. This calls each of the virtual drivers we use (the network, console, and block device frontends) with a DDI_SUSPEND argument. The virtual console, for example, simply removes its interrupt handler in xenconsdetach(). As a guest domain, this tears down the Xen event channel used to communicate with the console backend. The rest of the suspend code deals with tearing down some of the things we use to communicate with the hypervisor and domain 0, such as the grant table mappings. Additionally we convert a couple of stored MFN (the frame numbers of machine addresses) values into pseudo-physical PFNs. This is because the MFNs are free to change when we restore the guest domain; as the PFNs aren't "real", they will stay the same. Finally we call HYPERVISOR_suspend() to call into the hypervisor and tell it we're ready to be suspended.
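As a rough illustration of the first step, the sketch below maps the string read from control/shutdown to an action. It is not the code in our tree: the enum and function names are hypothetical, and the strings ("suspend", "poweroff", "halt", "reboot") are my understanding of what the Xen tools write to that node, so treat them as an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical action codes for this sketch. */
typedef enum {
    ACTION_NONE,
    ACTION_SUSPEND,
    ACTION_POWEROFF,
    ACTION_REBOOT
} shutdown_action_t;

/* Map the string read from control/shutdown to an action. */
shutdown_action_t parse_shutdown_request(const char *value)
{
    assert(value != NULL);
    if (strcmp(value, "suspend") == 0)
        return ACTION_SUSPEND;          /* xm save, xm migrate */
    if (strcmp(value, "poweroff") == 0 || strcmp(value, "halt") == 0)
        return ACTION_POWEROFF;
    if (strcmp(value, "reboot") == 0)
        return ACTION_REBOOT;
    return ACTION_NONE;                 /* empty or unknown: nothing to do */
}
```

A watch handler would call something like this whenever the node changes, and start the suspend sequence described above only for ACTION_SUSPEND.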

Now the domain 0 management tools are ready to checkpoint the domain to the file we specified in the xm save command. Despite the name, this is done via xc_linux_save(). Its main task is to convert any MFN values that the domain still has into PFN values, then write all its pages to the disk. These MFN values are stored in two main places; the PFN->MFN mapping table managed by the domain, and the actual pages of the page tables.

During boot, we identified which pages store the PFN->MFN table (see xen_relocate_start_info()), and pointed to that structure in the "shared info" structure, which is shared between the domain and the hypervisor. This is used to map the table in xc_linux_save().

The hypervisor keeps track of which pages are being used as page tables. Thus, after domain 0 has mapped the guest domain's pages, we write out the page contents, but modify any pages that are identified as page tables. This is handled by canonicalize_pagetable(); this routine replaces all PTE entries that contain MFNs with the corresponding PFN value.

There are a couple of other things that need to be fixed too, such as the GDT.

xm restore /tmp/domain.img

Restoring a domain is essentially the reverse operation: the data for each page is written into one of the machine addresses reserved for the "new" domain; if we're writing a saved page table, we replace each PTE's PFN value with the new MFN value used by the new instance of the domain.
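The page table rewriting on save, and its reverse on restore, can be sketched as follows. This is a toy and not the Xen tools' canonicalize_pagetable(): the PTE layout (flags in the low 12 bits, frame number above them), the function names and the table arguments are simplifications chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT   0x1ULL
#define PAGE_SHIFT    12
#define PTE_FLAG_MASK ((1ULL << PAGE_SHIFT) - 1)

/* Rewrite the frame number of every present PTE using the given table. */
static void remap_ptes(uint64_t *ptes, size_t nptes,
    const uint64_t *table, size_t table_len)
{
    for (size_t i = 0; i < nptes; i++) {
        if (!(ptes[i] & PTE_PRESENT))
            continue;           /* not-present entries hold no frame */
        uint64_t frame = ptes[i] >> PAGE_SHIFT;
        assert(frame < table_len);
        ptes[i] = (table[frame] << PAGE_SHIFT) | (ptes[i] & PTE_FLAG_MASK);
    }
}

/* Save side: replace each MFN with its PFN (machine-to-physical table). */
void canonicalize(uint64_t *ptes, size_t n, const uint64_t *m2p, size_t len)
{
    remap_ptes(ptes, n, m2p, len);
}

/* Restore side: replace each PFN with the new domain's MFN. */
void uncanonicalize(uint64_t *ptes, size_t n, const uint64_t *p2m, size_t len)
{
    remap_ptes(ptes, n, p2m, len);
}
```

The flag bits survive both passes untouched; only the frame number changes, which is why the saved image doesn't depend on which machine pages the domain happened to own.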

Eventually the restored domain is given back control, coming out from the HYPERVISOR_suspend() call. Here we need to rebuild the event channel setup, and anything else we tore down before suspending. Finally, we return back from the suspend handler and continue on our merry way.

Migration

xm migrate xen-7 remotehost

A normal save/restore cycle happens on the same machine, but migrating a domain to a separate machine is a simple extension of the process. Since our save operation has replaced any machine-specific frame number value with the pseudo-physical frames, we can easily do the restore on a remote machine, even though the actual hardware pages given to the domainU will be different. The remote machine must have the Xen daemon listening on the HTTP port, which is a simple change in its config file. Instead of writing each page's contents to a file, we can transmit it across HTTP to the Xen daemon running on a remote machine. The restore is done on that machine in the same manner as described above.

Live Migration

xm migrate --live xen-7 remotehost

The real magic happens with live migration, which keeps the time the domain is not running to a bare minimum (on the order of milliseconds). Live migration relies on the empirical observation that an OS instance is unlikely to modify a large percentage of its pages within a given time frame; thus, by iteratively copying over modified domain pages, we'll eventually reach a point where the remaining data to be copied is small enough that the actual downtime for a domainU is minimal.

In operation, the domain is switched to use a modified form of the shadow page tables described above, known as "log dirty" mode. In essence, a shadow page table is used to notify the hypervisor if a page has been written to, by keeping the PTE entry for the page read-only: an attempt to write to the page causes a page fault. This page fault is used to mark the domain page as "dirty" in a bitmap maintained by the hypervisor, which then fixes up the domain's page fault and allows it to continue.

Meanwhile, the domain management tools iteratively transfer the domain's pages to the remote machine. They read the dirty page bitmap and re-transmit any page that has been modified since it was last sent, until they reach a point where they can finally tell the domain to suspend, and switch over to running it on the remote machine. This process is described in more detail in Live Migration of Virtual Machines.
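The interplay between the dirty bitmap and the copy loop can be sketched in a few lines of C. Everything here is a toy model under stated assumptions: the names are invented, the "guest" is a callback that dirties fewer pages each round, and the stopping rule (a dirty-page threshold plus a round limit) is just one plausible choice, not what the Xen tools necessarily do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 64

typedef struct {
    uint8_t dirty[NPAGES / 8];      /* one bit per domain page */
} dirty_log_t;

/* What the write-fault path would do: record that a page changed. */
void log_dirty(dirty_log_t *log, size_t page)
{
    assert(page < NPAGES);
    log->dirty[page / 8] |= (uint8_t)(1u << (page % 8));
}

size_t count_dirty(const dirty_log_t *log)
{
    size_t n = 0;
    for (size_t page = 0; page < NPAGES; page++)
        if (log->dirty[page / 8] & (1u << (page % 8)))
            n++;
    return n;
}

/* "Send" every dirty page and clear its bit; return how many were sent. */
size_t send_dirty_pages(dirty_log_t *log)
{
    size_t sent = 0;
    for (size_t page = 0; page < NPAGES; page++) {
        uint8_t bit = (uint8_t)(1u << (page % 8));
        if (log->dirty[page / 8] & bit) {
            log->dirty[page / 8] &= (uint8_t)~bit;
            sent++;                 /* a real tool would transmit the page */
        }
    }
    return sent;
}

/* Toy guest: dirties 16 pages after round 1, 4 after round 2, then 1, then 0. */
void shrinking_workload(dirty_log_t *log, size_t round)
{
    size_t n = (2 * round < 7) ? (size_t)NPAGES >> (2 * round) : 0;
    for (size_t page = 0; page < n; page++)
        log_dirty(log, page);
}

/*
 * Copy rounds continue while too many pages are still dirty. On return the
 * caller would suspend the domain and send whatever remains dirty.
 */
size_t precopy(dirty_log_t *log, void (*guest_runs)(dirty_log_t *, size_t),
    size_t threshold, size_t max_rounds)
{
    size_t rounds = 0;

    for (size_t page = 0; page < NPAGES; page++)
        log_dirty(log, page);       /* the first round sends everything */

    while (rounds < max_rounds && count_dirty(log) > threshold) {
        (void) send_dirty_pages(log);
        rounds++;
        guest_runs(log, rounds);    /* the domain keeps running meanwhile */
    }
    return rounds;
}
```

With a threshold of 4 pages, the toy workload settles after two rounds, leaving 4 dirty pages to be sent during the brief suspension.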

Whilst transmitting all the pages takes a while, the actual time between suspension and resume is typically very small. Live migration is pretty fun to watch happen; you can be logged into the domain over ssh and not even notice that the domain has migrated to a different machine.

Further Work

Whilst live migration is currently working for our Solaris changes, there are still a number of improvements and fixes that need to be made.

On x86, we usually use the TSC register as the basis for a high-resolution timer (heavily used by the microstate accounting subsystem). We don't directly use any virtualisation of the TSC value, so when we restore a domain, we can see a large jump in the value, or even see it go backwards. We handle this OK (once we fixed bug 6228819 in our gate!), but don't yet properly handle the fact that the relationship between TSC ticks and clock frequency can change between a suspend and resume. This screws up our notion of timing.
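One way to think about the fix is to treat the clock as a base time plus a scaled TSC delta, and to re-anchor both the base and the frequency on resume. The sketch below is only an illustration of that idea, not the Solaris implementation; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSEC 1000000000ULL

/* Hypothetical clock state for this sketch. */
typedef struct {
    uint64_t tsc_base;      /* TSC reading when ns_base was recorded */
    uint64_t ns_base;       /* nanoseconds accumulated up to tsc_base */
    uint64_t tsc_hz;        /* TSC frequency of the current host */
} hires_clock_t;

/* Convert a TSC reading to nanoseconds on this clock. */
uint64_t clock_read_ns(const hires_clock_t *c, uint64_t tsc_now)
{
    assert(c->tsc_hz != 0);
    /* A TSC that appears to have gone backwards adds no elapsed time. */
    uint64_t delta = (tsc_now >= c->tsc_base) ? tsc_now - c->tsc_base : 0;
    /* Split into whole seconds and remainder to avoid 64-bit overflow. */
    uint64_t secs = delta / c->tsc_hz;
    uint64_t rem = delta % c->tsc_hz;
    return c->ns_base + secs * NANOSEC + rem * NANOSEC / c->tsc_hz;
}

/* On resume: keep the time already reached, re-anchor to the new host. */
void clock_rebase(hires_clock_t *c, uint64_t ns_at_suspend,
    uint64_t tsc_now, uint64_t new_tsc_hz)
{
    c->ns_base = ns_at_suspend;
    c->tsc_base = tsc_now;
    c->tsc_hz = new_tsc_hz;
}
```

Calling clock_rebase() with the new host's frequency after the domain comes back is what keeps the tick-to-time relationship correct across a migration in this model.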

We don't make any effort to release physical pages that we're not currently using. This makes suspend/resume take longer than it should, and it's probably worth investigating what can be done here.

Currently many hardware-specific instructions and features are enabled at boot by patching in instructions if we discover the CPU supports them. If such a kernel is migrated to a machine with different CPUs, the domain will naturally fail badly; for example, we saw a domain die when it was migrated to a host that didn't support the sfence instruction. We need to investigate preventing incompatible migrations (the standard Xen tools currently do no verification), and also look at whether we can adapt to some of these changes when we resume a domain.
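The kind of verification that's missing could be as simple as a bitmask comparison. The sketch below assumes the kernel records which optional features it patched in at boot and that the target host's features are known in the same encoding; the feature bits and function names are hypothetical, not an existing interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; a real kernel would derive these from CPUID. */
#define FEAT_SFENCE   (1u << 0)
#define FEAT_SYSENTER (1u << 1)
#define FEAT_SYSCALL  (1u << 2)
#define FEAT_KNOWN    (FEAT_SFENCE | FEAT_SYSENTER | FEAT_SYSCALL)

/* Features the kernel patched itself to use that the target host lacks. */
uint32_t missing_features(uint32_t used_at_boot, uint32_t target_host)
{
    assert((used_at_boot & ~FEAT_KNOWN) == 0);  /* only known bits expected */
    return used_at_boot & ~target_host;
}

/* A migration is safe only if nothing the kernel relies on is missing. */
int migration_is_safe(uint32_t used_at_boot, uint32_t target_host)
{
    return missing_features(used_at_boot, target_host) == 0;
}
```

Extra features on the target are harmless in this model; only a feature that was patched in and is absent on the target makes the migration unsafe.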


Generating assembly structure offset values with CTF

Jan 17, 2006
The Solaris kernel contains a fair amount of assembly, and this often needs to access C structures (and in particular know the size of such structures, and the byte offsets of their members). Since the assembler can't grok C, we need to provide constant values for it to use. This also applies to the C library and kmdb.

In the kernel, the header assym.h provides these values; for example:

#define T_STACK 0x4
#define T_SWAP  0x68
#define T_WCHAN 0x44

These values are the byte offset of certain members into struct _kthread. For each of the types we want to reference from assembly, a template is provided in one of the offsets.in files. For the above, we can see in usr/src/uts/i86pc/ml/offsets.in:

_kthread        THREAD_SIZE
        t_pcb                   T_LABEL
        t_lock
        t_lockstat
        t_lockp
        t_lock_flush
        t_kpri_req
        t_oldspl
        t_pri
        t_pil
        t_lwp
        t_procp
        t_link
        t_state
        t_mstate
        t_preempt_lk
        t_stk                   T_STACK
        t_swap
        t_lwpchan.lc_wchan      T_WCHAN
        t_flag                  T_FLAGS

This file contains structure names as well as their members. Each of the members listed (which do not have to be in order, nor does the list need to be complete) causes a define to be generated; by default, an uppercase version of the member name is used. As can be seen, this can be overridden by specifying a #define name to be used. The THREAD_SIZE define corresponds to the bytesize of the entire structure (it's also possible to generate a "shift" value, which is log2(size)).
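To show what kind of values end up in the header, here is a small sketch that computes an offset and a shift directly in C. Note that it uses the compiler's offsetof() rather than CTF data, so it is a different technique from the one the build uses; struct toy_thread and both function names are invented for the example and are not the real struct _kthread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a kernel structure. */
struct toy_thread {
    void *t_link;
    char *t_stk;
    int t_state;
    unsigned int t_flag;
};

/* Format one "#define NAME 0xVALUE" line in the style of assym.h. */
int format_define(char *buf, size_t len, const char *name, size_t value)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, len, "#define %s 0x%zx", name, value);
}

/* log2 of a power-of-two size, or -1 if the size has no exact shift. */
int size_shift(size_t size)
{
    int shift = 0;

    if (size == 0 || (size & (size - 1)) != 0)
        return -1;
    while ((size >> shift) != 1)
        shift++;
    return shift;
}
```

For instance, format_define(buf, sizeof (buf), "T_STACK", offsetof(struct toy_thread, t_stk)) produces the line for the t_stk member, and size_shift(sizeof (struct toy_thread)) gives the shift value when the structure size happens to be a power of two.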

To generate the header with the right offset and size values we need, a script is used to generate CTF data for the needed types, which then uses this data to output the assym.h header. This is a Perl script called genoffsets, and the build invokes it with a command line akin to:

genoffsets -s ctfstabs -r ctfconvert cc < offsets.in > assym.h

The hand-written offsets.in file serves as input to the script, and it generates the header we need. The script takes the following steps:

  1. Two temporary files are generated from the input. One is a C file consisting of #includes and any other pre-processor directives. The other contains the meat of the offsets file.
  2. The C file containing all the includes is built with the compile line given (I have stripped the compiler options above for readability).
  3. ctfconvert is run on the built .o file.
  4. The pre-processor is run across the second file (the temporary offsets file)
  5. This pre-processed file is passed to ctfstabs along with the .o file.

ctfstabs reads the input offsets file, and for each entry, looks up the relevant value in the CTF data contained in the .o file passed to it. It has two output modes (which I'll come to shortly), and in this case we are using the genassym driver to output the C header. As you can see, this is a fairly simple process of processing each line of the input and looking up the type data in the CTF contained in the .o file.

A similar process is used for generating forth debug files for use when debugging the kernel via the SPARC PROM. This takes a different format of offsets file more appropriate to generating the forth debug macros, described in the forth driver.

To finish off the output header, the output from a small program called genassym (or, on SPARC, genconst) is appended. It contains a bunch of printfs of constants. A lot of those don't actually need to be there since they're simple constant defines, and the assembly file could just include the right header, but others are still there for reasons such as:

  • The macros which hide assembler syntax differences such as _MUL aren't implemented for the C compiler
  • The value is an enum type, which ctfstabs doesn't support
  • The constant is a complicated composed macro that the assembler can't grok

and other reasons. Whilst a lot of these could be cleaned up and removed from these files, it's probably not worth the development effort except as a gradual change.


Resource management of services

Nov 19, 2005
SMF introduced the notion of a service as a first-order object in the Solaris OS. Thus, you have administration interfaces capable of dealing with services (as opposed to the implicit service represented by a set of processes, for example). It doesn't seem very well known, but as Stephen Hahn mentions, this also applies to the resource management facilities of Solaris.

A service can be bound to a project (as well as a resource pool, which I won't go into here). This allows us to add resource controls to the project which will apply to the service as a whole, which is significantly more reliable and usable than trying to deal with individual daemons etc. Unfortunately, it's not as obvious to set up as it should be (of which more later), so here's a simple walkthrough.

We're going to set up a simple 'forkbomb' service, which simply runs this program:

#include <unistd.h>
#include <stdlib.h>

int main()
{
        int first = 1;
        while (1) {
                if (fork() > 0 && first)
                        exit(0);
                first = 0;
        }
}

If you try running this program in an environment lacking resource controls, don't expect to be able to do much to your box except reboot it. Note the first parent does an exit(0) so that SMF doesn't think the service has failed (since we'll be a standard contract service). Here's the SMF manifest for our service:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='forkbomb'>
<service name='application/forkbomb' type='service' version='1'>
        <exec_method
            type='method'
            name='start'
            exec='/opt/forkbomb/bin/forkbomb'
            timeout_seconds='10'>
                <method_context project='forkbomb'>
                        <method_credential user='root' />
                </method_context>
        </exec_method>

        <exec_method
            type='method'
            name='stop'
            exec=':kill'
            timeout_seconds='10' />

        <instance name='default' enabled='false' />
</service>
</service_bundle>

Note that as well as setting the project in the method context, we've set a method credential; this is a workaround for a problem I'll come to later. Now we need to create the 'forkbomb' project for the service:

# projadd -K 'project.max-lwps=(privileged,100,deny)' forkbomb

Alternatively we could create a new user for the service to use, set the method credential to use that user, then change our 'forkbomb' project to allow the user to join it. It's important to note that this still works even for root, though, so that's what we've done here.

Finally, we can import the manifest as a service, then temporarily enable it (so it won't start next time we boot!):

# svccfg import /opt/forkbomb/manifest/forkbomb.xml
# svcadm enable -t forkbomb

The forkbomb is now running flat out, but under the constraints of the resource controls we set on its project. Thus we still have a running system, and have enough resources to disable our 'mis-behaving' service. Let's have a look at prstat:

Total: 148 processes, 266 lwps, load averages: 68.06, 20.50, 10.75
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21145 root      992K  244K run      1    0   0:00:03 1.4% forkbomb/1
 21132 root      992K  244K run     49    0   0:00:03 1.2% forkbomb/1
 21128 root      992K  244K run     31    0   0:00:03 1.1% forkbomb/1
 21113 root      992K  244K run     31    0   0:00:03 1.1% forkbomb/1
 21176 root      992K  244K run     33    0   0:00:03 1.1% forkbomb/1
 21124 root      992K  244K run     53    0   0:00:03 1.1% forkbomb/1
 21119 root      992K  244K run     52    0   0:00:03 1.1% forkbomb/1
 21156 root      992K  244K run     53    0   0:00:03 1.0% forkbomb/1
 21088 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21136 root      992K  244K run     43    0   0:00:03 1.0% forkbomb/1
 21133 root      992K  244K run     44    0   0:00:03 1.0% forkbomb/1
 21097 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21103 root      992K  244K run     56    0   0:00:03 1.0% forkbomb/1
 21092 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21183 root      992K  244K run     53    0   0:00:03 1.0% forkbomb/1
PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
   100      100   97M   24M   0.6%   0:04:47  95% forkbomb
     1        5   11M 8268K   0.3%   0:00:00 0.0% user.root
    10        3   18M 8060K   0.3%   0:00:00 0.0% group.staff
     0       40  135M   83M   2.6%   0:00:17 0.0% system
Total: 148 processes, 266 lwps, load averages: 70.60, 21.80, 11.24

As we might expect, there's a high system load (since our fork-bomb is ignoring the errors from fork() when it hits its resource limit). Note that the 'forkbomb' project has been clamped to a maximum of 100 LWPs, as you can see in the NPROC field. But most importantly, the system is still usable, and we can stop the troublesome service:

# svcadm disable forkbomb

After waiting a while for the stop method to finish (or time out; either way, all processes in the service contract will be killed), we're done!
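The forkbomb deliberately ignores the errors from fork(). For contrast, here is a hedged sketch of how a well-behaved program might react when it hits a limit such as project.max-lwps. The enum and function names are hypothetical, and treating EAGAIN and ENOMEM as "retry later" is a policy chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { FORK_OK, FORK_RETRY_LATER, FORK_GIVE_UP } fork_result_t;

/* Classify the errno value left behind by a failed fork(). */
fork_result_t classify_fork_error(int err)
{
    switch (err) {
    case 0:
        return FORK_OK;
    case EAGAIN:        /* a process or LWP limit was hit */
    case ENOMEM:        /* not enough memory or swap right now */
        return FORK_RETRY_LATER;
    default:
        return FORK_GIVE_UP;
    }
}

/* Fork one child that exits at once; reap it and report what happened. */
fork_result_t spawn_and_reap(void)
{
    pid_t pid = fork();

    if (pid == -1)
        return classify_fork_error(errno);
    if (pid == 0)
        _exit(0);       /* child: nothing to do in this sketch */
    assert(pid > 0);
    (void) waitpid(pid, NULL, 0);
    return FORK_OK;
}
```

A caller seeing FORK_RETRY_LATER would back off and sleep rather than spin, which is exactly what our forkbomb doesn't do, hence the load average.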

I mentioned above that we needed to specify a method credential to work around a bug. This is bug 5093847. The way the property lookup works currently, if the use_profile property on the service isn't found, then none of the rest of the method context is examined. Setting the method credential has the side-effect of creating this property, so things work properly. This bug would also be nice to fix since we could directly set the project property via svccfg if the properties for the method context were always created. Any interested parties are strongly encouraged to have a go at fixing it - it's not currently being worked on, and I'd be happy to help :)


Reducing CTF overhead

Nov 18, 2005
CTF (Compact C Type Format) encapsulates a reduced form of debugging information similar to DWARF and the venerable stabs. It describes types (structures, unions, typedefs etc.) and function prototypes, and is carefully designed to take a minimum of space in the ELF binaries. The kernel binaries that Sun ship have this data embedded as an ELF section (.SUNW_ctf) so that tools like mdb and dtrace can understand types. Of course, it would have been possible to use existing formats such as DWARF, but they typically have a large space overhead and are more difficult to process.

The CTF data is built from the existing stabs/DWARF data generated by the compiler's -g option, and replaces this existing debugging information in the output binary (ctfconvert performs this job).

For the sake of kmdb and crash dumps, the CTF data for each kernel binary is present in the memory image of a booted kernel. This implies it's paramount that the amount of CTF data is minimised. Since each kernel module will have references to common types such as cpu_t, there's a lot of duplicated type data in all the CTF sections. To help avoid this duplication, the kernel build uses a process known rather fancifully as 'uniquification'.

Uniquification

Each type in the CTF data has an integer ID associated with it. Observe that the main genunix kernel module has a large number of the common types I mention above in its CTF data. We can remove the duplicate data found in other modules by replacing the type data with references to the type data in genunix's CTF. This process is uniquification. Consider the bmc driver. After building and linking the bmc object, we want to add CTF for its types, but we also uniquify against the genunix binary, like so:

ctfmerge -L VERSION -d ../../intel/genunix/debug64/genunix -o debug64/bmc debug64/bmc_fe.o debug64/bmc_kcs.o

This command takes the CTF data in the objects comprising bmc (previously converted from stabs/DWARF by ctfconvert) and merges them together (removing any shared duplicates between the two different objects). Then it passes through this CTF data, and looks for any types that match ones in the uniqfile (which we specified with the -d option). For each matching type (for example, cpu_t), we replace any references to the local type definition with a reference to genunix's copy of the type data. Remember that type references are simply integer IDs, so this is just a matter of changing the type ID to the one found in genunix's CTF. Let's use ctfdump to look at the results:

$ ctfdump $SRC/uts/i86pc/bmc/debug64/bmc >bmc.ctf
$ ggrep -C2 bmc_kcs_send bmc.ctf
- Types ----------------------------------------------------------------------

  <32769> STRUCT bmc_kcs_send (3 bytes)
        fnlun type=113 off=0
        cmd type=113 off=8
        data type=5287 off=16
...

Here we see the first member of the struct bmc_kcs_send has a type ID of 113. Since this type ID isn't in the CTF, it must belong to our parent. We look for our parent, then find the type ID we're looking for:

$ grep cth_parname bmc.ctf
  cth_parname  = genunix
$ ctfdump $SRC/uts/intel/genunix/debug64/genunix >genunix.ctf
$ grep '<113>' genunix.ctf
  <113> TYPEDEF uint8_t refers to 86

This manual process is similar to how the CTF lookup actually happens. This uniquification process saves us a significant amount of CTF data, although it causes us some problems, which we'll discuss next.
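The lookup we just did by hand can be sketched in C. This is a toy model, not libctf: the structure and function names are invented, and the rule that a module's own type IDs start at 0x8000 is an assumption inferred from the <32769> (0x8001) ID in the dump above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: IDs at or above this value belong to the module itself. */
#define CHILD_ID_START 0x8000

typedef struct ctf_table {
    const struct ctf_table *parent;     /* NULL for the parent (genunix) */
    const char *const *names;           /* names[i] describes local type i + 1 */
    size_t ntypes;
} ctf_table_t;

/* Resolve a type ID, deferring to the parent for uniquified types. */
const char *lookup_type(const ctf_table_t *t, unsigned id)
{
    assert(t != NULL);
    if (t->parent != NULL) {
        if (id < CHILD_ID_START)
            return lookup_type(t->parent, id);  /* lives in the parent */
        id -= CHILD_ID_START;                   /* 0x8001 -> local type 1 */
    }
    if (id == 0 || id > t->ntypes)
        return NULL;                            /* no such type */
    return t->names[id - 1];
}
```

With a child table whose parent is genunix, looking up a low ID such as 113 goes straight to the parent's table, whilst 32769 resolves to the first type the module defines itself.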

CTF labels and additive merges

As noted above, all our uniquified modules will have type IDs that refer to the genunix shipped along with them. This means, of course, that if any of the types in genunix itself changes without these modules changing too, all the type references to genunix types will be wrong, since it works by type ID. So, what happens when we need to release kernel changes?

Since we obviously don't want to ship all these modules every time genunix needs to change, we have to keep the existing type IDs in the new genunix binary. But also, we want to have any new or changed types present and correct too. So, instead of doing a full merge and rewriting the existing CTF data in genunix, we perform an "additive merge". This retains the existing CTF types (and IDs) so that references from unchanged modules still point to the right types, and adds on new types.

To do an additive merge, we need to pass a 'withfile' to ctfmerge via its -w option. This first takes all the CTF in the withfile and adds it into the output CTF. Then the CTF from the objects passed to ctfmerge are uniquified against this data. Any remaining types after uniquification are then added on top of the withfile data. This preserves the existing type IDs for any older modules that uniquified against this genunix, whilst also adding the new types.

This 'withfile' is the previous version of genunix. When it was built the first time, we passed -L VERSION to ctfmerge. This adds a label with the value of the environment variable $VERSION. Typically this is something like Generic. When we do the additive merge, we pass in a different label equal to the patch ID of the build, and the additional types are marked with this label. For example, on a Solaris 9 system's genunix:

- Label Table ----------------------------------------------------------------

   5001 Generic
   5981 112233-12
...

Labels are nothing but a mapping from a string to a particular type ID. So here we see that the original types are numbered from 1 to 5001, and we've done an additive merge on top with the label "112233-12", which added more types.
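Here is a minimal sketch of an additive merge with labels. It is a toy model and not how ctfmerge is implemented: types are compared by name only (a real merge compares full type definitions, so a changed type would count as new), and the structure, limits and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TYPES  16
#define MAX_LABELS 4

typedef struct {
    const char *types[MAX_TYPES];       /* types[i] has type ID i + 1 */
    size_t ntypes;
    struct {
        const char *name;
        size_t last_id;                 /* highest type ID under this label */
    } labels[MAX_LABELS];
    size_t nlabels;
} ctf_image_t;

/* Return the existing ID of a type, or 0 if it is not present. */
size_t find_type(const ctf_image_t *img, const char *type)
{
    for (size_t i = 0; i < img->ntypes; i++)
        if (strcmp(img->types[i], type) == 0)
            return i + 1;
    return 0;
}

/*
 * Additive merge: existing types keep their IDs; only types that are not
 * already present are appended, and the label records the new highest ID.
 */
void additive_merge(ctf_image_t *img, const char *const *incoming, size_t n,
    const char *label)
{
    for (size_t i = 0; i < n; i++) {
        if (find_type(img, incoming[i]) != 0)
            continue;                   /* already there: ID is preserved */
        assert(img->ntypes < MAX_TYPES);
        img->types[img->ntypes++] = incoming[i];
    }
    assert(img->nlabels < MAX_LABELS);
    img->labels[img->nlabels].name = label;
    img->labels[img->nlabels].last_id = img->ntypes;
    img->nlabels++;
}
```

Merging first with the label "Generic" and then with a patch ID leaves the original IDs untouched and produces a second label entry pointing at the last appended type, much like the two-line label table above.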

CTF from the ip module

The genunix module contains many common types, but the ip module also contains a lot of types used by many kernel modules, but not found in genunix. To further reduce the amount of CTF in these modules, we merge in the CTF data found in ip into the genunix CTF. The modules can then uniquify against this combined data, removing many more duplicate types. Note that we don't do this for patch builds, as the ip module might not ship in a patch. Unfortunately this can cause problems (notably bug 6347000, though this isn't yet accessible from opensolaris.org).



VisitorVille

Jan 5, 2005
According to VisitorVille, nearly half of web users within Sun are using Windows. Seems a bit suspect.

Committees

Dec 15, 2004
Marvellous comment from Ian Hixie, over in the infamous Mozilla bug 25537:

> Keep in mind the old saying that a committee is a life form with six or more
> legs, but no brain.

This is no committee, it's a meritocratic elite dictatorship. In fact, listening
to everyone's input, such as yours, is what would make this a committee.

I agree that committee-driven design creates poor products.

John Peel dies

Oct 26, 2004
John Peel dies aged 65.

Pretty terrible news. John Peel was an excellent radio DJ, with an obvious passion for music. He was utterly hapless at putting the records on at the right speed, and it was all part of the charm of his show. I always kept meaning to listen to his R1 show more often, and now I don't get the chance.

$HOME

Sep 19, 2004
I've not been able to access my homedir (and hence my work mail) all day. I suspect this was a planned outage I've forgotten about, but it's still a big problem. And what kind of planned outage lasts all day?

blogs.sun.com

Jun 15, 2004
A comment on IRC:

so, is the whole blogs.sun.com thing a place for employees to vent their inevitable frustrations, but in a place where absolutely no one will ever see them and thus they can't cause any damage whatever?

How very perceptive :)

Roller seems to be a bit of a PITA. There's no trivial way to preview a post, for one. You have to save a draft, then "Edit" the draft, then post it properly. Seems rather a silly way of doing things.