Save/restore of MP Solaris domUs


In honour of our new release of OpenSolaris on Xen, here are some details of the changes I've made to support save/resume (and hence migration and live migration) with MP Solaris domUs. As before, to actually see the code I'm describing, you'll need to download the sources - sorry about that.

Under Xen, the suspend process is somewhat unusual, in that only the CPU context for (virtual) CPU0 is stored in the state file. This implies that the actual suspend operation must be performed on CPU0, and we have to find some other way of capturing CPU context (that is, the register set) for the other CPUs.

In the Linux implementation, Xen suspend/resume uses the standard CPU hotplug support for all CPUs other than CPU0. Whilst this works well for Linux, the approach is more troublesome for Solaris. Hot-unplugging the other CPUs doesn't fit well with how Solaris's notion of an "offline" CPU maps onto Xen's (the interested can read the big comment on line 406 of usr/src/uts/i86xen/os/mp_xen.c for a description of how this mapping currently works). In particular, an offline CPU on Solaris still participates in IPIs, whilst a "down" VCPU in Xen cannot.

In addition, the standard CPU offlining code in Solaris is not built for this purpose; for example, it will refuse to offline a CPU with bound threads, or the last CPU in a partition.

However, all we really need to do is get the other CPUs into a known state which we can recover during the resume process. All the dispatcher data structures etc. associated with the CPUs can remain in place. To this end, we can use pause_cpus() on the other CPUs. By replacing the pause handler with a special routine (cpu_pause_suspend()), we can store each CPU's context via a setjmp(), then wait until all CPUs have reached the barrier. We also need to disable interrupts (or rather, Xen's virtualized equivalent of interrupts), as we have to tear down all the interrupts as part of the suspend process, and we need to ensure none of the CPUs go wandering off.
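To make that concrete, here's a minimal sketch of what such a pause handler can look like. It's illustrative only: the names cpu_suspend_ctx and cpus_paused, the use of intr_clear()/intr_restore() for masking the virtualized interrupts, and the spin-until-downed loop are assumptions for the purpose of the example, not the actual code (which, as mentioned, is in the downloadable sources); the usual kernel headers and declarations are omitted.

    /*
     * Illustrative sketch only -- not the real cpu_pause_suspend().
     */
    static label_t cpu_suspend_ctx[NCPU];     /* per-CPU saved register context */
    static volatile ulong_t cpus_paused;      /* barrier: CPUs that have checked in */

    static void
    cpu_pause_suspend(void)
    {
            processorid_t id = CPU->cpu_id;
            ulong_t flags;

            /*
             * Mask this CPU's (virtual) interrupts so it can't be drawn
             * away while CPU0 tears down the interrupt machinery.
             */
            flags = intr_clear();

            /*
             * setjmp() captures our register context.  On the way down it
             * returns 0: we check in at the barrier and spin.  On resume,
             * the rebuilt context makes it appear to return 1, and we fall
             * through.
             */
            if (setjmp(&cpu_suspend_ctx[id]) == 0) {
                    atomic_inc_ulong(&cpus_paused);
                    for (;;)
                            SMT_PAUSE();    /* parked until Xen downs this VCPU */
            }

            /* Resumed: restore the interrupt state and carry on. */
            intr_restore(flags);
    }

CPU0 waits for cpus_paused to reach the number of other running CPUs before it proceeds with the rest of the suspend sequence.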

Once all CPUs are blocked at the known synchronisation point, we can tell Xen to "down" the other VCPUs so they can no longer run, and complete the remaining cleanup we need to do before we tell Xen we're ready to stop via HYPERVISOR_suspend().
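In code, the CPU0 side of that sequence looks roughly like the sketch below. VCPUOP_down and HYPERVISOR_suspend() are the real Xen interfaces the text refers to; the function name, the walk over the cpu[] array, and the start_info_mfn argument are placeholders here, and error handling plus most of the teardown are elided.

    /*
     * Rough sketch of the final suspend steps, run on CPU0.
     */
    static void
    xen_suspend_domu(void)
    {
            processorid_t id;

            /*
             * Every other CPU is spinning in cpu_pause_suspend(); take
             * its VCPU down so it can no longer be scheduled by Xen.
             */
            for (id = 1; id < NCPU; id++) {
                    if (cpu[id] == NULL)
                            continue;
                    (void) HYPERVISOR_vcpu_op(VCPUOP_down, id, NULL);
            }

            /* ... tear down event channels, grant tables, interrupts ... */

            /*
             * Tell Xen we're ready to be checkpointed.  When the domain
             * is restored (or the suspend is cancelled), execution
             * continues here on CPU0.
             */
            (void) HYPERVISOR_suspend(start_info_mfn);
    }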

On resume, we come back on CPU0, as Xen stored the context for that CPU itself. After redoing some of the setup we tore down during suspend, we can move on to resuming the other CPUs. For each CPU, we call mach_cpucontext_restore(), which uses the same Xen call used to create the CPUs during initial boot. In this routine, we fiddle a little bit with the context saved in the jmpbuf by setjmp(); because we're not actually returning via a normal longjmp() call, we need to emulate it. This means adjusting the stack pointer to simulate a ret, and pretending we've returned 1 from setjmp() by setting the %eax or %rax register in the context.
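A sketch of the idea, for the 64-bit case: vcpu_guest_context_t and the VCPUOP_initialise/VCPUOP_up hypercalls are the public Xen interface used at boot, but the LABEL_PC/LABEL_SP/LABEL_BP index names are illustrative rather than the real label_t layout, and the actual routine fills in far more of the context (segments, CR3, kernel stack, flags and so on).

    static int
    mach_cpucontext_restore(cpu_t *cp)
    {
            vcpu_guest_context_t vgc;
            label_t *ljb = &cpu_suspend_ctx[cp->cpu_id];

            bzero(&vgc, sizeof (vgc));

            /*
             * Make the CPU look as if it is just returning from the
             * setjmp() in cpu_pause_suspend():
             *   - %rip is the saved return address;
             *   - %rsp is bumped past that address, as a "ret" would do;
             *   - %rax is 1, what a longjmp()-style return would yield.
             */
            vgc.user_regs.rip = ljb->val[LABEL_PC];   /* illustrative indices */
            vgc.user_regs.rsp = ljb->val[LABEL_SP] + sizeof (uintptr_t);
            vgc.user_regs.rbp = ljb->val[LABEL_BP];
            vgc.user_regs.rax = 1;
            /* ... segment selectors, CR3, kernel stack, etc. also set ... */

            /* The same hypercalls used to bring the CPU up at boot. */
            if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cp->cpu_id, &vgc) != 0)
                    return (-1);
            return (HYPERVISOR_vcpu_op(VCPUOP_up, cp->cpu_id, NULL));
    }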

When each CPU's context is created, it will look as if it's just returned from the setjmp() in cpu_pause_suspend(), and will continue upon its merry way.

Inevitably, being a work-in-progress, there are still bugs and unresolved issues. Since offline CPUs won't participate in a cpu_pause(), we need to make sure that those CPUs (which will typically be sitting in the idle loop) are safe; currently this isn't being done. There are also some open issues with 64-bit live migration, and suspending SMP domains with virtual disks, which we're busy working on.
