xVM Under The Hood: seg_mf


An occasional series wherein I'll describe a part of the xVM implementation. Today, I'll be talking about seg_mf. You may want to read through my explanation of live migration and MMU virtualization first.

The control domain (dom0) often needs access to memory pages that belong to a running guest domain. The most obvious example of this is in constructing the domain during boot, but it's also needed for mapping the shared virtual guest console page, generating guest domain core dumps, etc.

This is implemented via the privcmd driver. Each process that needs to map some area of a guest domain's memory maps a range of anonymous virtual memory. The process then sends a request to the driver to map in a given range or set of machine frames into the given virtual address range. The two requests (IOCTL_PRIVCMD_MMAP and IOCTL_PRIVCMD_MMAP_BATCH) are more or less the same, although the latter allows the user to track MFNs that couldn't be mapped (see below).

Both ioctl()s hook into the seg_mf code. This is a normal Solaris segment driver (see Solaris Internals) with a hook that's used to store the arrays of MFN values that each VA range is to be backed by. This segment driver is a little unusual though: it does not support demand faulting. That is, every page in the segment is faulted in (and locked in) at the time of the ioctl(). This is needed to support the error-reporting interface described below, but it also helps simplify the driver significantly.

To fault the range, we go through each page-size chunk in the mapping. We need to establish a mapping from the virtual address of the chunk to the actual machine frame holding the page owned by the guest domain. This happens in segmf_faultpage(). The HAT isn't used to our strange request, so we load a temporary mapping at the given VA, and replace that with a mapping to the real underlying MFN via HYPERVISOR_update_va_mapping_otherdomain().

Normally, the MFNs given via the ioctl() should be mappable. One exception is HVM live migration. This was implemented, somewhat confusingly, to use the same interfaces but pass GMFNs not MFNs. In particular, for HVM guests, a guest MFN (what a guest thinks is a real machine frame number) is actually a pseudo-physical frame number. As a result, due to ballooning, or PV drivers, etc., this GMFN may not have a real MFN backing it, so the attempt to map it will fail. We mark the MFN as failed in the outgoing array of IOCTL_PRIVCMD_MMAP_BATCH and let the client deal with it. This is generally OK, since the iterative nature of live migration means we can still get to all the pages we need.

One nice enhancement would be to extend pmap to recognise such mappings. In particular qemu-dm has a bunch of such mappings. It'd be relatively easy to mark such mappings as coming from seg_mf. Extra marks for listing the MFN ranges too, though that's a little harder :)

Tags: