Exploiting the Linux kernel via packet sockets

Guest blog post, posted by Andrey Konovalov

Introduction

Lately I’ve been spending some time fuzzing network-related Linux kernel interfaces with syzkaller. Besides the recently discovered vulnerability in DCCP sockets, I also found another one, this time in packet sockets. This post describes how the bug was discovered and how we can exploit it to escalate privileges.

The bug itself (CVE-2017-7308) is a signedness issue, which leads to an exploitable heap out-of-bounds write. It can be triggered by providing specific parameters to the PACKET_RX_RING option on an AF_PACKET socket with the TPACKET_V3 ring buffer version enabled. As a result the following sanity check in the packet_set_ring() function in net/packet/af_packet.c can be bypassed, which later leads to an out-of-bounds access.

                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;

The bug was introduced on Aug 19, 2011 in the commit f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation") together with the TPACKET_V3 implementation. There was an attempt to fix it on Aug 15, 2014 in commit dc808110 ("packet: handle too big packets for PACKET_V3") by adding additional checks, but this was not sufficient, as shown below. The bug was fixed in 2b6867c2 ("net/packet: fix overflow in check for priv area size") on Mar 29, 2017.

The bug affects a kernel if it has AF_PACKET sockets enabled (CONFIG_PACKET=y), which is the case for many Linux kernel distributions. Exploitation requires the CAP_NET_RAW privilege to be able to create such sockets. However, it’s possible to do that from a user namespace if they are enabled (CONFIG_USER_NS=y) and accessible to unprivileged users.

Since packet sockets are a quite widely used kernel feature, this vulnerability affects a number of popular Linux kernel distributions including Ubuntu and Android. It should be noted that access to AF_PACKET sockets is expressly disallowed to any untrusted code within Android, although it is available to some privileged components. Updated Ubuntu kernels are already out, Android’s update is scheduled for July.

Syzkaller


The bug was found with syzkaller, a coverage-guided syscall fuzzer, and KASAN, a dynamic memory error detector. I’m going to provide some details on how syzkaller works and how to use it for fuzzing some kernel interface in case someone decides to try this.

Let’s start with a quick overview of how the syzkaller fuzzer works. Syzkaller is able to generate random programs (sequences of syscalls) based on manually written template descriptions for each syscall. The fuzzer executes these programs and collects code coverage for each of them. Using the coverage information, syzkaller keeps a corpus of programs, which trigger different code paths in the kernel. Whenever a new program triggers a new code path (i.e. gives new coverage), syzkaller adds it to the corpus. Besides generating completely new programs, syzkaller is able to mutate the existing ones from the corpus.

Syzkaller is meant to be used together with dynamic bug detectors like KASAN (detects memory bugs like out-of-bounds and use-after-frees, available upstream since 4.0), KMSAN (detects uses of uninitialized memory, prototype was just released) or KTSAN (detects data races, prototype is available). The idea is that syzkaller stresses the kernel and executes various interesting code paths and the detectors detect and report bugs.

The usual workflow for finding bugs with syzkaller is as follows:
  1. Set up syzkaller and make sure it works. The README and wiki provide quite extensive information on how to do that.
  2. Write template descriptions for the particular kernel interface you want to test.
  3. Specify the syscalls that are used in this interface in the syzkaller config.
  4. Run syzkaller until it finds bugs. Usually this happens quite fast for interfaces that haven’t been tested with it previously.

Syzkaller uses its own declarative language to describe syscall templates. Check out sys/sys.txt for an example or sys/README.md for information on the syntax. Here’s an excerpt from the syzkaller descriptions for AF_PACKET sockets that I used to discover the bug:

resource sock_packet[sock]

define ETH_P_ALL_BE htons(ETH_P_ALL)

socket$packet(domain const[AF_PACKET], type flags[packet_socket_type], proto const[ETH_P_ALL_BE]) sock_packet

packet_socket_type = SOCK_RAW, SOCK_DGRAM

setsockopt$packet_rx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_RX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
setsockopt$packet_tx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_TX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])

tpacket_req {
tp_block_size int32
tp_block_nr int32
tp_frame_size int32
tp_frame_nr int32
}

tpacket_req3 {
tp_block_size int32
tp_block_nr int32
tp_frame_size int32
tp_frame_nr int32
tp_retire_blk_tov int32
tp_sizeof_priv int32
tp_feature_req_word int32
}

tpacket_req_u [
req tpacket_req
req3 tpacket_req3
] [varlen]

The syntax is mostly self-explanatory. First, we declare a new type sock_packet. This type is inherited from an existing type sock. That way syzkaller will use syscalls which have arguments of type sock on sock_packet sockets as well.

After that, we declare a new syscall socket$packet. The part before the $ sign tells syzkaller what syscall it should use, and the part after the $ sign is used to differentiate between different kinds of the same syscall. This is particularly useful when dealing with syscalls like ioctl. The socket$packet syscall returns a sock_packet socket.

Then setsockopt$packet_rx_ring and setsockopt$packet_tx_ring are declared. These syscalls set the PACKET_RX_RING and PACKET_TX_RING socket options on a sock_packet socket. I’ll talk about these options in detail below. Both of them use the tpacket_req_u union as a socket option value. This union has two struct members, tpacket_req and tpacket_req3.

Once the descriptions are added, syzkaller can be instructed to fuzz packet-related syscalls specifically. This is what I provided in the syzkaller manager config:

"enable_syscalls": [
"socket$packet", "socketpair$packet", "accept$packet", "accept4$packet", "bind$packet", "connect$packet", "sendto$packet", "recvfrom$packet", "getsockname$packet", "getpeername$packet", "listen", "setsockopt", "getsockopt", "syz_emit_ethernet"
],

After a few minutes of running syzkaller with these descriptions I started getting kernel crashes. Here’s one of the syzkaller programs that triggered the mentioned bug:

mmap(&(0x7f0000000000/0xc8f000)=nil, (0xc8f000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = socket$packet(0x11, 0x3, 0x300)
setsockopt$packet_int(r0, 0x107, 0xa, &(0x7f000061f000)=0x2, 0x4)
setsockopt$packet_rx_ring(r0, 0x107, 0x5, &(0x7f0000c8b000)=@req3={0x10000, 0x3, 0x10000, 0x3, 0x4, 0xfffffffffffffffe, 0x5}, 0x1c)

And here’s one of the KASAN reports. It should be noted that since the access is quite far past the block bounds, the allocation and deallocation stacks don’t correspond to the overflown object.

==================================================================
BUG: KASAN: slab-out-of-bounds in prb_close_block net/packet/af_packet.c:808
Write of size 4 at addr ffff880054b70010 by task syz-executor0/30839

CPU: 0 PID: 30839 Comm: syz-executor0 Not tainted 4.11.0-rc2+ #94
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x292/0x398 lib/dump_stack.c:52
print_address_description+0x73/0x280 mm/kasan/report.c:246
kasan_report_error mm/kasan/report.c:345 [inline]
kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:368
kasan_report mm/kasan/report.c:393 [inline]
__asan_report_store4_noabort+0x2c/0x30 mm/kasan/report.c:393
prb_close_block net/packet/af_packet.c:808 [inline]
prb_retire_current_block+0x6ed/0x820 net/packet/af_packet.c:970
__packet_lookup_frame_in_block net/packet/af_packet.c:1093 [inline]
packet_current_rx_frame net/packet/af_packet.c:1122 [inline]
tpacket_rcv+0x9c1/0x3750 net/packet/af_packet.c:2236
packet_rcv_fanout+0x527/0x810 net/packet/af_packet.c:1493
deliver_skb net/core/dev.c:1834 [inline]
__netif_receive_skb_core+0x1cff/0x3400 net/core/dev.c:4117
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4244
netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4272
netif_receive_skb+0xae/0x3b0 net/core/dev.c:4296
tun_rx_batched.isra.39+0x5e5/0x8c0 drivers/net/tun.c:1155
tun_get_user+0x100d/0x2e20 drivers/net/tun.c:1327
tun_chr_write_iter+0xd8/0x190 drivers/net/tun.c:1353
call_write_iter include/linux/fs.h:1733 [inline]
new_sync_write fs/read_write.c:497 [inline]
__vfs_write+0x483/0x760 fs/read_write.c:510
vfs_write+0x187/0x530 fs/read_write.c:558
SYSC_write fs/read_write.c:605 [inline]
SyS_write+0xfb/0x230 fs/read_write.c:597
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x40b031
RSP: 002b:00007faacbc3cb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 000000000040b031
RDX: 000000000000002a RSI: 0000000020002fd6 RDI: 0000000000000015
RBP: 00000000006e2960 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000708000
R13: 000000000000002a R14: 0000000020002fd6 R15: 0000000000000000

Allocated by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:617
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
slab_post_alloc_hook mm/slab.h:456 [inline]
slab_alloc_node mm/slub.c:2720 [inline]
slab_alloc mm/slub.c:2728 [inline]
kmem_cache_alloc+0x1af/0x250 mm/slub.c:2733
getname_flags+0xcb/0x580 fs/namei.c:137
getname+0x19/0x20 fs/namei.c:208
do_sys_open+0x2ff/0x720 fs/open.c:1045
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064
entry_SYSCALL_64_fastpath+0x1f/0xc2

Freed by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:590
slab_free_hook mm/slub.c:1358 [inline]
slab_free_freelist_hook mm/slub.c:1381 [inline]
slab_free mm/slub.c:2963 [inline]
kmem_cache_free+0xb5/0x2d0 mm/slub.c:2985
putname+0xee/0x130 fs/namei.c:257
do_sys_open+0x336/0x720 fs/open.c:1060
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064
entry_SYSCALL_64_fastpath+0x1f/0xc2

Object at ffff880054b70040 belongs to cache names_cache of size 4096
The buggy address belongs to the page:
page:ffffea000152dc00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 0000000000000000 0000000000000000 0000000100070007
raw: ffffea0001549a20 ffffea0001b3cc20 ffff88003eb44f40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff880054b6ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880054b6ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880054b70000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                        ^
ffff880054b70080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880054b70100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

You can find more details about syzkaller in its repository and more details about KASAN in the kernel documentation. If you decide to try syzkaller or KASAN and run into any trouble, drop an email to syzkaller@googlegroups.com or to kasan-dev@googlegroups.com.

Introduction to AF_PACKET sockets


To better understand the bug, the vulnerability it leads to and how to exploit it, we need to understand what AF_PACKET sockets are and how they are implemented in the kernel.

Overview


AF_PACKET sockets allow users to send or receive packets on the device driver level. This for example lets them implement their own protocol on top of the physical layer or sniff packets including Ethernet and higher level protocol headers. To create an AF_PACKET socket a process must have the CAP_NET_RAW capability in the user namespace that governs its network namespace. More details can be found in the packet sockets documentation. It should be noted that if a kernel has unprivileged user namespaces enabled, then an unprivileged user is able to create packet sockets.

To send and receive packets on a packet socket, a process can use the send and recv syscalls. However, packet sockets provide a way to do this faster by using a ring buffer that’s shared between the kernel and the userspace. A ring buffer can be created via the PACKET_TX_RING and PACKET_RX_RING socket options. The ring buffer can then be mmapped by the user and the packet data can then be read or written directly to it.

There are a few different variants of the way the ring buffer is handled by the kernel. The variant can be chosen by the user via the PACKET_VERSION socket option. The differences between ring buffer versions are described in the kernel documentation (search for “TPACKET versions”).

One of the widely known users of AF_PACKET sockets is the tcpdump utility. This is roughly what happens when tcpdump is used to sniff all packets on a particular interface:

# strace tcpdump -i eth0
...
socket(PF_PACKET, SOCK_RAW, 768)        = 3
...
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
...
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
...
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65616, frame_nr=31}, 16) = 0
...
mmap(NULL, 4063232, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f73a6817000
...

This sequence of syscalls corresponds to the following actions:
  1. A socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) is created.
  2. The socket is bound to the eth0 interface.
  3. The ring buffer version is set to TPACKET_V2 via the PACKET_VERSION socket option.
  4. A ring buffer is created via the PACKET_RX_RING socket option.
  5. The ring buffer is mmapped in the userspace.

After that the kernel will start putting all packets coming through the eth0 interface in the ring buffer and tcpdump will read them from the mmapped region in the userspace.



Ring buffers


Let’s see how to use ring buffers for packet sockets. For consistency all of the kernel code snippets below will come from the Linux kernel 4.8. This is the version the latest Ubuntu 16.04.2 kernel is based on.

The existing documentation mostly focuses on the TPACKET_V1 and TPACKET_V2 ring buffer versions. Since the mentioned bug only affects the TPACKET_V3 version, I’m going to assume that we deal with that particular version for the rest of the post. Also I’m going to mostly focus on PACKET_RX_RING, ignoring PACKET_TX_RING.

A ring buffer is a memory region used to store packets. Each packet is stored in a separate frame. Frames are grouped into blocks. In TPACKET_V3 ring buffers the frame size is not fixed and can have an arbitrary value as long as a frame fits into a block.

To create a TPACKET_V3 ring buffer via the PACKET_RX_RING socket option a user must provide the exact parameters for the ring buffer. These parameters are passed to the setsockopt call via a pointer to a request struct called tpacket_req3, which is defined as:

struct tpacket_req3 {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
        unsigned int    tp_retire_blk_tov; /* timeout in msecs */
        unsigned int    tp_sizeof_priv; /* offset to private data area */
        unsigned int    tp_feature_req_word;
};

Here’s what each field means in the tpacket_req3 struct:
  1. tp_block_size - the size of each block.
  2. tp_block_nr - the number of blocks.
  3. tp_frame_size - the size of each frame, ignored for TPACKET_V3.
  4. tp_frame_nr - the number of frames, ignored for TPACKET_V3.
  5. tp_retire_blk_tov - timeout after which a block is retired, even if it’s not fully filled with data (see below).
  6. tp_sizeof_priv - the size of the per-block private area. This area can be used by a user to store arbitrary information associated with each block.
  7. tp_feature_req_word - a set of flags (actually just one at the moment), which allows enabling some additional functionality.

Each block has an associated header, which is stored at the very beginning of the memory area allocated for the block. The block header struct is called tpacket_block_desc and has a block_status field, which indicates whether the block is currently being used by the kernel or available to the user. The usual workflow is that the kernel stores packets into a block until it’s full and then sets block_status to TP_STATUS_USER. The user then reads the required data from the block and releases it back to the kernel by setting block_status to TP_STATUS_KERNEL.

struct tpacket_hdr_v1 {
        __u32   block_status;
        __u32   num_pkts;
        __u32   offset_to_first_pkt;
...
};

union tpacket_bd_header_u {
        struct tpacket_hdr_v1 bh1;
};

struct tpacket_block_desc {
        __u32 version;
        __u32 offset_to_priv;
        union tpacket_bd_header_u hdr;
};

Each frame also has an associated header described by the struct tpacket3_hdr. The tp_next_offset field points to the next frame within the same block.

struct tpacket3_hdr {
        __u32 tp_next_offset;
...
};

When a block is fully filled with data (a new packet doesn’t fit into the remaining space), it’s closed and released to userspace or “retired” by the kernel. Since the user usually wants to see packets as soon as possible, the kernel can release a block even if it’s not filled with data completely. This is done by setting up a timer that retires the current block with a timeout controlled by the tp_retire_blk_tov parameter.

There’s also a way to specify a per-block private area, which the kernel won’t touch and the user can use to store any information associated with a block. The size of this area is passed via the tp_sizeof_priv parameter.

If you’d like to better understand how a userspace program can use a TPACKET_V3 ring buffer, you can read the example provided in the documentation (search for “TPACKET_V3 example”).


Implementation of AF_PACKET sockets


Let’s take a quick look at how some of this is implemented in the kernel.

Struct definitions


Whenever a packet socket is created, an associated packet_sock struct is allocated in the kernel:

struct packet_sock {
...
        struct sock             sk;
...
        struct packet_ring_buffer       rx_ring;
        struct packet_ring_buffer       tx_ring;
...
        enum tpacket_versions   tp_version;
...
        int                     (*xmit)(struct sk_buff *skb);
...
};

The tp_version field in this struct holds the ring buffer version, which in our case is set to TPACKET_V3 by a PACKET_VERSION setsockopt call. The rx_ring and tx_ring fields describe the receive and transmit ring buffers in case they are created via PACKET_RX_RING and PACKET_TX_RING setsockopt calls. These two fields have type packet_ring_buffer, which is defined as:

struct packet_ring_buffer {
        struct pgv              *pg_vec;
...
        struct tpacket_kbdq_core        prb_bdqc;
};

The pg_vec field is a pointer to an array of pgv structs, each of which holds a reference to a block. Blocks are actually allocated separately, not as one contiguous memory region.

struct pgv {
        char *buffer;
};



The prb_bdqc field is of type tpacket_kbdq_core and its fields describe the current state of the ring buffer:

struct tpacket_kbdq_core {
...
        unsigned short  blk_sizeof_priv;
...
        char            *nxt_offset;
...
        struct timer_list retire_blk_timer;
};

The blk_sizeof_priv field contains the size of the per-block private area. The nxt_offset field points inside the currently active block and shows where the next packet should be saved. The retire_blk_timer field has type timer_list and describes the timer which retires the current block on timeout.

struct timer_list {
...
        struct hlist_node       entry;
        unsigned long           expires;
        void                    (*function)(unsigned long);
        unsigned long           data;
...
};

Ring buffer setup


The kernel uses the packet_setsockopt() function to handle setting socket options for packet sockets. When the PACKET_VERSION socket option is used, the kernel sets po->tp_version to the provided value.

With the PACKET_RX_RING socket option a receive ring buffer is created. Internally it’s done by the packet_set_ring() function. This function does a lot of things, so I’ll just show the important parts. First, packet_set_ring() performs a bunch of sanity checks on the provided ring buffer parameters:

                err = -EINVAL;
                if (unlikely((int)req->tp_block_size <= 0))
                        goto out;
                if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
                        goto out;
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
                if (unlikely(req->tp_frame_size < po->tp_hdrlen +
                                        po->tp_reserve))
                        goto out;
                if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
                        goto out;

                rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
                if (unlikely(rb->frames_per_block == 0))
                        goto out;
                if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
                                        req->tp_frame_nr))
                        goto out;

Then, it allocates the ring buffer blocks:

                err = -ENOMEM;
                order = get_order(req->tp_block_size);
                pg_vec = alloc_pg_vec(req, order);
                if (unlikely(!pg_vec))
                        goto out;

It should be noted that alloc_pg_vec() uses the kernel page allocator to allocate blocks (we’ll use this in the exploit):

static char *alloc_one_pg_vec_page(unsigned long order)
{
...
        buffer = (char *) __get_free_pages(gfp_flags, order);
        if (buffer)
                return buffer;
...
}

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
...
        for (i = 0; i < block_nr; i++) {
                pg_vec[i].buffer = alloc_one_pg_vec_page(order);
...
        }
...
}

Finally, packet_set_ring() calls init_prb_bdqc(), which performs some additional steps to set up a TPACKET_V3 receive ring buffer specifically:

                switch (po->tp_version) {
                case TPACKET_V3:
...
                        if (!tx_ring)
                                init_prb_bdqc(po, rb, pg_vec, req_u);
                        break;
                default:
                        break;
                }

The init_prb_bdqc() function copies the provided ring buffer parameters to the prb_bdqc field of the ring buffer struct, calculates some other parameters based on them, sets up the block retire timer and calls prb_open_block() to initialize the first block:

static void init_prb_bdqc(struct packet_sock *po,
                        struct packet_ring_buffer *rb,
                        struct pgv *pg_vec,
                        union tpacket_req_u *req_u)
{
        struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
        struct tpacket_block_desc *pbd;
...
        pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
        p1->pkblk_start = pg_vec[0].buffer;
        p1->kblk_size = req_u->req3.tp_block_size;
...
        p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;

        p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
        prb_init_ft_ops(p1, req_u);
        prb_setup_retire_blk_timer(po);
        prb_open_block(p1, pbd);
}

One of the things that the prb_open_block() function does is set the nxt_offset field of the tpacket_kbdq_core struct to point right after the per-block private area:

static void prb_open_block(struct tpacket_kbdq_core *pkc1,
        struct tpacket_block_desc *pbd1)
{
...
        pkc1->pkblk_start = (char *)pbd1;
        pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
...
}

Packet reception


Whenever a new packet is received, the kernel is supposed to save it into the ring buffer. The key function here is __packet_lookup_frame_in_block(), which does the following:
  1. Checks whether the currently active block has enough space for the packet.
  2. If yes, saves the packet to the current block and returns.
  3. If no, dispatches the next block and saves the packet there.

static void *__packet_lookup_frame_in_block(struct packet_sock *po,
                                            struct sk_buff *skb,
                                            int status,
                                            unsigned int len
                                            )
{
        struct tpacket_kbdq_core *pkc;
        struct tpacket_block_desc *pbd;
        char *curr, *end;

        pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
        pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
...
        curr = pkc->nxt_offset;
        pkc->skb = skb;
        end = (char *)pbd + pkc->kblk_size;

        /* first try the current block */
        if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }

        /* Ok, close the current block */
        prb_retire_current_block(pkc, po, 0);

        /* Now, try to dispatch the next block */
        curr = (char *)prb_dispatch_next_block(pkc, po);
        if (curr) {
                pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }
...
}

Vulnerability


Bug


Let’s look closely at the following check from packet_set_ring():

                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;

This is supposed to ensure that the length of the block header together with the per-block private data is not bigger than the size of the block. This totally makes sense: otherwise we won’t have enough space in the block for them, let alone the packet data.

However, it turns out this check can be bypassed. If req_u->req3.tp_sizeof_priv has the highest bit set, casting the expression to int results in a big positive value instead of a negative one. To illustrate this behavior:

A = req->tp_block_size = 4096 = 0x1000
B = req_u->req3.tp_sizeof_priv = (1 << 31) + 4096 = 0x80001000
BLK_PLUS_PRIV(B) = (1 << 31) + 4096 + 48 = 0x80001030
A - BLK_PLUS_PRIV(B) = 0x1000 - 0x80001030 = 0x7fffffd0
(int)0x7fffffd0 = 0x7fffffd0 > 0

Later, when req_u->req3.tp_sizeof_priv is copied to p1->blk_sizeof_priv in init_prb_bdqc() (see the snippet above), it’s truncated to its two lower bytes, since the type of the latter is unsigned short. So this bug basically allows us to set the blk_sizeof_priv of the tpacket_kbdq_core struct to an arbitrary value, bypassing all sanity checks.

Consequences


If we search through the net/packet/af_packet.c source looking for blk_sizeof_priv usage, we’ll find that it’s being used in the following two places.

The first one is in init_prb_bdqc() right after it gets assigned (see the code snippet above) to set max_frame_len. The value of p1->max_frame_len denotes the maximum size of a frame that can be saved into a block. Since we control p1->blk_sizeof_priv, we can make BLK_PLUS_PRIV(p1->blk_sizeof_priv) bigger than p1->kblk_size. This will result in p1->max_frame_len having a huge value, higher than the size of a block. This allows us to bypass the size check when a frame is being copied into a block, thus causing a kernel heap out-of-bounds write.

That’s not all. Another user of blk_sizeof_priv is prb_open_block(), which initializes a block (the code snippet is above as well). There pkc1->nxt_offset denotes the address where the kernel will write a new packet when it’s received. The kernel doesn’t intend to overwrite the block header and per-block private data, so it makes this address point right after them. Since we control blk_sizeof_priv, we can control the lowest two bytes of nxt_offset. This allows us to control the offset of the out-of-bounds write.

To sum up, this bug leads to a kernel heap out-of-bounds write of controlled maximum size and controlled offset up to about 64k bytes.

Exploitation


Let’s consider how nosotros tin transportation away exploit this vulnerability. I’m going to hold out targeting x86-64 Ubuntu 16.04.2 amongst 4.8.0-41-generic gist version amongst KASLR, SMEP too SMAP enabled. Ubuntu gist has user namespaces available to unprivileged users (CONFIG_USER_NS=y too no restrictions on it’s usage), so the põrnikas tin transportation away hold out exploited to gain source privileges past times an unprivileged user. All of the exploitation steps below are performed from within a user namespace.

The Linux kernel has support for a few hardening features that make exploitation more difficult. KASLR (Kernel Address Space Layout Randomization) puts the kernel text at a random offset to make jumping to a particular fixed address useless. SMEP (Supervisor Mode Execution Protection) causes an oops whenever the kernel tries to execute code from userspace memory, and SMAP (Supervisor Mode Access Prevention) does the same whenever the kernel tries to access userspace memory directly.
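On x86 Linux, CPU support for SMEP and SMAP shows up as the smep and smap words on the flags line of /proc/cpuinfo. As a small sketch, the helper below checks such a line for a whole-word flag. The function name has_cpu_flag is hypothetical, and the check only says that the CPU advertises the feature. KASLR is a kernel feature and is not visible this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `flag` appears as a whole whitespace-separated word
 * in `flags_line` (for example the "flags" line of /proc/cpuinfo). */
static bool has_cpu_flag(const char *flags_line, const char *flag)
{
    size_t len = strlen(flag);
    assert(len > 0);

    for (const char *p = strstr(flags_line, flag); p != NULL;
         p = strstr(p + 1, flag)) {
        bool starts_word = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_word = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts_word && ends_word)
            return true;
    }
    return false;
}
```

The whole-word match matters because flag names can be prefixes of each other: searching for "sme" must not match "smep".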

Shaping heap


The idea of the exploit is to use the heap out-of-bounds write to overwrite a function pointer in the memory adjacent to the overflown block. For that we need to specifically shape the heap, so that some object with a triggerable function pointer is placed right after a ring buffer block. I chose the already mentioned packet_sock struct to be this object. We need to find a way to make the kernel allocate a ring buffer block and a packet_sock struct one next to the other.

As I mentioned above, ring buffer blocks are allocated with the kernel page allocator (buddy allocator). It allocates blocks of 2^n contiguous memory pages. The allocator keeps a freelist of such blocks for each n and returns the freelist head when a block is requested. If the freelist for some n is empty, it finds the first m > n for which the freelist is not empty and splits that block in halves until the required size is reached. Therefore, if we start repeatedly allocating blocks of size 2^n, at some point they will start coming from one high-order memory block being split, and they will be adjacent to one another.
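The splitting behaviour can be illustrated with a toy model. The following is a minimal sketch of the freelist logic described above, not the kernel's implementation: the struct, the function names, the number of orders and the fixed-size lists are all assumptions made for the example, and freeing or merging blocks is left out.

```c
#include <assert.h>

#define MAX_ORDER 11   /* orders 0..10, a toy assumption */
#define LIST_CAP  64   /* capacity of each toy freelist */

struct buddy {
    /* First page number of every free block, one list per order. */
    unsigned long blocks[MAX_ORDER][LIST_CAP];
    int count[MAX_ORDER];
};

static void buddy_push(struct buddy *b, int order, unsigned long pfn)
{
    assert(b->count[order] < LIST_CAP);
    b->blocks[order][b->count[order]++] = pfn;
}

/* Returns the first page number of a block of 2^order pages,
 * or -1 if no free block is large enough. */
static long buddy_alloc(struct buddy *b, int order)
{
    int m = order;

    while (m < MAX_ORDER && b->count[m] == 0)
        m++;                              /* first non-empty freelist */
    if (m == MAX_ORDER)
        return -1;

    unsigned long pfn = b->blocks[m][--b->count[m]];
    while (m > order) {
        m--;
        /* Split in halves: keep the lower half, free the upper one. */
        buddy_push(b, m, pfn + (1UL << m));
    }
    return (long)pfn;
}
```

Starting from a single free order-10 block at page 0, repeated buddy_alloc(&b, 3) calls return pages 0, 8, 16, 24 and so on, so consecutive allocations are physically adjacent.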

A packet_sock is allocated via the kmalloc() function by the slab allocator. The slab allocator is mostly used to allocate objects smaller than one page. It uses the page allocator to allocate a big block of memory and splits this block into smaller objects. The big blocks are called slabs, hence the name of the allocator. A set of slabs together with their current state and a set of operations like “allocate an object” and “free an object” is called a cache. The slab allocator creates a set of general purpose caches for objects of size 2^n. Whenever kmalloc(size) is called, the slab allocator rounds size up to the nearest power of 2 and uses the cache of that size.

Since the kernel uses kmalloc() all the time, if we try to allocate an object it will most likely come from one of the slabs already created during previous usage. However, if we keep allocating objects of the same size, at some point the slab allocator will run out of slabs for this size and will have to allocate another one via the page allocator.

The size of a newly allocated slab depends on the size of the objects this slab is meant for. The size of the packet_sock struct is 1920, and 1024 < 1920 <= 2048, which means that it'll be rounded up to 2048 and the kmalloc-2048 cache will be used. It turns out that for this particular cache the SLUB allocator (which is the kind of slab allocator used in Ubuntu) uses slabs of size 0x8000. So whenever the allocator runs out of slabs for the kmalloc-2048 cache, it allocates 0x8000 bytes with the page allocator.
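The rounding rule is easy to check by hand. The sketch below models only the power-of-two rule described above. The helper names and the smallest cache size of 8 bytes are assumptions for illustration, and real kernels also provide a few non-power-of-two caches (such as kmalloc-96 and kmalloc-192) that this model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds an allocation size up to the next power of two, which is the
 * size of the general purpose cache in this simplified model. */
static size_t kmalloc_cache_size(size_t size)
{
    assert(size > 0);
    size_t cache = 8; /* smallest cache size assumed for this sketch */
    while (cache < size)
        cache <<= 1;
    return cache;
}

/* Number of objects of the given size that fit into one slab. */
static size_t objects_per_slab(size_t slab_bytes, size_t object_size)
{
    return slab_bytes / kmalloc_cache_size(object_size);
}
```

For the 1920-byte packet_sock this gives a 2048-byte cache, and a 0x8000-byte slab then holds 16 objects.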

Keeping all that in mind, this is how we can allocate a kmalloc-2048 slab next to a ring buffer block:
  1. Allocate a lot (512 worked for me) of objects of size 2048 to fill the currently existing slabs in the kmalloc-2048 cache. To do that we can create a bunch of packet sockets to cause allocation of packet_sock structs.
  2. Allocate a lot (1024 worked for me) of page blocks of size 0x8000 to drain the page allocator freelists and cause some high-order page block to be split. To do that we can create another packet socket and attach a ring buffer with 1024 blocks of size 0x8000.
  3. Create a packet socket and attach a ring buffer with blocks of size 0x8000. The last one of these blocks (I'm using 2 blocks, the reason is explained below) is the one we're going to overflow.
  4. Create a bunch of packet sockets to allocate packet_sock structs and cause an allocation of at least one new slab.
This way we can shape the heap so that a newly allocated kmalloc-2048 slab holding packet_sock structs lands right after the ring buffer block we are going to overflow.



The exact number of allocations needed to drain the freelists and shape the heap the way we want might be different for different setups and depends on the memory usage activity. The numbers above are for a mostly idle Ubuntu machine.
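For reference, the allocations in steps 2 and 3 go through the regular, documented ring buffer API. The sketch below shows how a well-formed TPACKET_V3 request for blocks of a given size could be built and attached. The helper names, the frame size and the retire timeout are assumptions chosen for illustration, and tp_sizeof_priv is left at 0, so this is ordinary API usage rather than anything specific to the bug.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Fills a TPACKET_V3 request for `block_nr` blocks of `block_size` bytes. */
static void fill_ring_request(struct tpacket_req3 *req, unsigned int block_size,
                              unsigned int frame_size, unsigned int block_nr)
{
    assert(frame_size != 0 && block_size % frame_size == 0);
    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_frame_size = frame_size;
    req->tp_block_nr = block_nr;
    /* The kernel expects frame_nr == frames per block * number of blocks. */
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    req->tp_retire_blk_tov = 100; /* block retire timeout in ms, arbitrary */
    req->tp_sizeof_priv = 0;      /* no per-block private area */
}

/* Selects TPACKET_V3 and attaches an RX ring to a packet socket.
 * Returns 0 on success and -1 on error (errno is set by setsockopt). */
static int attach_rx_ring(int fd, const struct tpacket_req3 *req)
{
    int version = TPACKET_V3;

    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0)
        return -1;
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

The socket itself would come from socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)), which needs CAP_NET_RAW. A request for 1024 blocks of 0x8000 bytes makes the kernel allocate 1024 order-3 page blocks, which is what step 2 relies on.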

Controlling the overwrite


Above I explained that the bug results in a write of a controlled maximum size at a controlled offset out of the bounds of a ring buffer block. It turns out that not only can we control the maximum size and offset, we can actually control the exact data (and its size) that's being written. Since the data that's being stored in a ring buffer block is the packet that's passing through a particular network interface, we can manually send packets with arbitrary content on a raw socket through the loopback interface. If we're doing that in an isolated network namespace, no external traffic will interfere.

There are a few caveats though.

First, it seems that the size of a packet must be at least 14 bytes (apparently 12 bytes for two MAC addresses and 2 bytes for the EtherType) for it to be passed to the packet socket layer. That means that we have to overwrite at least 14 bytes. The data in the packet itself can be arbitrary.

Then, the lowest 3 bits of nxt_offset always have the value of 2 due to the alignment. That means that we can't start overwriting at an 8-byte aligned offset.

Besides that, when a packet is being received and saved into a block, the kernel updates some fields in the block and frame headers. If we point nxt_offset to some particular offset we want to overwrite, some data where the block and frame headers end up will probably be corrupted.

Another issue is that if we make nxt_offset point past the block end, the first block will be immediately closed when the first packet is being received, since the kernel will (correctly) decide that there's no space left in the first block (see the __packet_lookup_frame_in_block() snippet). This is not really an issue, since we can create a ring buffer with 2 blocks. The first one will be closed, the second one will be overflown.

Executing code


Now, we need to figure out which function pointers to overwrite. There are a few function pointer fields in the packet_sock struct, but I ended up using the following two:
  1. packet_sock->xmit
  2. packet_sock->rx_ring->prb_bdqc->retire_blk_timer->func

The first one is called whenever a user tries to send a packet via a packet socket. The usual way to elevate privileges to root is to execute the commit_creds(prepare_kernel_cred(0)) payload in a process context. The xmit pointer is called from a process context, which means we can simply point it to the executable memory region that contains the payload.

To do that we need to put our payload into some executable memory region. One of the possible ways to do that is to put the payload in userspace, either by mmapping an executable memory page or by just defining a global function within our exploit program. However, SMEP and SMAP will prevent the kernel from accessing and executing user memory directly, so we need to deal with them first.

For that I used the retire_blk_timer field (the same field used by Philip Pettersson in his [...] dmesg_restrict and it restricts the ability of unprivileged users to read the kernel syslog. It should be noted that even with dmesg restricted, the first user on Ubuntu can still read the syslog from /var/log/kern.log and /var/log/syslog since he belongs to the adm group.

Another feature is called kptr_restrict and it doesn't allow unprivileged users to see pointers printed by the kernel with the %pK format specifier. However, in 4.8 the free_reserved_area() function uses %p, so kptr_restrict doesn't help in this case. In 4.10 free_reserved_area() was fixed not to print address ranges at all, but the change was not backported to older kernels.

Fix


Let's take a look at the fix. The vulnerable code as it was before the fix is below. Remember that the user fully controls both tp_block_size and tp_sizeof_priv.

        if (po->tp_version >= TPACKET_V3 &&
            (int)(req->tp_block_size -
                  BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                goto out;

When thinking about a way to fix this, the first idea that comes to mind is that we can compare the two values as is, without that weird conversion to int:

        if (po->tp_version >= TPACKET_V3 &&
            req->tp_block_size <=
                  BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv))
                goto out;

Funny enough, this doesn't actually help. The reason is that an overflow can happen while evaluating BLK_PLUS_PRIV in case tp_sizeof_priv is close to the unsigned int maximum value.

#define BLK_PLUS_PRIV(sz_of_priv) \
        (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))

One of the ways to fix this overflow is to cast tp_sizeof_priv to uint64 before passing it to BLK_PLUS_PRIV. That's exactly what I did in the fix that was sent upstream.

        if (po->tp_version >= TPACKET_V3 &&
            req->tp_block_size <=
                  BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
                goto out;
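The difference between the three variants of the check can be reproduced in userspace. The sketch below is an illustration under assumptions: V3_ALIGNMENT is taken to be 8 and BLK_HDR_LEN to be 48, and the function names are hypothetical. Each function returns true when the request would be rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration. */
#define V3_ALIGNMENT 8u
#define BLK_HDR_LEN  48u

/* BLK_PLUS_PRIV() evaluated in 32-bit arithmetic: may wrap around. */
static uint32_t blk_plus_priv32(uint32_t priv)
{
    return BLK_HDR_LEN + ((priv + V3_ALIGNMENT - 1) & ~(V3_ALIGNMENT - 1));
}

/* BLK_PLUS_PRIV() evaluated in 64-bit arithmetic: cannot wrap for 32-bit input. */
static uint64_t blk_plus_priv64(uint64_t priv)
{
    return BLK_HDR_LEN + ((priv + V3_ALIGNMENT - 1) & ~(uint64_t)(V3_ALIGNMENT - 1));
}

/* Original check: the difference is reinterpreted as a signed int. */
static bool rejects_original(uint32_t block_size, uint32_t priv)
{
    return (int32_t)(block_size - blk_plus_priv32(priv)) <= 0;
}

/* Naive fix: unsigned comparison, but still 32-bit arithmetic. */
static bool rejects_naive(uint32_t block_size, uint32_t priv)
{
    return block_size <= blk_plus_priv32(priv);
}

/* Upstream-style fix: widen to 64 bits before the addition. */
static bool rejects_fixed(uint32_t block_size, uint32_t priv)
{
    return block_size <= blk_plus_priv64(priv);
}
```

With a block size of 0x1000, the original check lets a private size of 0x80001000 through because the difference is a positive int. The naive check rejects that value but lets 0xFFFFFFF0 through, since the 32-bit sum wraps to 0x20. The 64-bit variant rejects both and still accepts a small value such as 16.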

Mitigation


Creating a packet socket requires the CAP_NET_RAW privilege, which can be acquired by an unprivileged user inside a user namespace. Unprivileged user namespaces expose a huge kernel attack surface, which has resulted in quite a few exploitable vulnerabilities (CVE-2017-7184, CVE-2016-8655, ...). This kind of kernel vulnerability can be mitigated by completely disabling user namespaces or by disallowing their use by unprivileged users.

To disable user namespaces completely you can rebuild your kernel with CONFIG_USER_NS disabled. Restricting user namespace usage to privileged users only can be done by writing 0 to /proc/sys/kernel/unprivileged_userns_clone in Debian-based kernels. Since version 4.9 the upstream kernel has a similar /proc/sys/user/max_user_namespaces setting.
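Whether a machine already has one of these restrictions in place can be checked by reading the two files. The following is a minimal sketch, assuming that a value of 0 in either file means unprivileged user namespaces cannot be created. The function names are hypothetical, and either file may be absent depending on the kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Reads a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_long_from_file(const char *path, long *value)
{
    assert(path != NULL && value != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1; /* the knob may not exist on this kernel */
    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if user namespaces look restricted, 0 if they look
 * available, and -1 if neither setting could be read. */
static int userns_restricted(const char *clone_path, const char *max_path)
{
    long clone = 0, max = 0;
    int have_clone = read_long_from_file(clone_path, &clone) == 0;
    int have_max = read_long_from_file(max_path, &max) == 0;

    if ((have_clone && clone == 0) || (have_max && max == 0))
        return 1;
    if (!have_clone && !have_max)
        return -1;
    return 0;
}
```

A typical call would be userns_restricted("/proc/sys/kernel/unprivileged_userns_clone", "/proc/sys/user/max_user_namespaces"). The paths are parameters so that the logic can be exercised against ordinary files as well.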

Conclusion


Right now the Linux kernel has a huge number of poorly tested (from a security standpoint) interfaces, and a lot of them are enabled and exposed to unprivileged users in popular Linux distributions like Ubuntu. This is obviously not good, and they need to be tested or restricted.

Syzkaller is an amazing tool for testing kernel interfaces via fuzzing. Even adding barebone descriptions for another syscall usually uncovers a number of bugs. We certainly need people writing syscall descriptions and fixing existing ones, since there's a huge surface that's still not covered and probably a ton of security bugs buried in the kernel. If you decide to contribute, we'll be glad to see a pull request.

Links


Just a bunch of related links.


Our Linux kernel bug finding tools:

A collection of Linux kernel exploitation materials: https://github.com/xairy/linux-kernel-exploitation
