Solving an Android Crackme with a Little Symbolic Execution

Binary analysis frameworks provide you powerful ways of automating tasks that would be almost impossible to complete manually. In this blog, we'll have a look at Angr, a Python framework for analyzing binaries that is useful for both static and dynamic symbolic ("concolic") analysis. Angr operates on the VEX intermediate language of Valgrind fame. It comes with a loader dubbed "CLE Loads Everything", which isn't entirely accurate, but it does load ELF/ARM executables, so it is good enough for dealing with Android native libraries.

Our target program is a simple license key validation program. Granted, you won't often find something like this in the Play Store, but it should be useful enough to demonstrate the basics of symbolic analysis, and hopefully induce that awe-inspiring moment where one realizes how awesome that new concept they just fully understood actually is. You can use these techniques in many creative ways on obfuscated Android binaries (you'll find that obfuscated code is often put into native libraries to make it more diffcult to tackle).

Symbolic Execution

In the late 2000s, symbolic-execution based testing has gained popularity as a means of identifying security vulnerabilities. Symbolic "execution" actually refers to the process of representing possible paths through a program as formulas in first-order logic, whereby variables are represented by symbolic values. By using SMT solvers to verify satisfyability and providing solutions to those formulas, we can obain the concrete values needed to reach any (reachable) point of execution.

Very simply put, the process works as follows:

  1. Translate a path through a program into a logical formula, whereby some of the state is represented symbolically.
  2. Solve the formula.
  3. Profit!!

This was the reductionist account - in reality things are a bit more complex. The execution engine first enumerates many possible paths through the program - every branch is taken and not taken at the same time (not literally as in Schrödinger's cat, but in a beautifully abstract way). For each branch taken, the engine saves the contstraints imposed by the branch condition on the symbolic variable the branch depends on. You'll end up with a large number of "path formulas" - each of those describes a possible path through the program. Pick your preferred path and solve the associated formula, and voila! You get the input variables required to cover that exact path.

However, solving the formula (or even determining it's satisfiability) is actually the hard part. To understand how this works, let's recall Boolean satisfiability (SAT) problems. SAT is the problem of determining whether a propositional logic formula, such as (x1 ∨ ¬x2) ∧ (¬x1 ∨ x2 ∨ x3), is satisfiable (meaning, is it possible to generate a true result given the right inputs). Propositional logic is however insufficient for encoding all possible constraints occuring in our programs: After all, branch decisions could depend on pretty complex relations between symbolic values.

We therefore need to extend the Boolean SAT instance to a satisfiability modulo theory (SMT) instance. SMT allows us to replace some of the binary variables in the SAT formula with predicates over a suitable set of non-binary variables. The output of each predicate is a binary value. A predicate in linear algebra might for example be "2x + 3y > 1". So, for example, a particular branch might be taken when "2x - 3y > 1" is satisfied (x and y are symbolic variables). Makes sense right?

Every path formula is represented as SMT problem. The SAT solver responsible for cracking the problem simply passes conjunctions of theory predicates to specialized solvers for the respective theories, such as linear arithmetic, nonlinear arithmetic, and bitvectors. Ultimately, the problem is reduced to a plain old Boolean SAT instance that the SAT solver can handle.

If you, like me, preferred building custom pinball tables on the school desk and/or writing fantasy novels during math classes, this "excourse" might have been a bit overwhelming. I promise however that these are highly fascinating topics, and I'm including some links on mathematical logic and symbolic execution below.

A Usage Example

Amongst many other things, symbolic execution is useful in cases where we need to find the right inputs for reaching a certain block of code. In the following example, we'll use Angr to solve a simple Android crackme in an automated fashion. The crackme takes the form of a native ELF binary that can be downloaded here:

https://github.com/angr/angr-doc/tree/master/examples/android_arm_license_validation

Installing Angr

Angr is written in Python 2 and available from PyPI. It is easy to install on *nix operating systems and Mac OS using pip:

$ pip install angr


It is recommended to create a dedicated virtual environment with Virtualenv as some of its dependencies contain forked versions Z3 and PyVEX that overwrite the original versions (you may skip this step if you don't use these libraries for anything else - on the other hand, using Virtualenv is always a good idea).

Quite comprehensive documentation for angr is available on Gitbooks, including an installation guide, tutorials and usage examples. A complete API reference is also available.

Running the executable on any Android device should give you the following output.

$ adb push validate /data/local/tmp
[100%] /data/local/tmp/validate
$ adb shell chmod 755 /data/local/tmp/validate
$ adb shell /data/local/tmp/validate
Usage: ./validate <serial>
$ adb shell /data/local/tmp/validate 12345
Incorrect serial (wrong format).


So far, so good, but we really know nothing about how a valid license key might look like. Where do we start? Let's fire up IDA Pro to get a first good look at what is happening.

The main function is located at address 0x1874 in the disassembly (note that this is a PIE-enabled binary, and IDA Pro chooses 0x0 as the image base address). Function names have been stripped, but luckily we can see some references to debugging strings: It appears that the input string is base32-decoded (call to sub_1340). At the beginning of main, there's also a length check at loc_1898 that verifies that the length of the input string is exactly 16. So we're looking for a 16 character base32-encoded string! The decoded input is then passed to the function sub_1760, which verifies the validity of the license key.

The 16-character base32 input string decodes to 10 bytes, so we know that the validation function expects a 10 byte binary string. Next, we have a look at the core validation function at 0x1760:

.text:00001760 ; =============== S U B R O U T I N E =======================================
.text:00001760
.text:00001760 ; Attributes: bp-based frame
.text:00001760
.text:00001760 sub_1760                                ; CODE XREF: sub_1874+B0
.text:00001760
.text:00001760 var_20          = -0x20
.text:00001760 var_1C          = -0x1C
.text:00001760 var_1B          = -0x1B
.text:00001760 var_1A          = -0x1A
.text:00001760 var_19          = -0x19
.text:00001760 var_18          = -0x18
.text:00001760 var_14          = -0x14
.text:00001760 var_10          = -0x10
.text:00001760 var_C           = -0xC
.text:00001760
.text:00001760                 STMFD   SP!, {R4,R11,LR}
.text:00001764                 ADD     R11, SP, #8
.text:00001768                 SUB     SP, SP, #0x1C
.text:0000176C                 STR     R0, [R11,#var_20]
.text:00001770                 LDR     R3, [R11,#var_20]
.text:00001774                 STR     R3, [R11,#var_10]
.text:00001778                 MOV     R3, #0
.text:0000177C                 STR     R3, [R11,#var_14]
.text:00001780                 B       loc_17D0
.text:00001784 ; ---------------------------------------------------------------------------
.text:00001784
.text:00001784 loc_1784                                ; CODE XREF: sub_1760+78
.text:00001784                 LDR     R3, [R11,#var_10]
.text:00001788                 LDRB    R2, [R3]
.text:0000178C                 LDR     R3, [R11,#var_10]
.text:00001790                 ADD     R3, R3, #1
.text:00001794                 LDRB    R3, [R3]
.text:00001798                 EOR     R3, R2, R3    ; Aha! You're XOR-ing a byte with the byte next to it. In a loop! You bastard.
.text:0000179C                 AND     R2, R3, #0xFF
.text:000017A0                 MOV     R3, #0xFFFFFFF0
.text:000017A4                 LDR     R1, [R11,#var_14]
.text:000017A8                 SUB     R0, R11, #-var_C
.text:000017AC                 ADD     R1, R0, R1
.text:000017B0                 ADD     R3, R1, R3
.text:000017B4                 STRB    R2, [R3]
.text:000017B8                 LDR     R3, [R11,#var_10]
.text:000017BC                 ADD     R3, R3, #2
.text:000017C0                 STR     R3, [R11,#var_10]
.text:000017C4                 LDR     R3, [R11,#var_14]
.text:000017C8                 ADD     R3, R3, #1
.text:000017CC                 STR     R3, [R11,#var_14]
.text:000017D0
.text:000017D0 loc_17D0                                ; CODE XREF: sub_1760+20
.text:000017D0                 LDR     R3, [R11,#var_14]
.text:000017D4                 CMP     R3, #4
.text:000017D8                 BLE     loc_1784
.text:000017DC                 LDRB    R4, [R11,#var_1C] ; Now you're comparing the xor-ed bytes with values retrieved from - somewhere...
.text:000017E0                 BL      sub_16F0
.text:000017E4                 MOV     R3, R0
.text:000017E8                 CMP     R4, R3
.text:000017EC                 BNE     loc_1854
.text:000017F0                 LDRB    R4, [R11,#var_1B]
.text:000017F4                 BL      sub_170C
.text:000017F8                 MOV     R3, R0
.text:000017FC                 CMP     R4, R3
.text:00001800                 BNE     loc_1854
.text:00001804                 LDRB    R4, [R11,#var_1A]
.text:00001808                 BL      sub_16F0
.text:0000180C                 MOV     R3, R0
.text:00001810                 CMP     R4, R3
.text:00001814                 BNE     loc_1854
.text:00001818                 LDRB    R4, [R11,#var_19]
.text:0000181C                 BL      sub_1728
.text:00001820                 MOV     R3, R0
.text:00001824                 CMP     R4, R3
.text:00001828                 BNE     loc_1854
.text:0000182C                 LDRB    R4, [R11,#var_18]
.text:00001830                 BL      sub_1744
.text:00001834                 MOV     R3, R0
.text:00001838                 CMP     R4, R3
.text:0000183C                 BNE     loc_1854
.text:00001840                 LDR     R3, =(aProductActivat - 0x184C)  ; This is where we want to be!
.text:00001844                 ADD     R3, PC, R3      ; "Product activation passed. Congratulati"...
.text:00001848                 MOV     R0, R3          ; char *
.text:0000184C                 BL      puts
.text:00001850                 B       loc_1864
.text:00001854 ; ---------------------------------------------------------------------------
.text:00001854
.text:00001854 loc_1854                                ; CODE XREF: sub_1760+8C
.text:00001854                                         ; sub_1760+A0j ...
.text:00001854                 LDR     R3, =(aIncorrectSer_0 - 0x1860) ; This is where we DON'T wanna be!
.text:00001858                 ADD     R3, PC, R3      ; "Incorrect serial."
.text:0000185C                 MOV     R0, R3          ; char *
.text:00001860                 BL      puts
.text:00001864
.text:00001864 loc_1864                                ; CODE XREF: sub_1760+F0
.text:00001864                 SUB     SP, R11, #8
.text:00001868                 LDMFD   SP!, {R4,R11,PC}
.text:00001868 ; End of function sub_1760

We can see a loop with some XOR-magic happening at loc_1784, which supposedly decodes the input string. Starting from loc_17DC, we see a series of comparisons of the decoded values with values obtained from further sub-function calls. Even though this doesn't look like highly sophisticated stuff, we'd still need to do some more analysis to completely reverse this check and generate a license key that passes it. But now comes the twist: By using dynamic symbolic execution, we don't actually have to do any further analysis! The symbolic execution engine can map a path between the first instruction of the license check (0x1760) and the code printing the "Product activation passed" message (0x1840) and determine the constraints on each byte of the input string. The solver engine then finds the input values that satisfy those constraints: The valid license key.

We only need to provide several inputs to the symbolic execution engine:

* The address to start execution from. We initialize the state with the first instruction of the serial validation function. This makes the task significantly easier (and in this case, almost instant) to solve, as we avoid symbolically executing the Base32 implementation.

* The address of the code block we want execution to reach. In this case, we want to find a path to the code responsible for printing the "Product activation passed" message. This block starts at 0x1840.

* Addresses we don't want to reach. In this case, we're not interesting in any path that arrives at the block of code printing the "Incorrect serial" message, at 0x1854.

Note that Angr loader will load the PIE executable with a base address of 0x400000, so we have to add this to the addresses above. The solution looks as follows.

#!/usr/bin/python

# This is how we defeat the Android license check using Angr!
# The binary is available for download on GitHub:
# https://github.com/b-mueller/obfuscation-metrics/tree/master/crackmes/android/01_license_check_1
# Written by Bernhard -- bernhard [dot] mueller [at] owasp [dot] org

import angr
import claripy
import base64

load_options = {}

# Android NDK library path:
load_options['custom_ld_path'] = ['/Users/berndt/Tools/android-ndk-r10e/platforms/android-21/arch-arm/usr/lib']

b = angr.Project("./validate", load_options = load_options)

# The key validation function starts at 0x401760, so that's where we create the initial state.
# This speeds things up a lot because we're bypassing the Base32-encoder.

state = b.factory.blank_state(addr=0x401760)

initial_path = b.factory.path(state)
path_group = b.factory.path_group(state)

# 0x401840 = Product activation passed
# 0x401854 = Incorrect serial

path_group.explore(find=0x401840, avoid=0x401854)
found = path_group.found[0]

# Get the solution string from *(R11 - 0x24).

addr = found.state.memory.load(found.state.regs.r11 - 0x24, endness='Iend_LE')
concrete_addr = found.state.se.any_int(addr)
solution = found.state.se.any_str(found.state.memory.load(concrete_addr,10))

print base64.b32encode(solution)

Note the last part of the program where the final input string is obtained - it appears if we were simply reading the solution from memory. We are however reading from symbolic memory - neither the string nor the pointer to it actually exist! What's really happening is that the solver is computing possible concrete values that could be found at that program state, would we observer the actual program run to that point.

Running the script should return the following:

(angr) $ python solve.py
WARNING | 2017-01-09 17:17:03,664 | cle.loader | The main binary is a position-independent executable. It is being loaded with a base address of 0x400000.
JQAE6ACMABNAAIIA

The final license key! Inputting this into the program should yield a success message.

TL/DR

Symbolic execution is a great technique with many applications in vulnerability discovery, de-obfuscation and reverse engineering. Check the links below for more information.

This article is part of the Mobile Reverse Engineering Unleashed series. Click the blue label on the top of this page to list orther articles in this series.

About the OWASP Mobile Security Testing Guide

I wrote this howto for the OWASP Mobile Security Testing Guide (MSTG), a manual for testing the security of mobile apps. The MSTG is an open source effort and we welcome contributions and feedback. To discuss and contribute, join the OWASP Mobile Security Project Slack Channel. You can sign up here:

http://owasp.herokuapp.com/

Also, check out the mobile crackmes we developed for the guide!

References

If you know of any other great resources, let us know in the comments!

About the Author

Bernhard Mueller is a full-stack hacker, security researcher, and winner of BlackHat's Pwnie Award.

Follow him on Twitter: @muellerberndt