Taint analysis for protocol reverse engineering · Case study

Context

Inside a defense R&D team, we needed to reverse-engineer a large and constantly growing number of proprietary network protocols, to understand how their messages are structured and what they actually contain. The work was essential, but painfully manual.

The problem

Reverse-engineering each protocol by hand was slow, tedious and didn't scale with the flow of new applications to study.

Manually dissecting proprietary protocols took a long time and was highly repetitive.
The volume of new applications made exhaustive analysis impossible.
Protocol complexity, on top of encryption and compression, meant manual analysis often missed elements of interest.

Objectives

Drastically reduce the time needed to reverse-engineer a protocol.
Increase the number of applications the team could realistically analyze.
Reliably surface fields and structure that manual inspection tends to miss.

My approach

I ran this as a research and development project, validating the idea on real targets before investing further.

1. R&D framing

Defined the hypothesis: observing an application at runtime should expose how each byte it sends over the network was produced.

2. Proof of concept

Built a working PoC on top of Valgrind and Intel PIN to instrument target processes and track data flow.

3. Testing and validation

Had the team run the PoC against known protocols, then validated the reconstructed structures against ground truth.

The technical solution

Observing an application while it runs reveals a great deal about the network packets it emits. I built a taint-analysis engine on Valgrind that:

Flags the functions and syscalls of interest: file reads, network I/O, user input, compression and encryption routines, and so on.
Taints the data touched by those functions, then follows it throughout the entire process execution.
When a byte is finally written to the network, reconstructs the exact path that byte took to get there.

Under the hood, Valgrind never runs the target directly: it disassembles the guest machine code into VEX, its RISC-like intermediate representation, instruments that IR superblock by superblock, then recompiles it. The tool hooks into that pass and keeps a shadow value, a taint tag, for every guest register, every VEX temporary (IRTemp) and every byte of memory, much as Memcheck shadows definedness, except the tag here records data origin.

Propagation follows the IR data flow: Get/Put move tags between guest registers and temporaries, Load/Store move them between memory and temporaries, and arithmetic or logical operations (the Iop_Add* and Iop_Xor* families, shifts, widening and narrowing) set a result's tag to the union of its operands' tags. A tainted byte therefore stays tainted across copies, computations and buffer boundaries, in registers as well as in memory.
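To make these rules concrete, here is a minimal sketch in Python over a made-up, heavily simplified statement format. It is not Valgrind's API and not the PoC's code: the tuple layout, the `Shadow` class, the register offsets 16 and 24 and the addresses are all assumptions chosen for illustration. It also collapses a multi-byte load into a single tag set, which is coarser than true per-byte tracking.

```python
from dataclasses import dataclass, field


@dataclass
class Shadow:
    """Taint tags for guest registers, IR temporaries and memory bytes."""
    regs: dict = field(default_factory=dict)   # guest register offset -> set of tags
    temps: dict = field(default_factory=dict)  # temporary name -> set of tags
    mem: dict = field(default_factory=dict)    # byte address -> set of tags


def step(shadow, stmt):
    """Apply one simplified IR statement to the shadow state."""
    op = stmt[0]
    if op == "Get":      # temporary <- guest register
        _, dst, reg = stmt
        shadow.temps[dst] = set(shadow.regs.get(reg, ()))
    elif op == "Put":    # guest register <- temporary
        _, reg, src = stmt
        shadow.regs[reg] = set(shadow.temps.get(src, ()))
    elif op == "Load":   # temporary <- `size` bytes of memory (tags are merged)
        _, dst, addr, size = stmt
        tags = set()
        for a in range(addr, addr + size):
            tags |= shadow.mem.get(a, set())
        shadow.temps[dst] = tags
    elif op == "Store":  # `size` bytes of memory <- temporary
        _, addr, size, src = stmt
        for a in range(addr, addr + size):
            shadow.mem[a] = set(shadow.temps.get(src, ()))
    elif op == "Binop":  # temporary <- lhs OP rhs: union of the operands' tags
        _, dst, _opname, lhs, rhs = stmt
        shadow.temps[dst] = shadow.temps.get(lhs, set()) | shadow.temps.get(rhs, set())
    else:
        raise ValueError(f"unknown statement {op!r}")


# Four bytes read from a file sit at 0x1000; a register holds a time value.
shadow = Shadow()
for address in range(0x1000, 0x1004):
    shadow.mem[address] = {"file"}
shadow.regs[16] = {"time"}

program = [
    ("Load", "t0", 0x1000, 4),
    ("Get", "t1", 16),
    ("Binop", "t2", "Iop_Xor32", "t0", "t1"),
    ("Put", 24, "t2"),
    ("Store", 0x2000, 4, "t2"),
]
for statement in program:
    step(shadow, statement)

print(sorted(shadow.mem[0x2000]))  # ['file', 'time']
```

The bytes stored at 0x2000 carry both origins because the XOR result inherits the union of its operands' tags, and the copy through a register and back to memory preserves it.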

Taint is introduced at sources, the intercepted syscalls and library calls that bring outside data in: read/recv, open, time, environment and terminal input. Routines such as zlib's deflate/inflate and OpenSSL's EVP_* family are hooked as well, so that they show up as named steps in a byte's path. Taint is read back at sinks, the network writes (send, sendto, write on a socket), where each emitted byte still carries the tags of everything that produced it. The PoC ran on two backends, Valgrind/VEX and Intel PIN, to cross-check results and widen coverage.
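The source-to-sink idea can be modelled at a much higher level than the instrumented binary. The sketch below is a toy, not the PoC: `TaintedBytes`, `source`, `xor` and `send_sink` are hypothetical helpers, and the `getenv:KEY` label stands in for an arbitrary environment input.

```python
class TaintedBytes:
    """A byte string paired with one set of origin labels per byte."""

    def __init__(self, data, tags):
        if len(data) != len(tags):
            raise ValueError("need exactly one tag set per byte")
        self.data = bytes(data)
        self.tags = [frozenset(t) for t in tags]

    def __add__(self, other):
        # Concatenation keeps each byte's own tags.
        return TaintedBytes(self.data + other.data, self.tags + other.tags)


def source(label, data):
    """Model a source: every incoming byte is tagged with the source's label."""
    return TaintedBytes(data, [{label}] * len(data))


def xor(a, b):
    """A byte-wise transformation: each output byte unions its inputs' tags."""
    return TaintedBytes(
        bytes(x ^ y for x, y in zip(a.data, b.data)),
        [ta | tb for ta, tb in zip(a.tags, b.tags)],
    )


def send_sink(buf):
    """Model a sink: list (offset, byte value, origins) for each emitted byte."""
    return [
        (offset, byte, sorted(tags))
        for offset, (byte, tags) in enumerate(zip(buf.data, buf.tags))
    ]


stamp = source("time", b"\x65\x9c")
file_bytes = source("read:/tmp/report.bin", b"\x10\x20")
key = source("getenv:KEY", b"\x01\x02")  # hypothetical environment input

packet = stamp + xor(file_bytes, key)
report = send_sink(packet)
for offset, byte, origins in report:
    print(f"{offset:#06x}  {byte:02X}  {', '.join(origins)}")
```

At the sink, the first two bytes report only `time`, while the last two report both the file and the key, which is the union behaviour described above.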

Independent sources (file, time, counter) flow through transformations and converge at send(): the emitted packet is the union of all tracked origins.

Concretely: for an application that reads a file, compresses it and sends it over the network, the tool pinpoints which bytes in the outgoing packet came from the file, and through which transformations.
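One plausible way to turn that per-byte answer into fields is to merge consecutive bytes that share the same chain of calls. The sketch below assumes each emitted byte has already been annotated with such a chain; `group_fields` and the sample chains are illustrative, not taken from the tool.

```python
from itertools import groupby


def group_fields(chains):
    """Merge consecutive bytes with identical chains into (offset, length, chain)."""
    fields = []
    offset = 0
    for chain, run in groupby(chains):
        length = len(list(run))
        fields.append((offset, length, chain))
        offset += length
    return fields


# Hypothetical per-byte chains for a 14-byte packet.
counter = ("seq++", "htonl()")
clock = ("time()", "htonl()")
payload = ("read()", "deflate()", "send()")
chains = [counter] * 4 + [clock] * 4 + [payload] * 6

layout = group_fields(chains)
for offset, length, chain in layout:
    print(f"{offset:#06x}  {length:>2} B  {' -> '.join(chain)}")
```

A limitation of this simple grouping is that two adjacent fields produced by exactly the same chain would be merged into one, so a real tool would need extra signals (for example, separate call instances) to split them.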

Captured packet · Taint trace (data provenance)

0x0000 · Counter · 4 B · 00 00 04 2A
seq++ (process state) → htonl() (arpa/inet)

0x0004 · Timestamp · 4 B · 65 9C 3F 80
time() (syscall) → htonl() (arpa/inet)

0x0008 · Length · 4 B · 00 00 12 0C
open("/tmp/report.bin") (libc) → read() (syscall) → deflate() (zlib) → EVP_EncryptUpdate() (libcrypto) → htonl() (arpa/inet)
value = byte length of the payload pipeline (0x120C = 4620)

0x000C · Payload · 4620 B · 9F 2C A8 E1 … 1D
open("/tmp/report.bin") (libc) → read() (syscall) → deflate() (zlib) → EVP_EncryptUpdate() (libcrypto) → send() (syscall)

Example output: each field of an unknown packet (with its offset), alongside the exact chain of libc / zlib / OpenSSL calls its bytes flowed through.

Read this way, the structure of the packet emerges on its own. The field at 0x0008 is a length, equal to the size of the payload produced by read() → deflate() → EVP_EncryptUpdate(); the field at 0x0004 comes from the time() syscall, converted to network byte order by htonl(); and the payload at 0x000C is the content of /tmp/report.bin after zlib compression and AES encryption. Instead of guessing field boundaries by staring at hex dumps, the analyst gets both the structure and the meaning of each field directly, which is exactly what makes reverse-engineering so much faster. And because the layout is recovered in a structured form, it can be turned straight into a Wireshark dissector, generated automatically rather than written by hand.
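As an illustration of that last step, the sketch below renders a Wireshark Lua dissector from a list of fixed-offset fields, using the four-field layout of the example packet. It is a sketch under assumptions, not the tool's actual generator: the function `make_dissector`, the protocol name `recovered`, the transport (TCP) and port 9000 are all hypothetical, and TCP reassembly is ignored. The emitted Lua relies only on standard Wireshark calls (`Proto`, `ProtoField.uint32`, `ProtoField.bytes`, `DissectorTable.get`).

```python
def make_dissector(proto, description, port, fields):
    """Render Lua source for a dissector.

    fields: list of (name, kind, offset, length); kind is "uint32" or "bytes",
    and length None means "from offset to the end of the packet".
    """
    lines = [f'local p = Proto("{proto}", "{description}")']
    for name, kind, _offset, _length in fields:
        if kind == "uint32":
            ctor, base = "ProtoField.uint32", ", base.DEC"
        else:
            ctor, base = "ProtoField.bytes", ""
        lines.append(
            f'local f_{name} = {ctor}("{proto}.{name}", "{name.capitalize()}"{base})'
        )
    lines.append("p.fields = { " + ", ".join(f"f_{f[0]}" for f in fields) + " }")
    lines.append("function p.dissector(buffer, pinfo, tree)")
    # Require every fixed field, plus at least one byte of any open-ended field.
    min_len = max(offset + (length or 1) for _, _, offset, length in fields)
    lines.append(f"  if buffer:len() < {min_len} then return end")
    lines.append(f'  pinfo.cols.protocol = "{proto.upper()}"')
    lines.append(f'  local subtree = tree:add(p, buffer(), "{description}")')
    for name, _kind, offset, length in fields:
        rng = f"buffer({offset})" if length is None else f"buffer({offset}, {length})"
        # tree:add reads integers big-endian, matching the htonl() in the trace.
        lines.append(f"  subtree:add(f_{name}, {rng})")
    lines.append("end")
    lines.append(f'DissectorTable.get("tcp.port"):add({port}, p)')
    return "\n".join(lines) + "\n"


# Layout of the example packet: three 4-byte fields, then the payload.
layout = [
    ("counter", "uint32", 0, 4),
    ("timestamp", "uint32", 4, 4),
    ("length", "uint32", 8, 4),
    ("payload", "bytes", 12, None),
]
lua_source = make_dissector("recovered", "Recovered protocol", 9000, layout)
print(lua_source)
```

The printed Lua can be saved to a file and loaded as a Wireshark plugin; field types richer than `uint32` and `bytes` (for instance a timestamp rendered as a date) would need additional mapping rules.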

Results

A fully functional proof of concept, validated by the team on real targets.
Analysis time for a protocol cut from about 5 hours to roughly 30 minutes.
Far broader coverage, and fields that manual inspection routinely missed now surfaced automatically.
Recovered layouts can be exported directly as Wireshark dissectors, generated automatically instead of hand-written.