Stylistic analysis can de-anonymize code, even compiled code

A presentation today at Defcon from Drexel computer science prof Rachel Greenstadt and GWU computer science prof Aylin Caliskan builds on the pair's earlier work in software authorship attribution, showing that they can identify the anonymous author of a program with a high degree of accuracy, whether it's distributed as source code or as a compiled binary.


The discipline of adversarial stylometry has long been locked in an arms race in which some tools make it possible to identify the author of prose, and other tools remove potential identifiers from prose to make it harder to attribute.

The finding that machine-learning-trained adversarial stylometry tools can pick up identifying features in compiled binaries is particularly interesting: these are the machine-readable blobs that your computer actually executes, and they're generated by automatically processing source code (which tends to be very idiosyncratic and thus easier to attribute) through complex compilation steps. That identifiable auctorial tics and quirks survive all the way into the binary is just amazing.
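For a sense of how this kind of attribution works in the abstract, here's a minimal, hypothetical sketch: each code sample is boiled down to a handful of crude style features, and a random-forest classifier is trained to guess the author. The feature set, helper names, and use of scikit-learn are illustrative assumptions, not the researchers' actual pipeline, which draws on far richer lexical and syntactic information.

```python
# Hypothetical sketch only: a toy stylometric attributor. Feature choices and
# the model are illustrative assumptions, not the researchers' own pipeline.
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def style_features(source: str) -> list:
    """Reduce one source-code sample to a few crude layout/lexical features."""
    lines = source.splitlines() or [""]
    tokens = re.findall(r"\w+", source) or [""]
    return [
        sum(len(l) for l in lines) / len(lines),              # average line length
        sum(l.startswith("\t") for l in lines) / len(lines),  # tab-indentation rate
        source.count("{") / len(lines),                       # brace density
        len(set(tokens)) / len(tokens),                       # token diversity
        sum(t in ("for", "while") for t in tokens) / len(tokens),  # loop-keyword rate
    ]

def attribution_accuracy(samples, authors):
    """Cross-validated accuracy of predicting authors from style features alone.

    With several samples per author (the study described below used seven
    problems per programmer), 5-fold cross-validation is feasible.
    """
    X = [style_features(s) for s in samples]
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    return cross_val_score(clf, X, authors, cv=5).mean()
```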

This has worrying implications for, say, dissidents in China who contribute to censorship-evasion tools. And though traditional code-obfuscation techniques (deployed to stop other programmers from reverse-engineering one's code) are not much use against automated attribution attacks, the authors hold out hope that purpose-built tools designed to strip out identifying elements could make code anonymous again.
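To illustrate, at the very shallowest level, what "stripping out identifying elements" could mean, here is a hypothetical layout normalizer that erases the easiest tells: comments, blank lines, and trailing whitespace. It's a sketch under generous assumptions, not one of the purpose-built tools the researchers have in mind; the signal that survives compilation lives in deeper structural habits that this kind of reformatting doesn't touch.

```python
import re

def normalize_layout(source: str) -> str:
    """Crude, hypothetical layout normalizer: drops comments, blank lines and
    trailing whitespace so the shallowest stylistic tells disappear. Deeper
    habits (naming, control-flow structure) are left untouched."""
    cleaned = []
    for line in source.splitlines():
        # Naively strip Python-style end-of-line comments (this would also
        # clobber a "#" inside a string literal -- it's only a sketch).
        line = re.sub(r"#.*$", "", line).rstrip()
        if line:                      # drop lines that are now empty
            cleaned.append(line)
    return "\n".join(cleaned)
```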


Greenstadt and Caliskan have also uncovered a number of interesting insights about the nature of programming. For example, they have found that experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.

Similarly, they found that code samples addressing more difficult problems are also easier to attribute. Using a sample set of 62 programmers, who each solved seven "easy" problems, the researchers were able to de-anonymize their work 90 percent of the time. When the researchers used seven "hard" problem samples instead, their accuracy bumped to 95 percent.

In the future, Greenstadt and Caliskan want to understand how other factors might affect a person’s coding style, like what happens when members of the same organization collaborate on a project. They also want to explore questions like whether people from different countries code in different ways. In one preliminary study, for example, they found they could differentiate between code samples written by Canadian and Chinese developers with over 90 percent accuracy.


Even Anonymous Coders Leave Fingerprints [Louise Matsakis/Wired]