Hate-speech detection algorithms are trivial to fool

In All You Need is “Love”: Evading Hate Speech Detection, a Finnish-Italian team of computer scientists describes its work on evading hate-speech detection algorithms; the paper will be presented next month in Toronto at the ACM Workshop on Artificial Intelligence and Security.

As Big Tech has gotten bigger and taken on an outsized role in some of the most toxic and dangerous trends in world affairs — from institutional misogyny to acts of genocide — there have been louder and louder calls for the platforms to police their users' conduct and speech.

The platforms have responded by promising a mix of algorithmic systems and human moderation, and have fielded some of these algorithms and even made them available for testing by the likes of the team behind this research.

The team's findings echo filter-evasion results from other domains, where even the most sophisticated systems can be beaten with trivial countermeasures (for example, Chinese image censorship can be defeated by simply flipping a banned image).

The team discusses several tactics of varying effectiveness, but the most promising and easiest to implement was simply adding the word "love" to a hateful message, while running the "hate" words together in camel-case (e.g. "MartiansAreDisgustingAndShouldBeKilled love.").
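
To make the transformation concrete, here is a minimal sketch in Python of the text rewrite described above. It is only an illustration: the function name `love_attack` and its `padding_word` parameter are made up for this example and are not taken from the paper, and the sketch only rewrites a string; it does not query any real detection system.

```python
def love_attack(message: str, padding_word: str = "love") -> str:
    """Run the words of `message` together in camel case, then append a benign word.

    Hypothetical helper for illustration only; not the researchers' code.
    """
    words = message.split()
    # Capitalize the first letter of each word and drop the spaces between them.
    joined = "".join(word[:1].upper() + word[1:] for word in words)
    # Append the harmless padding word as a separate token.
    return f"{joined} {padding_word}."


print(love_attack("Martians are disgusting and should be killed"))
# prints: MartiansAreDisgustingAndShouldBeKilled love.
```

A human reader can still pick the sentence apart at the capital letters, while a system that splits text on whitespace sees one unfamiliar token followed by a friendly one.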

Yesterday, I wrote about Corynne McSherry's "five lessons from the copyright wars" for people who want the platforms to take a more active role in policing user speech.

This paper raises a sixth lesson: "The filters are unlikely to prevent the kind of activity you're worried about." In the same way that YouTube's Content ID filters are routinely subverted by copyright infringers, we should expect any kind of hate-speech filter to be easy for dedicated harassers to evade. That means the people the filters will be most effective against are those who are caught accidentally, say, because their discussion of how traumatic it was to be subjected to harassment is algorithmically mistaken for harassment itself. Those people just want to work through their trauma with their friends, not systematically probe the weaknesses in the filters, and will thus be at a permanent disadvantage relative to their tormentors, whose avocation is figuring out how to get the harassment through.

The love attack “takes advantage of a fundamental vulnerability of all classification systems: they make their decision based on prevalence instead of presence,” the researchers wrote. That’s fine when a system needs to decide, say, whether content is about sports or politics, but for something like hate speech, diluting the text with more ordinary speech doesn’t necessarily lessen the hateful intent behind the message.
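
The difference between prevalence and presence can be shown with a toy scorer. This is a deliberately crude sketch, not a model of the systems the researchers tested: the word list `FLAGGED`, the cutoff `THRESHOLD`, and both function names are assumptions invented for the example.

```python
# Hypothetical toy lexicon and cutoff, chosen only to illustrate the idea.
FLAGGED = {"disgusting", "killed"}
THRESHOLD = 0.25


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on whitespace, trimming basic punctuation."""
    return [tok.strip(".,!?").lower() for tok in text.split()]


def prevalence_score(text: str) -> float:
    """Fraction of tokens that are flagged: padding with benign words lowers it."""
    toks = tokenize(text)
    if not toks:
        return 0.0
    return sum(tok in FLAGGED for tok in toks) / len(toks)


def presence_flag(text: str) -> bool:
    """True if any flagged token appears at all: padding cannot change it."""
    return any(tok in FLAGGED for tok in tokenize(text))


original = "Martians are disgusting and should be killed"
padded = original + " love" * 10  # same message, diluted with benign words

for label, text in [("original", original), ("padded", padded)]:
    score = prevalence_score(text)
    print(label, round(score, 3), score >= THRESHOLD, presence_flag(text))
```

With these made-up numbers, the original message scores 2 flagged tokens out of 7 and crosses the cutoff, while the padded version scores 2 out of 17 and slips under it, even though the flagged words are still present in both.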

“The message behind these attacks is that while the hateful messages can be made clear to any human (and especially the intended victim), AI models have trouble recognizing them,” says N. Asokan, a systems security professor at Aalto University who worked on the paper.

All You Need is “Love”: Evading Hate Speech Detection [Tommi Grondahl, Luca Pajola, Mika Juuti, Mauro Conti and N. Asokan/Arxiv]

To Break a Hate-Speech Detection Algorithm, Try 'Love' [Louise Matsakis/Wired]