Train your AI with the world's largest data-set of sarcasm, courtesy of redditors' self-tagging

Cory Doctorow

12:23 pm Mon, May 1, 2017

Redditors’ convention of tagging their sarcastic remarks is a dream come true for machine learning researchers hoping to teach computers to recognize and/or generate sarcasm.

The Self-Annotated Reddit Corpus (SARC) contains 1.3 million sarcastic remarks (“10 times more than any previous dataset”) that were tagged by redditors themselves and are stored along with “user, topic, and conversation context.”

Reddit comments from December 2005 have been made available due to web-scraping; we construct our dataset as a subset of comments from 2009-2016, comprising the vast majority of comments and excluding noisy data from earlier years. For each comment we provide a sarcasm label, author, the subreddit it appeared in, the comment score as voted on by users, the date of the comment, and the parent comment or submission.

A Large Self-Annotated Corpus for Sarcasm

[Mikhail Khodak, Nikunj Saunshi and Kiran Vodrahalli/Princeton University]
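To make the excerpt concrete, here is a minimal Python sketch of what one labeled record might look like, assuming the sarcasm tag is the "/s" that redditors commonly append to a comment. The Comment class, its field names, and the label_comment helper are hypothetical illustrations of the fields listed in the excerpt, not the paper's actual schema or code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: sarcasm is self-tagged with a trailing "/s" marker.
SARCASM_TAG = "/s"


@dataclass
class Comment:
    """One record with the fields the excerpt lists (names are hypothetical)."""
    text: str              # comment body with the tag stripped
    sarcastic: bool        # the self-annotated sarcasm label
    author: str
    subreddit: str
    score: int             # comment score as voted on by users
    created: date          # date of the comment
    parent: Optional[str]  # parent comment or submission, if any


def label_comment(raw_text: str, author: str, subreddit: str, score: int,
                  created: date, parent: Optional[str] = None) -> Comment:
    """Derive the sarcasm label from a trailing tag and strip it from the text."""
    text = raw_text.rstrip()
    sarcastic = text.endswith(SARCASM_TAG)
    if sarcastic:
        # Drop the tag itself so a model cannot simply learn to spot "/s".
        text = text[: -len(SARCASM_TAG)].rstrip()
    return Comment(text, sarcastic, author, subreddit, score, created, parent)


tagged = label_comment("What a great idea /s", "alice", "politics", 5,
                       date(2016, 1, 1), parent="We should ban umbrellas.")
plain = label_comment("What a great idea", "bob", "politics", 2,
                      date(2016, 1, 1))
```

A real pipeline would need more care than this, since a plain suffix check also matches things like URLs that happen to end in "/s".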

(via Marginal Revolution)