An Indian research university has assembled 73 million journal articles (without permission) and is offering the archive for unfettered scientific text-mining

The JNU Data Depot is a joint project between rogue archivist Carl Malamud (previously), bioinformatician Andrew Lynn, and a research team from New Delhi’s Jawaharlal Nehru University: together, they have assembled 73 million journal articles from 1847 to the present day and put them into an airgapped repository that they’re offering to noncommercial third parties who want to perform textual analysis on them to “pull out insights without actually reading the text.”

This text-mining process is already well-developed and has produced startling scientific insights, including “databases of genes and chemicals, map[s of] associations between proteins and diseases, and [automatically] generate[d] useful scientific hypotheses.” But the hard limit on this kind of text mining is the paywalls that academic and scholarly publishers put around their archives, which restrict both who can access the collections and what kinds of queries researchers can run against them.

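To make the concept concrete: the simplest form of this kind of mining is counting how often known entity names co-occur in the same sentence across a corpus. Below is a minimal, illustrative sketch of that idea in Python; the gene and chemical vocabularies are made up for the example, and this is not the Depot's actual pipeline.

```python
# Minimal sentence-level co-occurrence miner (illustrative only; this is
# not the JNU Data Depot's actual pipeline).
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies: a real miner would load tens of thousands of
# terms from curated ontologies rather than toy lists like these.
GENES = {"BRCA1", "TP53", "INS"}
CHEMICALS = {"tamoxifen", "metformin", "cisplatin"}

def mine_cooccurrences(articles):
    """Count how often each (gene, chemical) pair is mentioned in the
    same sentence across a corpus, a crude signal of association."""
    pairs = Counter()
    for text in articles:
        # Naive sentence split; production systems use real tokenizers.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            genes = {g for g in GENES if g in sentence}
            chems = {c for c in CHEMICALS if c in sentence.lower()}
            pairs.update(product(genes, chems))
    return pairs

if __name__ == "__main__":
    corpus = [
        "TP53 mutations altered the response to cisplatin in our cohort. "
        "Metformin had no effect on INS expression.",
    ]
    for (gene, chem), n in mine_cooccurrences(corpus).most_common():
        print(f"{gene} <-> {chem}: {n} co-mention(s)")
```

Real association-mining systems layer curated ontologies (such as MeSH) and statistical scoring on top of this, but even the crude version only works if you can run it over the full text of a corpus, which is exactly what the paywalls prevent.
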
By putting 73 million articles in a repository without having to bargain with the highly concentrated and notoriously rent-seeking scholarly publishing industry, the JNU Data Depot team is able to dispense with the arbitrary restrictions put on data-mining. They also believe they are on the right side of Indian copyright law: as a scholarly institution making a single digital copy for local use, and not circulating the articles on the internet, they hope these precautions will shield them from a lawsuit.

They’re relying on precedent set in a 2016 Delhi High Court ruling that turned on the legality of a copy shop selling photocopied selections from expensive textbooks: the court held that section 52 of India’s Copyright Act, 1957 allows reproduction of copyrighted works for education and research.

Malamud won’t say where the articles came from, but he did tell Nature that he came into possession of eight hard drives’ worth of articles from Sci-Hub, the pirate research site whose mission is to liberate scholarly and scientific works from paywalls and ensure that they are universally available. Sci-Hub was founded in memory of Aaron Swartz, a collaborator of Malamud’s who was persecuted by federal prosecutors and threatened with decades in prison for downloading scientific articles from MIT’s network. Swartz hanged himself in 2013, after prosecutors had used legal delaying tactics to drain his savings, including the money he received when Conde Nast acquired Reddit, which had merged with a company he founded.

Malamud argues that the High Court ruling applies regardless of the source of the articles, and that the Google Book Search precedent makes his project legal under US law as well.

The project has already attracted users, like National Institute of Plant Genome Research computational biologist Gitanjali Yadav, who is using the Depot to augment her EssOilDB, a database of chemicals secreted by plants that is heavily used by drug developers, perfumers, and other researchers. EssOilDB was built with queries against Google Scholar and PubMed, but the Depot’s repository holds out the possibility of massively expanding it.

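For a sense of what that PubMed-driven workflow looks like, NCBI exposes a public E-utilities endpoint that can be scripted; the sketch below uses only the Python standard library, and the search term is an illustrative placeholder rather than EssOilDB’s actual query.

```python
# Minimal PubMed search via NCBI's public E-utilities API (standard
# library only). The query term below is an illustrative placeholder.
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search(term, retmax=20):
    """Return (total_hit_count, first_batch_of_pubmed_ids) for a query."""
    params = urllib.parse.urlencode({
        "db": "pubmed",     # search the PubMed citation database
        "term": term,
        "retmode": "json",
        "retmax": retmax,   # how many IDs to return in this batch
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        result = json.load(resp)["esearchresult"]
    return int(result["count"]), result["idlist"]

if __name__ == "__main__":
    # NCBI asks clients to keep request rates low (about 3 requests per
    # second without an API key).
    count, ids = pubmed_search("essential oil AND volatile compounds")
    print(f"{count} matching articles; first PubMed IDs: {ids}")
```

Note that searches like this return article IDs and abstract-level metadata; the millions of paywalled full texts are what the Depot adds.
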
Other projects eyeing the Depot include a database of genes linked to type 2 diabetes, and an MIT Media Lab group that studies “how academic publishing has evolved over time” and hopes to “forecast emerging areas of research and identify alternatives to conventional metrics for measuring research impact.”

Though the research that Malamud is reproducing is often copyrighted by for-profit scholarly publishers, those publishers typically do not pay to undertake, document, edit, or review the papers they publish. The vast majority of the research in journals is publicly funded, and the authors of these works — the scientists and scholars who conduct the research — are not compensated for signing over their copyrights to journals. The journals also rely on volunteers (again, generally scholars whose salaries are paid by public grants or public universities and research institutions) to sort, edit, and review the articles they publish, as well as to sit on the editorial boards of their journals. The publishers’ contribution is often little more than taking work produced at public expense and sticking it behind a paywall.

The vast majority of large scholarly publishers told Nature that “researchers looking to mine their papers needed their authorization.”

Malamud acknowledges that there is some risk in what he is doing. But he argues that it is “morally crucial” to do it, especially in India. Indian universities and government labs spend heavily on journal subscriptions, he says, and still don’t have all the publications they need. Data released by Sci-Hub indicate that Indians are among the world’s biggest users of its website, suggesting that university licenses don’t go far enough. Although open-access movements in Europe and the United States are valuable, India needs to lead the way in liberating access to scientific knowledge, Malamud says. “I don’t think we can wait for Europe and the United States to solve that problem because the need is so pressing here.”

The plan to mine the world’s research papers [Priyanka Pulla/Nature]

(Image: Joi Ito, CC-BY)
