The citation graph is one of humankind's most important intellectual achievements

Dario Taraborelli

10:29 am Sat, Apr 14, 2018

Today, citations are also a primary source of data. Funders and evaluation bodies use them to appraise scientific impact and decide which ideas are worth funding to support scientific progress. Because of this, data that forms the citation graph should belong to the public. The Initiative for Open Citations was created to achieve this goal.

Back in the 1950s, reference works like Shepard’s Citations provided lawyers with tools to reconstruct which relevant cases to cite in the context of a court trial. No such tool existed at the time for identifying citations in scientific publications. Eugene Garfield, the pioneer of modern citation analysis and citation indexing, described the idea of extending this approach to science and engineering as his Eureka moment. Garfield’s first experimental Genetics Citation Index, compiled by the newly formed Institute for Scientific Information (ISI) in 1961, offered a glimpse into what a full citation index could mean for science at large. It was distributed, for free, to 1,000 libraries and scientists in the United States.

Fast forward to the end of the 20th century: the Web of Science citation index (maintained by Thomson Reuters, which acquired ISI in 1992) has become the canonical source for scientists, librarians, and funders to search scholarly citations, and for the field of scientometrics to study the structure and evolution of scientific knowledge. ISI could have turned into a publicly funded initiative, but it started instead as a for-profit effort. In 2016, Thomson Reuters sold its Intellectual Property & Science business to a private-equity fund for $3.55 billion. Its citation index is now owned by Clarivate Analytics.

Raw citation data being non-copyrightable, it’s ironic that the vision of building a comprehensive index of scientific literature has turned into a billion-dollar business, with academic institutions paying cripplingly expensive annual subscriptions for access and the public locked out.

Companies such as Clarivate Analytics or Elsevier (which owns its own citation index, Scopus) have put substantial effort into creating proprietary high-quality indexes out of raw citation data, and proprietary metrics based on this data to assess the impact of scientific publications. But the fact that the citation data itself, produced by the labor of millions of researchers as part of their scientific communication activity, is not a public good that anyone can access is nothing short of “a scandal”, as long-standing open citations advocate David Shotton eloquently put it.

“Openness is central to the research endeavor,” say Cassidy Sugimoto and collaborators in an open letter published by the International Society for Scientometrics and Informetrics. “It is essential to promote reproducibility and appraisal of research, reduce misconduct, and ensure equitable access to and participation in science. Yet, calls for increased openness in science are often met with initial resistance.”

Proprietary citation databases are available to universities and funding bodies via expensive subscriptions, but the restrictive nature of their licenses means that these databases don’t allow any kind of reuse or fully reproducible data analysis. Building on citation data is possible only for those people and organizations licensed to access proprietary databases.

There are no citation databases that support the open, unconstrained reuse of their underlying data. Opening up the data that forms the citation graph — to quote the open letter from ISSI — “is a matter of scientific integrity, scientific progress, and equity.”

Enter the Initiative for Open Citations.

In 2016, a small group founded the Initiative for Open Citations (I4OC) as a voluntary effort to work with scholarly publishers — who routinely publish this data — to persuade them to release it in the open and promote its unrestricted availability. Before the launch of the I4OC, only 1% of indexed scholarly publications with references were making citation data available in the public domain. When the I4OC was officially announced in 2017, we were able to report that this number had shifted from 1% to 40%. In the main, this was thanks to the swift action of a small number of large academic publishers.

In April 2018, we are celebrating the first anniversary of the initiative. Since the launch, the fraction of indexed scientific articles with open citation data (as measured by Crossref) has surpassed 50% and the number of participating publishers has risen to 490. Over half a billion references are now openly available to the public without any copyright restriction. Of the top-20 biggest publishers with citation data, all but 5 — Elsevier, IEEE, Wolters Kluwer Health, IOP Publishing, ACS — now make this data open via Crossref and its APIs. Over 50 organisations — including science funders, platforms and technology organizations, libraries, research and advocacy institutions — have joined us in this journey to help advocate and promote the reuse of open citations.

Data liberated by the I4OC is now integrated into bibliometric analysis tools, reused as linked open data in citation corpora, used by volunteer contributors in collaborative knowledge bases, and used to power the catalogues of a growing number of scholarly databases.

The publishers who have released their raw citation data into the public domain are making the vision of an open citation graph a reality. But we are only halfway there. We urge the remaining publishers to join this effort — and researchers, practitioners, librarians, scholarly societies, and members of the public who believe in this vision to help us reach our 100% target. The world is waiting for the citation graph to become a public good.

Dario Taraborelli (@readermeter) is an open knowledge advocate and the Director of Research at the @Wikimedia Foundation.

(Image: Dartar, CC-BY)