The JNU Data Depot is a joint project between rogue archivist Carl Malamud (previously), bioinformatician Andrew Lynn, and a research team from New Delhi’s Jawaharlal Nehru University. Together, they have assembled 73 million journal articles from 1847 to the present day and put them into an air-gapped repository that they’re offering to noncommercial third parties who want to perform textual analysis on them to “pull out insights without actually reading the text.”
This text-mining process is already well-developed and has produced startling scientific insights, including “databases of genes and chemicals, map[s of] associations between proteins and diseases, and [automatically] generate[d] useful scientific hypotheses.” But the hard limit on this kind of text mining is the paywalls that academic and scholarly publishers put around their archives, which limit both who can access the collections and what kinds of queries they can run against them.
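As a rough illustration of what “map[ping] associations between proteins and diseases” can mean in practice, here is a minimal sketch that counts how often a protein and a disease are mentioned in the same document. The term lists and the sample sentences are invented for this example, and real text-mining pipelines are far more sophisticated; nothing here describes how the Depot’s users actually work.

```python
from collections import Counter
from itertools import product
import re

# Hypothetical vocabularies; a real project would load curated term lists.
PROTEINS = {"insulin", "leptin", "amyloid"}
DISEASES = {"diabetes", "obesity", "alzheimer"}


def cooccurrences(documents):
    """Count how many documents mention each (protein, disease) pair."""
    counts = Counter()
    for text in documents:
        # Lowercase and split into a set of word tokens.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        found_proteins = sorted(words & PROTEINS)
        found_diseases = sorted(words & DISEASES)
        # Every protein seen alongside every disease counts once per document.
        counts.update(product(found_proteins, found_diseases))
    return counts


# Invented toy sentences standing in for article text.
docs = [
    "Insulin resistance is a hallmark of type 2 diabetes.",
    "Leptin signalling is altered in obesity and diabetes.",
    "Insulin therapy in diabetes patients.",
]
pairs = cooccurrences(docs)
for pair, n in pairs.most_common():
    print(pair, n)
```

Run against millions of full-text articles rather than three sentences, counts like these are the raw material for an association map, which is why unrestricted access to the text matters.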
By putting 73 million articles in a repository without having to bargain with the highly concentrated and notoriously rent-seeking scholarly publishing industry, the JNU Data Depot team are able to dispense with the arbitrary restrictions put on data-mining. They also believe that they are on the right side of Indian copyright law, because they are a scholarly institution that is making a single digital copy for local use and not circulating the articles on the internet; these precautions, they hope, might shield them from a lawsuit.
They’re relying on a precedent set in a 2016 Delhi High Court ruling on the legality of a copy shop that sold photocopied selections from expensive textbooks, in which the court held that Section 52 of the 1957 Copyright Act allows reproduction of copyrighted works for education and research.
Malamud won’t say where the articles came from, but he did tell Nature that he came into possession of eight hard drives’ worth of articles from Sci-Hub, the pirate research site whose mission is to liberate scholarly and scientific works from paywalls and ensure that they are universally available. Sci-Hub was founded in memory of Aaron Swartz, a collaborator of Malamud’s who was persecuted by the FBI and threatened with decades in prison for downloading scientific articles from MIT’s network. Swartz hanged himself in 2013, after the federal prosecutors on the case had used legal delaying tactics to drain his savings, including the money he received when Reddit, which had acquired a company he founded, was sold to Condé Nast.
Malamud argues that the High Court ruling applies regardless of the source of the articles, and that the Google Book Search precedent makes his project legal under US law as well.
The project has already attracted users, like National Institute of Plant Genome Research computational biologist Gitanjali Yadav, who is using the Depot to augment her EssOilDB, a database of chemicals secreted by plants that is heavily used by drug developers, perfumiers, and other kinds of researchers. EssOilDB was built with queries against Google Scholar and PubMed, but the Depot’s repository holds out the possibility of massively expanding it.
Other projects eyeing up the Depot include a database of genes linked to type 2 diabetes; and an MIT Media Lab group that studies “how academic publishing has evolved over time” and hopes to “forecast emerging areas of research and identify alternatives to conventional metrics for measuring research impact.”
Though the research that Malamud is reproducing is often copyrighted by for-profit scholarly publishers, those publishers typically do not pay for the research to be undertaken, documented, edited or reviewed. The vast majority of the research in journals is publicly funded, and the authors of these works (the scientists and scholars who conduct the research) are not compensated for signing over their copyrights to journals. The journals also rely on volunteers (again, generally scholars whose salaries are paid by public grants or public universities and research institutions) to sort, edit and review the articles they publish, as well as to sit on the editorial boards of their journals. The publishers’ contribution is often little more than taking work produced at public expense and sticking it behind a paywall.
The vast majority of large scholarly publishers told Nature “that researchers looking to mine their papers needed their authorization.”
Malamud acknowledges that there is some risk in what he is doing. But he argues that it is “morally crucial” to do it, especially in India. Indian universities and government labs spend heavily on journal subscriptions, he says, and still don’t have all the publications they need. Data released by Sci-Hub indicate that Indians are among the world’s biggest users of their website, suggesting that university licences don’t go far enough. Although open-access movements in Europe and the United States are valuable, India needs to lead the way in liberating access to scientific knowledge, Malamud says. “I don’t think we can wait for Europe and the United States to solve that problem because the need is so pressing here.”
The plan to mine the world’s research papers [Priyanka Pulla/Nature]