In December 2021, Robin Kok wrote a series of tweets about his Elsevier data access request. I did the same a few days later. This is the resulting collaborative blog post, summarizing our journey in trying to understand what data Elsevier collects, what data Elsevier has collected on the two of us specifically, and how to get this data deleted. A PDF version of this blog post is also available.
Elsevier, data kraken
Everybody in academia knows Elsevier. Even if you think you don’t, you probably do. Not only do they publish over 2,500 scientific journals, but they also own the citation database Scopus, as well as the ScienceDirect collection of electronic journals from which you get your papers. That nifty PURE system your university wants you to use to keep track of your publications and projects? You guessed it: Elsevier. And what about that marvelous reference manager, Mendeley? Elsevier bought it in 2013. The list goes on and on.
But what exactly is Elsevier? We follow the advice of an Elsevier spokesperson: “if you think that information should be free of charge, go to Wikipedia”. Let’s do that! Wikipedia, in their core summary section, introduces Elsevier as “a Netherlands-based academic publishing company specializing in scientific, technical, and medical content.”
The intro continues:
And it’s not just rent-seeking. Elsevier admitted to writing “sponsored article compilation publications, on behalf of pharmaceutical clients, that were made to look like journals and lacked the proper disclosures”; offered Amazon vouchers to a select group of researchers to submit five-star reviews on Amazon for certain products; manipulated citation reports; and is one of the leading lobbyists against open access and open science efforts. For this, Elsevier’s parent company, RELX, even employs two full-time lobbyists in the European Parliament, feeding “advice” into the highest levels of legislation and science organization. Here is a good summary of Elsevier’s problematic practices; suffice it to say that they’re very good at making profits.
As described by Wikipedia, one way to make profits is Elsevier’s business as an academic publisher. Academics write articles for Elsevier journals for free and hand over copyright; other academics review and edit these papers for free; and Elsevier then sells these papers back to academics. Much of the labor that goes into Elsevier products is funded by public money, only for Elsevier to sell the finished products back e.g. to university libraries, using up even more public money.
But in the 2020s—and now we come to the main topic of this piece—there is a second way of making money: selling data. Elsevier’s parent company RELX bills itself as “a global provider of information-based analytics and decision tools for professional and business customers”. And Elsevier itself has been busy with rebranding, too:
This may sound irrelevant to you as a researcher, but here we show how Elsevier helps its parent company monetize your data; how much data they have on you; and why it will take major steps to change this troubling situation.
Data access request
Luckily, folks over at Elsevier “take your privacy and trust in [them] very seriously”, so we used the Elsevier Privacy Support Hub to start an “access to personal information” request. Being in the EU, we are legally entitled under the European General Data Protection Regulation (GDPR) to ask Elsevier what data they have on us, and submitting this request was easy and quick.
After a few weeks, we both received responses by email. We had been assigned numbers 0000034 and 0000272 respectively, perhaps implying that relatively few people have made use of this system yet. The emails contained several files with a wide range of our data, in different formats. One of the attached Excel files had over 700,000 cells of data, going back many years, and exceeded 5 MB in file size. We want to talk you through a few examples of what Elsevier knows about us.
They have your data
To start with, of course they have information we have provided them with in our interactions with Elsevier journals: full names, academic affiliations, university e-mail addresses, completed reviews and corresponding journals, times when we declined review requests, and so on.
Apart from this, there was a list of IP addresses. Checking these IP addresses identified one of us in the small city we live in, rather than where our university is located. We also found several personal user IDs, which is likely how Elsevier connects our data across platforms and accounts. We were also surprised to see multiple (correct) private mobile phone numbers and e-mail addresses included.
And there is more. Elsevier tracks which emails you open, the number of links per email clicked, and so on.
We also found our personal address and bank account details, probably because we had received a small payment for serving as a statistical reviewer. These €55 sure came with a privacy cost larger than anticipated.
Data called “Web Traffic via Adobe Analytics” appears to list which websites we visited, when, and from which IP address. “ScienceDirect Usage Data” contains information on when we looked at which papers, and what we did on the corresponding website. Elsevier appears to distinguish between downloading or looking at the full paper and other types of access, such as looking at a particular image (e.g. “ArticleURLrequestPage”, “MiamiImageURLrequestPage”, and “MiamiImageURLreadPDF”), although it’s not entirely clear from the data export. This points to a general issue that will come up repeatedly in this piece: while Elsevier shared what data they have on us, and while they know what the data mean, it was often unclear to us, navigating the data export, what the data mean. In that sense, the usefulness of the current data export is, at least in part, questionable. In the extreme, it’s a bit like asking Google what they know about you and receiving a file full of special characters that have no meaning to you.
Going back to what data they have, next up: Mendeley. Like many, both of us have used this reference manager for years. For one of us, the corresponding tab in the Excel file from Elsevier contained a whopping 213,000 lines of data, from 2016 to 2022. For the other, although he also used Mendeley extensively for years, the data export contained no Mendeley data whatsoever, a discrepancy for which we could not find an explanation. Elsevier appears to log every time you open Mendeley, and many other things you do with the software: we found field codes such as “OpenPdfIn InternalViewer”, “UserDocument Created”, “DocumentAnnotation Created”, “UserDocument Updated”, “FileDownloaded”, and so on.
They use your data
Although many of these data points seem relatively innocent at first, they can easily be monetized, because you can extrapolate core working hours, vacation times, and other patterns of a person’s life. This can be understood as detailed information about the workflow of academics – exactly the thing we would want to know if, like Elsevier, our goal was to be a pervasive element in the entire academic lifecycle.
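To make this concrete, here is a minimal sketch of how little it takes to turn a list of event timestamps into a rough picture of someone's working hours and longer absences. It is not Elsevier's method (we have no insight into what they actually compute); the function name `infer_patterns`, the ISO timestamp format, the "half of the busiest hour" threshold, and the seven-day gap are all our own assumptions for illustration.

```python
from collections import Counter
from datetime import date, datetime


def infer_patterns(
    timestamps: list[str], min_gap_days: int = 7
) -> tuple[list[int], list[tuple[date, date]]]:
    """Guess typical active hours and long breaks from event timestamps.

    `timestamps` are ISO 8601 strings (an assumed format; the real export
    may use something else). Returns (core_hours, gaps).
    """
    events = sorted(datetime.fromisoformat(t) for t in timestamps)
    if not events:
        return [], []

    # Count events per hour of day; treat hours with at least half as many
    # events as the busiest hour as "core" hours (an arbitrary threshold).
    per_hour = Counter(e.hour for e in events)
    peak = max(per_hour.values())
    core_hours = sorted(h for h, n in per_hour.items() if n >= peak / 2)

    # Any stretch of at least `min_gap_days` between two active days is
    # flagged as a possible vacation or other absence.
    active_days = sorted({e.date() for e in events})
    gaps = [
        (last_seen, next_seen)
        for last_seen, next_seen in zip(active_days, active_days[1:])
        if (next_seen - last_seen).days >= min_gap_days
    ]
    return core_hours, gaps


if __name__ == "__main__":
    # Made-up timestamps, not taken from our data export.
    sample = [
        "2021-03-01T09:15:00",
        "2021-03-01T10:30:00",
        "2021-03-02T09:45:00",
        "2021-03-02T14:05:00",
        "2021-03-03T10:10:00",
        "2021-03-20T09:20:00",
    ]
    hours, breaks = infer_patterns(sample)
    print("Likely core hours:", hours)
    print("Possible absences:", breaks)
```

With years of logged events rather than six invented ones, even a crude heuristic like this would likely give a fairly detailed picture of a person's routine.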
This interest in academic lifecycle data is not surprising, given the role of Elsevier’s parent company RELX as a global provider of information-based analytics and decision tools, as well as Elsevier’s rebranding towards an Information Analytics Business. Collecting data comes at a cost for a company, and it is safe to assume that they wouldn’t gather data if they didn’t intend to do something with it.
One of the ways to monetize your data is painfully obvious: old-school spam email tactics such as trying to get you to use more Elsevier services by signing you up for newsletters. Many academics receive unending floods of unsolicited emails and newsletters from Elsevier, which prompted one of us to do the subject access request in the first place. In the data export, we found a huge list of highly irrelevant newsletters we were unknowingly subscribed to; for one of us, the corresponding part of the data on “communications” has over 5,000 rows.
You agreed to all of this?
Well, actually, now that you ask, we don’t quite recall consenting to Mendeley collecting data that could be used to infer information on our working hours and vacation time. After all, with this kind of data, it is entirely possible that Elsevier knows our work schedule better than our employers. And what about the unsolicited emails that we received even after unsubscribing? For most of these, it’s implausible that we would have consented. As you can see in the screenshot above, during one day (sorry, night!), at 3:20am, within a single minute, one of us “signed up” to no fewer than 50 newsletters at the same time – nearly all unrelated to our academic discipline.
Does Elsevier really have our consent for these and other types of data they collected? The data export seems to answer this question, too, with aptly named columns such as “no consent” and “unknown consent”, the 0s and 1s probably marking “yes” or “no”.
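If you want to check this in your own export, a tally of those columns is a few lines of Python. The sketch below assumes you have saved the relevant sheet as CSV, and that a 1 means the flag applies, which is our guess rather than anything Elsevier documented. The column names “no consent” and “unknown consent” come from our export; the `newsletter` column, the sample rows, and the function name `tally_consent` are made up for illustration.

```python
import csv
import io

# Invented sample data in the assumed layout; not taken from our export.
SAMPLE_CSV = """newsletter,no consent,unknown consent
Journal A alerts,1,0
Journal B alerts,0,1
Journal C alerts,1,0
Journal D alerts,0,0
"""


def tally_consent(csv_text, flags=("no consent", "unknown consent")):
    """Count rows per consent flag, assuming '1' means the flag is set."""
    counts = {flag: 0 for flag in flags}
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        for flag in flags:
            # Missing or empty cells are treated as "not set".
            if (row.get(flag) or "").strip() == "1":
                counts[flag] += 1
    return total, counts


if __name__ == "__main__":
    total, counts = tally_consent(SAMPLE_CSV)
    for flag, n in counts.items():
        print(f"{flag}: {n} of {total} rows")
```

If the 0s and 1s turn out to mean the opposite in your file, flip the comparison; the export itself does not say.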
You can check out any time you like…?
Elsevier knows a lot about us, and the data they sent us in response to our access request may only scratch the surface. Although they sent a large volume of data, inconsistencies we found (like the missing Mendeley data for one of us) make us doubt whether it is truly all the data they have. What to do? The answer seems straightforward: we can just stop donating our unpaid time and our personal and professional data, right? Indeed, more than 20,000 researchers have already taken a stand against Elsevier’s business practices, by openly refusing to publish in (or review / do editorial work for) Elsevier.
But that does not really solve the problem we’re dealing with here. A lot of the data Elsevier might monetize is data you cannot really avoid providing as an academic. For example, many of you will access full texts of papers through the ScienceDirect website, which often requires an institutional login. Given that the login is uniquely identifiable, they know exactly which papers you’ve looked at, and when. The same applies to all of the other Elsevier products, some of which we briefly mentioned above, as well as to emails. Many emails may be crucial for you (e.g. from an important journal), and Elsevier logs which emails you open and whether you click on links. Sure, this is probably standard marketing practice and Elsevier is not the only company doing it, but it doesn’t change the fact that as an active academic, you basically cannot avoid giving them data they can sell. In fact, just nominating someone for peer review can be enough to get that person onto Elsevier’s list. Did you ever realize that for most reviews you’re invited to, you actually never consented to being approached by the given journal?
Elsevier has created a system where it seems impossible to avoid giving them your data. Dominating or at least co-dominating the market of academic publishing, they exploited free labor of researchers, and charged universities very high amounts of money so researchers could access scientific papers (which, in part, they wrote, reviewed and edited themselves). This pseudo-monopoly made Elsevier non-substitutable, which now allows their transition into a company selling your data.
Worse, they say that “personal information that is integral to editorial history will be retained for as long as the articles are being made available”, as they write in their supporting information document on data collection and processing we received as part of the access request. What data exactly are integral to editorial history remains unclear.
If not interacting with Elsevier is not a sustainable solution in the current infrastructure, maybe some more drastic measures are required. So one of us took the most drastic step available on Elsevier’s privacy hub: a deletion of personal information request.
This was also promptly handled, but leaves two core concerns. First, it is not entirely clear to us what information was retained by Elsevier, for example, because they consider it “integral to editorial history”. And second, how sustainable is data deletion if all it takes to be sucked back into the Elsevier data ecosystem again is one of your colleagues recommending you as a reviewer for one of the 600,000 articles Elsevier publishes per year?
Some of the issues mentioned here, such as lack of consent, seem problematic to us from the perspective of e.g. European data protection laws. Is it ok for companies to sign us up to newsletters without consent? Is it ok to collect and retain personal data indefinitely because Elsevier argues it is necessary?
And when Elsevier writes in the supporting information that they do “not undertake any automated decision making in relation to your personal information” (which may violate European laws), can that be true when they write, in the same document, that they are using personal information to tailor experiences? “We are using your personal data for […] enhancing your experience of those products, for example by providing personalized recommendations based on your use of the products.”
We are not legal scholars, and maybe there is no fire here. But from where we stand, there seems to be an awful lot of smoke. We hope that legal and privacy experts can bring clarity to the questions we raise above—because we simply don’t know what to do about a situation that is becoming increasingly alarming.
Thanks to Björn Brembs for comments on an earlier version.