The spam about GDPR and CCPA I received last week turns out to be part of a study by the US-based Princeton University, with one of the researchers recently having joined the Dutch Radboud University. The more recently sent mails apparently had a link to the project page added, I assume in light of feedback received, and that link was then shared in my Mastodon timeline by someone who had received these mails as a Mastodon moderator.

I sent a mail to the research team explaining my complaint about the mails I received. I also approached the Radboud University’s Digital Security (RU DiS) research group where one of the researchers works, and filed a complaint there.
In the past few days I’ve had e-mail exchanges with the research team, as well as with the RU DiS department head. All those I approached have been very responsive and willing to provide information, which I very much appreciate.

That doesn’t make the mails I received ok though. The research team itself may have come to the same notion, as they informed me they’ve stopped sending out new mails for now. They have also added a FAQ to the project page. [UPDATE 2021-12-19: Jonathan Mayer, the Principal Investigator in this Princeton research project, has now issued an apology. These are welcome words.]

On the research

The research project is interested in how companies have set up their process for responding to requests for data access under the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). They also intended these requests for organisations who don’t a priori fall within scope of those acts. Both acts are intended to set a norm for those not covered by them. The GDPR is written to export the EU’s norms for data protection to the rest of the world, and the CCPA is set up to encourage companies not active in California to follow its rules regardless. So far I have no issues.

How I ended up in the list of sites approached

My blog is a personal website, so it falls outside of the declared scope of the study (companies). It can’t fall under the CCPA, as that only applies to businesses (that do business in California, with a certain turnover, or selling data). It is less clear if it falls under the GDPR: in my reading of the GDPR it doesn’t, but at the same time I have written a personal data protection policy as if it does (out of professional interest). So how did I end up in Princeton’s list of site owners to approach? In my conversation with one of the researchers they indicated that the list of sites to approach was a selection taken out of the Tranco list. That list combines the results from various lists of the 1 million most popular websites, such as Alexa (soon to be discontinued), Cisco Umbrella, and Majestic Million. My URL is in both the Alexa and the Majestic list. Cisco’s list looks at DNS requests for domains on their hardware, and unsurprisingly I’m not in their current list as it is based on today’s web traffic. The Majestic list seems to use backlinks to a site as a ranking factor. This favors old websites, as they build up a sediment of such backlinks over time. That includes weblogs that are some 20 years old, such as mine. Unsurprising then that blogs like Dave’s, David’s, and those of longtime blogging friends feature in the list. In the graph below you see my and their blogs as they rank in the Tranco list.

The relative positions of the blogs of several old time blogging friends and myself in the Tranco list of over 1 million sites.

That I might be on the long list when the Tranco list is used makes sense. However, the research group says they used filtering and categorisation to then select the websites to approach. A meaningful selection seems less likely, given that they approached personal sites like mine (and others like it, judging by other online comments about the mails sent).

Still it’s wrong

The research was designed by Princeton’s computer science department, and was discussed with Princeton’s Institutional Review Board (IRB), they say. During this process the team ‘extensively discussed potential risks of our study, and took measures to minimize undue burden on websites, especially websites with less traffic and resources’.
The IRB concluded the research doesn’t constitute human subject research. That is true from a design perspective, but not true in practice, as shown by me receiving their e-mails as a private individual. A better determination of which sites to approach and which not to approach would have been needed for that.

The e-mails sent out for this study are also worryingly problematic in two aspects:
First, they pretend to be actual e-mails by individuals; nowhere is it made clear that it’s research. On top of that the names used for these individuals are clearly fake, and the domains from which the e-mails were sent also easily raise suspicion. Furthermore the request lacks any context: an individual with a real request would never use a generic text, or use the domain name and not the actual name of a website. This makes it unclear to recipients what the very purpose of the e-mails is. This is not only true for individuals or e.g. small non-profits; it would be confusing and suspicious to every recipient even if they had limited their inquiries to major corporations. I’m sure that negatively impacts the results, and thus the validity of conclusions. It also means many recipients will have spent time evaluating, or worse bringing in advice on, how to deal with these suspicious looking requests.

Second, the wording of the e-mail makes it worse. The mails have a legalese ring to them (e.g. stating it is not a formal data access request at this time though one might still follow, another thing a real individual would not phrase like that). What is worse, each mail suggests a legal threat at the end. They say that a response is required within a month based on Article 12 of the GDPR, or within 45 days based on Section 1798.130 of the California Civil Code. Both those statements are lies. Art. 12 GDPR sets a response deadline for data access requests, which this mail emphasises it is not, and the same is true for the California Civil Code.

It’s exactly this wording, with false legal threats and lacking any context to evaluate what the purpose of the e-mails is, that makes people worry and spend time or even money figuring out what they might be exposed to. As an individual I decided to ignore the mails; others didn’t. Would you, if you were a small non-profit or another business that does not have the in-house legal knowledge to deal with this? Precisely those who have some knowledge about the GDPR or CCPA, but not enough to be fully sure of themselves, will spend unnecessary time on these requests. Princeton is thus externalising a burden and cost onto website owners, falsifying the very thing Princeton states about aiming to “minimize undue burden on websites”. Using the word websites obfuscates that every mail will have to be answered by a real person. They could have just mailed me, asking me straight up for their research if I have a process for the GDPR in place. I would have replied to them and been done with it.

Filed complaint

Originally I had filed a complaint with the Digital Security research team at Radboud University, as they are named as partners in the study. Yesterday I withdrew my complaint with them, as they weren’t part of the study design and have just recently hired one of the researchers involved. Nevertheless they informed me they have alerted their own ethics board about this, to take lessons from it w.r.t. guidelines and good practices, even as the head of department said to me it is now too late to prevent damage. At the same time, he wrote, they cannot let it pass because “Even if privacy researchers do these projects with the best of intentions, it doesn’t mean they aren’t required to set them up well”.
It also means that I will refile my complaint with Princeton’s Review Board. Meanwhile this has spilled out online (it’s what you get if you target the 1 million most popular websites…), and I am not the only one filing a complaint, judging by the responses to a tone-deaf tweet by one of the researchers.

Others blogging about this study:
Questions About GDPR Data Access Process Spam from Virginia
Free Radical: CCPA Scam
What’s the deal with those weird GDPR emails?
I Was Part of a Human Subject Research Study Without My Consent