Two years ago a colleague let their dog swim in a lake without paying attention to the information signs. It turned out the water was infested with a type of algae that caused the dog irritation. Since then my colleague has thought it would be great if you could somehow subscribe to notifications for when the quality status of nearby surface water changes.

Recently this colleague took a look at the province’s external communications concerning swimming waters. A provincial government has specific public tasks in designating swimming waters and monitoring their quality. It turns out the particular province my colleague lives in has six (6) public information or data sources concerning swimming waters.

My colleague compared those 6 datasets on a number of criteria: factual correctness, comparability based on an administrative index or key, and comparability on spatial/geographic aspects. Factual correctness here means whether the right objects are represented in the data sets. Are the names, geographic locations and statuses (safe, caution, unsafe) correct? Are details such as available amenities represented correctly everywhere?

If they miss me, I’ve gone fishing
A lake (photo by facemepls, license CC-BY)

As it turns out, each of the 6 public data sets contains a different number of objects. The 6 data sets cannot be connected based on a unique key or ID. Slightly more than half of the swimming waters can be correlated across the 6 data sets by name, but a spatial/geographic connection isn’t always possible. 30% of swimming waters have the wrong status (safe/caution/unsafe) on the provincial website! And 13% of swimming waters are represented with the wrong geometry, meaning they end up in completely wrong locations, and even in the wrong municipalities, on the map.
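Correlating records that lack a shared key, as described above, typically comes down to fuzzy name matching. A minimal sketch using only Python’s standard library; the lake names are hypothetical, and the 0.85 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def match_by_name(records_a, records_b, threshold=0.85):
    """Pair records from two datasets on name similarity alone.

    Anything scoring below the threshold stays unmatched, which mirrors
    how only part of the swimming waters could be correlated by name.
    """
    matches = []
    for a in records_a:
        best, score = None, 0.0
        for b in records_b:
            s = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if s > score:
                best, score = b, s
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches

# Hypothetical names from two of the six lists
signs = ["Grote Plas", "Zandmeer Noord", "Bosbad"]
website = ["De Grote Plas", "Zandmeer-Noord", "Stadsstrand"]
print(match_by_name(signs, website))
```

Small spelling differences (“Zandmeer Noord” vs “Zandmeer-Noord”) still match, while unrelated names fall through, which is why a name-only join never covers the full set.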

Every year at the start of the year the provincial government takes a decision designating the public swimming waters. Yet this province’s decision cannot be found online (even though it was taken last February, and publication is mandatory). Only a draft decision can be found, on the website of one of the municipalities concerned.

The differences in the 6 data sets are more or less reflective of the internal division of tasks within the province. Every department keeps its own files and data sets. One is responsible for designating public swimming waters, another for monitoring swimming water quality. Yet another for making sure those swimming waters are represented in overall public planning / environmental plans. Another for the placement and location of information signs about the water quality, and still another for putting that same information on the website of the province. Every unit has its own task and keeps its own data set for it.

Which ultimately means large inconsistencies internally, and a confusing mix of information being presented to the public.

The Mozilla Foundation has launched a new service that looks promising, which is why I am bookmarking it here. Firefox Send allows you to send files of up to 1GB (or 2.5GB if logged in) to someone else. This is similar to what services like the Dutch WeTransfer do, except Send does it with end-to-end encryption.

Files are encrypted in your browser before being sent to Mozilla’s server, where they are kept until downloaded. The decryption key is contained in the download URL. That download URL is not sent to the receiver by Mozilla; you do that yourself. Files can additionally be locked with a password, which the sender also needs to convey to the receiver through other means. Files are kept for 5 minutes, 1 or 24 hours, or 7 days, depending on your choice, and for 1 up to 100 downloads. This makes it suitable for quick shares during conference calls, for instance. Apart from the encrypted file, Mozilla only knows the IP address of the uploader and the downloader(s). This is unlike services such as WeTransfer, where the service also has e-mail addresses for both uploader and intended downloader, and where you are dependent on them sending the receivers a confirmation with the download link first.


Firefox Send doesn’t send the download link to the recipient, you do

This is an improvement in terms of data protection, even if not fully watertight (nothing ever really is, especially not if you are singled out as a target by a state actor). It does satisfy the need of some of my government clients who currently are not allowed to use services like WeTransfer.
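The trick that makes this work is that the key travels in the URL fragment (everything after ‘#’), which browsers never include in HTTP requests. A toy sketch of the idea; note the stand-in XOR stream cipher is for illustration only (Send itself uses proper authenticated encryption in the browser) and the host name is made up:

```python
import hashlib
import secrets
from base64 import urlsafe_b64decode, urlsafe_b64encode

def _keystream(key: bytes, length: int) -> bytes:
    # Toy SHA-256 counter-mode keystream, for illustration only
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_for_upload(plaintext: bytes):
    key = secrets.token_bytes(16)
    cipher = bytes(p ^ k for p, k in zip(plaintext, _keystream(key, len(plaintext))))
    # The fragment (after '#') is never sent to the server, so the host
    # only ever stores the encrypted blob, not the key
    url = "https://send.example/download/abc123#" + urlsafe_b64encode(key).decode()
    return cipher, url

def decrypt_from_url(cipher: bytes, url: str) -> bytes:
    key = urlsafe_b64decode(url.split("#", 1)[1])
    return bytes(c ^ k for c, k in zip(cipher, _keystream(key, len(cipher))))

blob, link = encrypt_for_upload(b"meeting notes")
assert decrypt_from_url(blob, link) == b"meeting notes"
```

Only someone holding the full link (or the optional extra password) can decrypt, which is why the sender has to convey the link out of band.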

After California, now the Washington State senate has adopted a data protection and privacy act that takes the EU General Data Protection Regulation (GDPR) as an example to emulate.

This is definitely a hoped-for effect of the GDPR when it was launched. European environmental and food safety standards have had a similar global norm-setting impact. This is because for businesses it is generally more expensive to comply with multiple standards than to comply only with the strictest one. We saw it earlier in companies taking GDPR demands and applying them to themselves across the board. That the GDPR might have this impact is an intentional part of how the EC is developing a third proposition in data geopolitics, between the surveillance capitalism of the US data lakes and the data-driven authoritarianism of China.

To me the GDPR is a quality assurance instrument, with its demands increasing over time. So it is encouraging to see other government entities outside the EU taking a cue from the GDPR. California and Washington State have now adopted similar laws. Five other US states have introduced similar bills for debate in the past 2 months: Hawaii, Massachusetts, New Mexico, Rhode Island, and Maryland.

The number and frequency of 51% attacks on blockchains is increasing, with Ethereum Classic last month being the first of the top 20 cryptocoins to be hit. Other types of attacks mostly try to exploit general weaknesses in how exchanges operate, but this one targets something fundamental to how blockchain is supposed to work. Combined with how blockchain projects don’t seem to deliver and are basically vaporware, we’ve definitely gone from the peak of inflated expectations to the trough of disillusionment. Whether there will be a plateau of productivity remains an open question.

To me there seems to be something fundamentally wrong with plans I come across where companies would pay people for access to their personal data. This is not a well-articulated thing yet; it just feels like the entire framing of the issue is off. The next paragraphs are a first attempt to jot down a few notions.

To me it looks very much like a projection by companies on people of what companies themselves would do: treating data as an asset you own outright and then charging for access. So that those companies can keep doing what they were doing with data about you. It doesn’t strike me as taking the person behind that data as the starting point, nor their interests. The starting point of any line of reasoning needs to be the person the data is about, not the entity intending to use the data.

Those plans make data release, or consent for using it, fully transactional. There are several things intuitively wrong with this.

One thing it does is put everything in the context of single transactions between individuals like you and me, and the company wanting to use data about you. That seems to be an active attempt to distract from the notion that there’s power in numbers. Reducing it to me dealing with a company, and you dealing with them separately makes it less likely groups of people will act in concert. It also distracts from the huge power difference between me selling some data attributes to some corp on one side, and that corp amassing those attributes over wide swaths of the population on the other.

Another thing is that it implies the value is in the data you likely think of as yours: your date of birth, residence, some conscious preferences, the type of car you drive, health care issues, finances etc. But a lot of value is in data you don’t actually hold about yourself but create all the time: your behaviour over time, clicks on a site, reading speed and pauses in an e-book, minutes watched in a movie, engagement with online videos, the cell towers your phone pinged, the logs about your driving style in your car’s computer, likes etc. It’s not that the data you think of as your own is without value, but it feels like the magician wants you to focus on the flower in his left hand, so you don’t notice what he does with his right hand.
On top of that it also means that whatever they offer to pay you will be too little: your data is never worth much by itself, only in aggregate. Offering to pay on an individual transaction basis is an escape for companies, not an emancipation of citizens.

One more element is the suggestion that once such a transaction has taken place everything is ok: all rights have been transferred (even if limited to a specific context and use case) and all obligations have been met. It strikes me as extremely reductionist. When it comes to copyright, authors can transfer some rights, but usually not the moral rights to their work. I feel something similar is at play here: moral rights attached to data that describes a person, which can’t be transferred when data is transacted. Is it ok to manipulate you into a specific bubble and influence how you vote, if they paid you first for the type of stuff they needed to be able to do that to you? The EU GDPR I think takes that approach too, taking moral rights into account. It’s not about ownership of data per se, but about the rights I have if your data describes me, regardless of whether it was collected with consent.

The whole ownership notion is difficult to me in itself. As stated above, a lot of data about me is not necessarily data I am aware of creating or ‘having’, and I likely don’t see a need to collect it about myself. Unless paying me is meant as an incentive to start collecting stuff about me for the sole purpose of selling it to a company, which then doesn’t need my consent nor has to make the effort to collect it about me itself.

There are other instances where me being the only one able to determine whether to share some data or withhold it means risks or negative impact for others. It’s why cadastral records and company beneficial ownership records are public: so you can verify that the house or company I’m trying to sell you is mine to sell, who else has a stake or claim on the same asset, and to what amount. Similar cases might be made for new and closely guarded data, such as DNA profiles. Is it your sole individual right to keep those data closed, or does society have a reasonable claim to them, for instance in the search for the cure for cancer?

All that to say that seeing data as a mere commodity is a very limited take, and that ownership of data isn’t a clear-cut thing. Because of its content, as well as its provenance. And because it is digital data, it has non-rivalrous and non-excludable characteristics, making it akin to a public good. There is definitely a communal and network side to holding, sharing and processing data, currently conveniently ignored in discussions about data ownership.

In short, talking about paying for personal data and about data lockers under my control seems to be a framing that presents data issues as straightforward, but it doesn’t solve any of data’s ethical aspects; it just pretends they’re taken care of, so that things may continue as usual. And that’s even before looking into the potential unintended consequences of payments.

For the UNDP in Serbia, I made an overview of existing studies into the impact of open data. I did something similar for the Flemish government a few years ago, so I had a good list of studies to start from. I updated that first list with more recent publications, resulting in a list of 45 studies from the past 10 years. The UNDP also asked me to suggest a measurement framework. Here’s a summary overview of some of the things I formulated in the report. I’ll start with 10 things that make measuring impact hard, and in a later post zoom in on what makes measuring impact doable.

While it is tempting to ask for a ‘killer app’ or ‘the next tech giant’ as proof of impact of open data, establishing the socio-economic impact of open data cannot depend on that: both because answering such a question is only possible with long-term hindsight, which doesn’t help make decisions in the here and now, and because it would ignore the diversity of types and sizes of impact known to be possible with open data. Judging by the available studies and cases, there are several issues that make any easy answer to the question of open data impact impossible.

1 Dealing with variety and aggregating small increments

There are different varieties of impact, in all shapes and sizes. If an individual stakeholder, such as a citizen, does a very small thing based on open data, like making a different decision on some day, how do we express that value? Can it be expressed at all? E.g. in the Netherlands the open data based rain radar is used daily by most cyclists, to see whether they can get to the railway station dry, had better wait ten minutes, or should rather take the car. The impact of a decision to cycle can mean lower individual costs (no car usage), personal health benefits, economic benefits (lower traffic congestion), environmental benefits (lower emissions) etc., but is nearly impossible to quantify meaningfully as a single act. Only where such decisions are stimulated, e.g. by providing open data that allows much smarter, multi-modal route planning, may aggregate effects become visible, such as a reduction of traffic congestion hours in a year, general health benefits for the population, or a reduction in traffic fatalities, which can be much better expressed as a monetary value to the economy.

2 Spotting new entrants, and tracking SMEs

The existing research shows that previously inactive stakeholders, and small to medium sized enterprises are better positioned to create benefits with open data. Smaller absolute improvements are of bigger value to them relatively, compared to e.g. larger corporations. Such large corporations usually overcome data access barriers with their size and capital. To them open data may even mean creating new competitive vulnerabilities at the lower end of their markets. (As a result larger corporations are more likely to say they have no problem with paying for data, as that protects market incumbents with the price of data as a barrier to entry.) This also means that establishing impacts requires simultaneously mapping new emerging stakeholders and aggregating that range of smaller impacts, which both can be hard to do (see point 1).

3 Network effects are costly to track

The research shows the presence of network effects, meaning that the impact of open data is not contained in, or even mostly specific to, the first order of re-use of that data. Causal effects as well as second and higher order forms of re-use regularly occur and quickly become, certainly in aggregate, much larger than the value of the original form of re-use. For instance the European Space Agency (ESA) commissioned my company for a study into the impact of open satellite data for ice breakers in the Gulf of Bothnia. The direct impact for ice breakers is saving costs on helicopters and fuel, as the satellite data makes determining where the ice is thinnest much easier. But the aggregate value of the consequences of that is much higher: it creates a much higher predictability of ships, and the (food) products they carry, arriving in Finnish harbours, which means lower stocks are needed to ensure supply of those goods. This reverberates across the entire supply chain, saving costs in logistics and allowing lower retail prices across Finland. When mapping such higher order and network effects, every step further down the chain of causality shows that while the bandwidth of value created increases, the certainty that open data is the primary contributing factor decreases. Such studies are also time consuming and costly. It is often unlikely and unrealistic to expect data holders to go to such lengths to establish impact. The mentioned ESA example is part of a series of over 20 such case studies ESA commissioned over the course of 5 years, at considerable cost.

4 Comparison needs context

Without the context of a specific domain or a specific issue, it is hard to assess benefits and compare them to their associated costs, which is often the underlying question concerning the impact of open data: does it weigh up against the costs of the open data effort? Even though in general open data efforts shouldn’t be costly, how does some type of open data benefit compare to the costs and benefits of other actions? Such comparisons can only be made in a specific context (e.g. comparing the cost and benefit of open data for route planning with other measures to fight traffic congestion, such as increasing the number of lanes on a motorway, or increasing the availability of public transport).

5 Open data maturity determines impact and type of measurement possible

Because open data provisioning is a prerequisite for it having any impact, the availability of data and the maturity of open data efforts determine not only how much impact can be expected, but also what can be measured (mature impact might be measured as an effect on e.g. traffic congestion hours in a year, while early impact might be measured as the number of re-users of a data set still growing steadily year over year).

6 Demand side maturity determines impact and type of measurement possible

Whether open data creates much impact is not only dependent on the availability of open data and the maturity of the supply side, even if that is, as mentioned, a prerequisite. Impact, judging by the existing research, is certain to emerge, but the size and timing of such impact depend on a wide range of factors on the demand side as well, including things such as the skills and capabilities of stakeholders, time to market, location and timing. An idea for open data re-use that finds no traction in France, because the initiators can’t bring it to fruition or because the potential French demand is too low, may well find its way to success in Bulgaria or Spain, because local circumstances and markets differ. In the Serbian national open data readiness assessment I performed for the World Bank and the UNDP in 2015, this is reflected in the various dimensions assessed, which cover both supply and demand, as well as general aspects of Serbian infrastructure and society.

7 We don’t understand how infrastructure creates impact

The notion of broad open data provision as public infrastructure (such as the UK, the Netherlands, Denmark and Belgium are already pursuing, and Switzerland is starting to) further underlines the difficulty of establishing the general impact of open data on e.g. growth. That infrastructure (such as roads, telecoms, electricity) is important to growth is broadly acknowledged, and correspondingly accepted within policy making. This acceptance that the quantity and quality of infrastructure increase human and physical capital does not, however, mean it is clear how much a given type of infrastructure contributes at what time to economic production and growth. Public capital is often used as a proxy to ascertain the impact of infrastructure on growth. The consensus is that there is a positive elasticity, meaning that an increase in public capital results in an increase in GDP, averaging around 0.08 but varying across studies and types of infrastructure. Assuming such positive elasticity extends to open data provision as infrastructure (and we have very good reasons to do so), it will result in GDP growth, but without a clear overall view of how much.
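That elasticity figure can be made concrete with a back-of-envelope calculation:

```python
def gdp_growth_from_capital(elasticity: float, capital_increase_pct: float) -> float:
    """Output elasticity: %-change in GDP ~= elasticity * %-change in public capital."""
    return elasticity * capital_increase_pct

# With the average elasticity of 0.08 cited above, a 10% increase in
# public capital maps to roughly 0.8% extra GDP
print(gdp_growth_from_capital(0.08, 10.0))
```

The point of the paragraph above stands, though: the elasticity averages 0.08 across studies, so the same 10% increase could plausibly yield anything from a fraction of that to well above it, depending on the type of infrastructure.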

8 E pur si muove

Most measurements concerning open data impact need to be understood as proxies. They do not measure how open data creates impact directly; rather, from measuring a certain movement it can be surmised that something is doing the moving. Where opening data can be assumed to be doing the moving, and where opening data was a deliberate effort to create such movement, impact can then be assessed. We may not be able to easily see it, but still it moves.

9 Motives often shape measurements

Apart from the difficulty of measuring impact and the effort involved in doing so, there is also the question of why such impact assessments are needed. Is an impact assessment needed to create support for ongoing open data efforts, or to make existing efforts sustainable? Is an impact measurement needed for comparison with specific costs for a specific data holder? Is it to be used for the evaluation of open data policies in general? In other words, in whose perception should an impact measurement be meaningful? The purpose of an impact assessment for open data further determines and/or limits the way such an assessment can be shaped.

10 Measurements get gamed, become targets

Finally, with any type of measurement, there needs to be awareness that those with a stake or interest in a measurement are likely to try and game the system, especially where measurements determine funding for further projects, or the continuation of an effort. This must lead to caution when determining indicators: measurements easily become targets in themselves. For instance, in the early days of national open data portals being launched worldwide, a simple metric often reported was the number of datasets a portal contained. This is an example of a ‘point’ measurement that can easily be gamed, for instance by subdividing a dataset into several subsets. The first version of the national portal of a major EU member did precisely that, and boasted several hundred thousand data sets at launch, most of which were small subsets of a bigger whole. It briefly made for good headlines, but did not make for impact.

In a second part I will take a closer look at what these 10 points mean for designing a measurement framework to track open data impact.

Dutch provinces publish open data, but it always looks like it is mostly geo-data and hardly anything else. When talking to provinces I also get the feeling they struggle to think of data that isn’t of a geographic nature. That isn’t very surprising: a lot of the public tasks carried out by provinces have to do with spatial planning, nature and the environment, and geographic data is a key tool for them. But now that we are aiding several provinces with extending their data provision, I wanted to find out in more detail.

My colleague Niene took the API of the Dutch national open data portal for a spin, and made a list of all datasets listed as stemming from a province.
I took that list and zoomed in on various aspects.
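For those wanting to reproduce this, the portal exposes a CKAN-style API whose standard `package_search` action returns datasets along with their publishing organisation. A rough sketch; the endpoint URL is an assumption on my part, and the tallying function works on any list of CKAN dataset dicts:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed CKAN-style endpoint of the Dutch national open data portal
PORTAL = "https://data.overheid.nl/data/api/3/action/package_search"

def fetch_page(query: str, start: int = 0, rows: int = 100) -> dict:
    """Fetch one page of search results from a CKAN package_search endpoint."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    with urlopen(PORTAL + "?" + params) as response:
        return json.load(response)["result"]

def count_by_publisher(datasets: list) -> dict:
    """Tally datasets per publishing organisation from CKAN dataset dicts."""
    counts = {}
    for ds in datasets:
        org = (ds.get("organization") or {}).get("title", "unknown")
        counts[org] = counts.get(org, 0) + 1
    return counts
```

Paging through the results with `fetch_page` and feeding them to `count_by_publisher` yields the per-province totals behind the graphs discussed below.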

At first glance there are strong differences between the provinces: some publish a lot, others hardly anything. The Province of Utrecht publishes everything twice to the national data portal, once through the national geo-register, once through their own dataplatform. The graph below has been corrected for it.

What explains those differences? And what is the nature of the published datasets?

Geo-data is dominant
First I made a distinction between data that stems from the national geo-register (NGR), to which all provinces publish, and data that stems from another source (either regional data platforms, or for instance direct publication through the national open data portal). The NGR is in theory the place where all provinces share geo-data with other government entities, part of which is then marked as publicly available. In practice the numbers suggest provinces publish to the NGR in roughly the same proportions as the graph above (meaning that of what they publish in the NGR, they mark about the same percentage as open data).

  • Of the over 3000 datasets that are published by provinces as open data in the national open data portal, only 48 don’t come from the national geo-register. This is about 1.5%.
  • Of the 12 provinces, 4 do not publish anything outside the NGR: Noord-Brabant, Zeeland, Flevoland, Overijssel.

Drenthe stands out in terms of the number of geo-data sets published, over 900. A closer look at their list shows that they publish more historic data, and that they seem to be more complete (more of what they share in the NGR is apparently marked as open data). The average is between 200-300, with provinces like Zuid-Holland, Noord-Holland, Gelderland, Utrecht, Groningen, and Fryslan in that range. Overijssel, like Drenthe, publishes more, though less than Drenthe, at about 500. This seems to be the result of a direct connection between their regional geo-portal and the NGR, and thus publishing by default. Overijssel deliberately does not publish historic data, explaining some of the difference with Drenthe. (When something is updated in Overijssel, the previous version is automatically removed. This clashes with open data good practice, but is currently hard to fix in their processes.)

If it isn’t geo, it hardly exists
Of the mere 48 data sets outside the NGR, just 22 (46%) are not geo-related. Overall this means that less than 1% of all open data provinces publish is not geo-data.
Of those 22, exactly half are published by Zuid-Holland alone. They for instance publish several photo-archives, a subsidy register, politician’s expenses, and formal decisions.
Fryslan is the only province publishing an inventory of their data holdings, which is 1 of their only 3 non-geo data sets.
Gelderland stands out as the single province that publishes all their geo-data through the NGR, hinting at a neatly organised process. Their non-NGR open data is also all non-geo (as it should be). They publish 27% of all open non-geo data by provinces, and together with Zuid-Holland they account for 77% of it.

Taking these numbers and comparing them to inventories like the one Fryslan publishes (which we made for them in 2016), and the one for Noord-Holland (which we did in 2013), the dominance of geo-data is not surprising in itself. Roughly 80% of the data provinces hold is geo-related. Just about a fifth to a quarter of this geo-data (15%-20% of the total) is on average published at the moment, yet it makes up over 99% of all provincial open data published. This lopsidedness means that hardly anything on the inner workings of a province, the effectiveness of policy implementation etc. is available as open data.
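As a quick sanity check, the headline percentages follow directly from the counts mentioned above (taking 3000 as an approximation of the “over 3000” total):

```python
# Rounded counts from the analysis above; the total is approximate ("over 3000")
total_datasets = 3000
outside_ngr = 48
non_geo = 22

share_outside = outside_ngr / total_datasets * 100      # 1.6 here; "about 1.5%" of the true total
share_non_geo_of_outside = non_geo / outside_ngr * 100  # about 46%
share_non_geo_total = non_geo / total_datasets * 100    # well under 1%

print(share_outside, share_non_geo_of_outside, share_non_geo_total)
```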

Where the opportunities are
To improve both on the volume and on the breadth of scope of the data provinces publish, two courses of action stand open.
First, extending the availability of the geo-data provinces hold. Most provinces will have a clear process for this, so it should be relatively easy to do, and most provinces should be able to get to where Drenthe currently is.
Second, take a much closer look at the in-house data that is not geo-related. About 20% of dataholdings fall in this category, and based on the inventories we did, some 90% of that should be publishable, maybe after some aggregation or other adaptations.
The lack of an inventory is an obstacle here, but existing inventories should at least be able to point the other provinces in the right direction.

Make the provision of provincial open geodata complete, embrace its dominance and automate it with proper data governance. Focus your energy on publishing ‘the rest’ where all the data on the inner workings of the province is. Provinces perpetually complain nobody is aware of what they are doing and their role in Dutch governance. Make it visible, publish your data. Stop making yourself invisible behind a stack of maps only.

(A Dutch version of this blog post is available at The Green Land.)

Today I contributed to a session of the open data research groups at Delft University. They do this a few times per year to discuss ongoing research and explore emerging questions that can lead to new research. I’ve taken part a few times in the past, and this time they asked me to provide an overview of what I see as current developments.

Some of the things I touched upon are similar to the remarks I made in Serbia during Open Data Week in Belgrade. The new PSI Directive proposal was also on the menu. I ended with the questions I think deserve attention. They are either about how to make sure that abstract norms get translated to the very practical and to the local level inside government, or about how to ensure that critical elements get connected and visibly stay that way (such as links between regular policy goals / teams and information management).

The slides are embedded below.

[slideshare id=102667069&doc=tudopenquestions-180619173722]

In the second part of our afternoon, Iryna Susha and Bastiaan van Loenen took us through their research into the data protection steps in play in data collaboratives. This I found very worthwhile, as data governance issues of collaborative groups (e.g. public and private entities around the energy transition) regularly surface in my work, both where it threatens data sovereignty, and where collaboratively pooled data can hardly be shared because it has become impossible to navigate the contractual obligations connected to the data that was pooled.

To celebrate the launch of the GDPR last week Friday, Jaap-Henk Hoepman released his ‘little blue book’ (pdf) on Privacy Design Strategies (with a CC-BY-NC license). Hoepman is an associate professor with the Digital Security group of the iCIS institute at Radboud University.

I heard him speak a few months ago at a Tech Solidarity meet-up, and enjoyed his insights and pragmatic approaches (PDF slides here).

Data protection by design (together with a ‘state of the art’ requirement) forms the forward-looking part of the GDPR, where the minimum requirements are always evolving. The GDPR is designed to have a rising floor that way.
The little blue book has an easy to understand outline, which divides doing privacy by design into 8 strategies, each accompanied by a number of tactics, all of which can be used in parallel.

Those 8 strategies (shown in the image above) are divided into 2 groups, data oriented strategies and process oriented strategies.

Data oriented strategies:
Minimise (tactics: Select, Exclude, Strip, Destroy)
Separate (tactics: Isolate, Distribute)
Abstract (tactics: Summarise, Group, Perturb)
Hide (tactics: Restrict, Obfuscate, Dissociate, Mix)

Process oriented strategies:
Inform (tactics: Supply, Explain, Notify)
Control (tactics: Consent, Choose, Update, Retract)
Enforce (tactics: Create, Maintain, Uphold)
Demonstrate (tactics: Record, Audit, Report)

All come with examples, and the final chapters provide suggestions on how to apply them in an organisation.
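The outline is regular enough to capture as a simple data structure, e.g. to drive a design review checklist. A sketch; the dict arrangement is my own, while the strategy and tactic names come from the book:

```python
# Hoepman's eight privacy design strategies with their tactics,
# grouped along the book's data/process distinction
PRIVACY_STRATEGIES = {
    "data": {
        "Minimise": ["Select", "Exclude", "Strip", "Destroy"],
        "Separate": ["Isolate", "Distribute"],
        "Abstract": ["Summarise", "Group", "Perturb"],
        "Hide": ["Restrict", "Obfuscate", "Dissociate", "Mix"],
    },
    "process": {
        "Inform": ["Supply", "Explain", "Notify"],
        "Control": ["Consent", "Choose", "Update", "Retract"],
        "Enforce": ["Create", "Maintain", "Uphold"],
        "Demonstrate": ["Record", "Audit", "Report"],
    },
}

def review_checklist():
    """Flatten the strategies into (group, strategy, tactic) review items."""
    return [
        (group, strategy, tactic)
        for group, strategies in PRIVACY_STRATEGIES.items()
        for strategy, tactics in strategies.items()
        for tactic in tactics
    ]

# 26 tactics in total, each a concrete item to walk through in a design review
print(len(review_checklist()))
```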

Today is the day that enforcement of the GDPR, the new European data protection regulation, starts. A novel part of the GDPR is that the rights of the individual described by the data follow the data. So if a US company collects my data, they are subject to the GDPR.

Compliance with the GDPR is pretty much common sense, and not all that far from the data protection regulations that went before. You need to know which data you collect, have a proper reason why you collect it, have determined how long you keep the data, and have protections in place to mitigate the risks of data exposure. On top of that you need to be able to demonstrate those points, and people described by your data have rights (to see what you know about them, to correct things or have data deleted, and to export their data).

Compliance can be complicated if you don’t have your house fully in order, and need to take a lot of corrective steps to figure out what data you have, why you have it, whether it should be deleted, and whether your protection measures are adequate.

That is why when the law entered into force on May 24th 2016, 2 years ago, a transition period was created in which no enforcement would take place. Those 2 years gave companies ample time to reach compliance, if they weren’t compliant already.

The GDPR sets a de facto global norm and standard, as EU citizens' data always falls under the GDPR, regardless of where the data is located. US companies therefore need to comply as well when they hold data about European people.

Today, at the start of GDPR enforcement, it turns out many US press outlets have not put the transition period to good use, even though they have reported on the GDPR. They now block European IP addresses while they 'look at options' to be available again to EU audiences.

From the east coast

to the west coast

In both cases the problem is likely how to deal with the 15 or so trackers that those sites use to collect visitor data.

The LA Times, for instance, has previously reported on the GDPR, so they knew it existed.

A few days ago they asked their readers “Is your company ready?”, and last month they asked if the GDPR will help US citizens with their own privacy.

The LA Times' own answers to those questions at the moment are “No” and “Not if you’re reading our newspaper”.

The US government is considering whether to start charging again for satellite imagery and data from the Landsat satellites, according to an article in Nature.

Officials at the Department of the Interior, which oversees the USGS, have asked a federal advisory committee to explore how putting a price on Landsat data might affect scientists and other users; the panel’s analysis is due later this year. And the USDA is contemplating a plan to institute fees for its data as early as 2019.

Exploring “how putting a price on Landsat data might affect” the users of the data will, I feel, result in predictable answers:

  • Public digital government held data, such as Landsat imagery, is both non-rivalrous and non-exclusionary.
  • The initial production costs of such data may be very high, and surely is in the case of satellite data as it involves space launches. Yet these costs are made in the execution of a public and mandated task, and as such are sunk costs. These costs are not made so others can re-use the data, but made anyway for an internal task (such as national security in this case).
  • The copying and distribution costs of additional copies of such digital data are marginal, tending to zero.
  • Government-held data usually, and certainly in the case of satellite data, constitutes a (near) monopoly, with no easily available alternatives. As a consequence price elasticity is above 1: when the price of such data is reduced, the demand for it rises non-linearly. The inverse is also true: setting a price for government data that is currently free will not mean all current users pay; it will mean a disproportionate part of current usage simply evaporates, leaving usage much lower both in the number of users and in the volume of usage per user.
  • Data sales from one public entity to another publicly funded one, such as academic institutions in this case, are always a net loss to the public sector, due to administration, transaction and enforcement costs. It moves money from one pocket to another of the same outfit, but that transfer itself costs money.
  • The (socio-economic) value of re-using such data is always higher than the possible revenue from selling it. That value also accrues to the public sector in the form of additional tax revenue, which over time will always outgrow the lost revenue from data sales. Free provision, or at most provision at marginal cost (the true incremental cost of providing the data to one single additional user), is economically the only logical path.
  • Additionally, the value of data re-use is not limited to the first order of re-use (here e.g. the academic research it enables), but has “downstream” higher-order and network effects, e.g. the value that such academic research results create in society, in this case for instance in agriculture, public health and climate impact mitigation. “Upstream” value is derived from re-use as well, e.g. in the form of data quality improvement.

This precisely was why the data was made free in 2008 in the first place:

Since the USGS made the data freely available, the rate at which users download it has jumped 100-fold. The images have enabled groundbreaking studies of changes in forests, surface water, and cities, among other topics. Searching Google Scholar for “Landsat” turns up nearly 100,000 papers published since 2008.

That 100-fold jump in usage? That’s the price elasticity above 1 that I mentioned. It is a regularly recurring pattern wherever fees for data are dropped, whether it concerns statistical, meteorological, hydrological, cadastral, business register or indeed satellite data.

The economic benefit of the free Landsat data was estimated by the USGS in 2013 at $2 billion per year, while the programme costs about $80 million per year. That’s an ROI factor of 25 for the US government. Even if the total combined tax burden (payroll, sales/VAT, income, profit, dividend etc.) on that economic benefit were as low as 4%, it would still mean no net loss to the US government.
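A quick back-of-the-envelope check of those figures (the amounts are the USGS estimates quoted above; the 4% tax take is purely the illustrative rate from the text, not an actual tax statistic):

```python
# USGS 2013 estimates, in US dollars per year
benefit = 2_000_000_000  # economic benefit of free Landsat data
cost = 80_000_000        # annual programme cost

roi_factor = benefit / cost       # 25.0: each dollar spent yields $25 of benefit

# Even a combined tax take of only 4% on that benefit
# fully covers the programme cost.
tax_rate = 0.04
tax_revenue = benefit * tax_rate  # $80 million, equal to the programme cost
breaks_even = tax_revenue >= cost # True
```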

It’s not surprising then that, when a committee was previously asked in 2012 to look into reinstating fees for Landsat data, it concluded:

“Landsat benefits far outweigh the cost”. Charging money for the satellite data would waste money, stifle science and innovation, and hamper the government’s ability to monitor national security, the panel added. “It is in the U.S. national interest to fund and distribute Landsat data to the public without cost now and in the future,”

European satellite data open by design

In contrast, the European Space Agency’s Copernicus programme, a multi-year effort to launch a range of Sentinel satellites for earth observation, is designed to provide free and open data. In fact my company, together with EARSC, has in the past 2 years documented, and will in the coming 3 years document, over 25 cases establishing the socio-economic impact of the usage of this data, showing both primary and network effects, for instance for ice breakers in Finnish waters, Swedish forestry management, Danish precision farming, and preventative maintenance of Dutch gas mains and infrastructure subsidence.

(Nature article found via Tuula Packalen)

Many tech companies are rushing to arrange compliance with the GDPR, Europe’s new data protection regulation. What has landed in my inbox thus far is not encouraging. Like Facebook, other platforms clearly struggle with, or hope to get away with, partially or completely ignoring the concepts of informed consent, unforced consent and proving consent. One would suspect the latter, as Facebook’s removal of 1.5 billion users from EU jurisdiction is a clear step to reduce potential exposure.

Where consent by the data subject is the basis for data collection: informed consent means consent needs to be explicitly given for each specific use of person-related data, based on an explanation, clear to a layperson, of the reason for collecting the data and how precisely it will be used.
Unforced means consent cannot be tied to the core services of the controlling/processing company when that data isn’t necessary to perform a service. In other words, “if you don’t like it, delete your account” is forced consent. Otherwise, the right to revoke one or several of the consents given becomes impossible.
Additionally, a company needs to be able to show that consent has been given, where consent is claimed as the basis for data collection.

Instead I got this email from Twitter earlier today:

“We encourage you to read both documents in full, and to contact us as described in our Privacy Policy if you have questions.”

and then

followed by

You can also choose to deactivate your Twitter account.

The first two bits mean consent is not informed, and that it isn’t even explicit consent but merely assumed consent. The last bit means it is forced. On top of that, Twitter will not be able to show that consent was given (as it is merely assumed from using their service). That’s not how this is meant to work. Non-compliant, in other words. (IANAL though)

Just received an email from Sonos (the streaming speaker system) about the changes they are making to their privacy statement. As with FB in my previous posting, this is triggered by the GDPR being enforced from the end of May.

The mail reads in part

We’ve made these changes to comply with the high demands made by the GDPR, a law adopted in the European Union. Because we think that all owners of Sonos equipment deserve these protections, we are implementing these changes globally.

This is precisely the hoped-for effect, I think. Setting high standards in a key market will lift those standards globally. It is usually more efficient to work internally according to one standard than to maintain two or more in parallel. Good to see it happening, as it is a starting point for positioning Europe as a distinct player in global data politics, with ethics by design as the distinctive proposition. The GDPR isn’t written as a source of red tape and compliance costs, but to level the playing field and enable companies to compete by building on data protection compliance (by demanding ‘data protection by design’ and adherence to the ‘state of the art’, both of which are rising thresholds). Non-compliance in turn is becoming the more costly option (if the GDPR really gets enforced, that is).

It seems, from a preview for journalists, that the changes Facebook will be making to its privacy controls for the GDPR, and especially the data controls a user has, are rather unimpressive. I had hoped that with the new option to select ranges of your data for download, you would also be able to delete specific ranges of data. This would be a welcome change, as the current options are deleting every single data item by hand, or deleting everything by deleting your account. Under the GDPR I had expected more control over data on FB.

It also seems they are keeping the design imbalanced, favouring ‘let us do anything’ as the simplest route for users to click through, presenting other options very low-key, and still not making the account deletion option directly accessible in your settings.

They may or may not be deemed to have done enough towards implementing GDPR by the data protection authorities in the EU after May 25th, but that’s of little use to anyone now.

So my intention to delete my FB history still means the full deletion of my account. Which will be effective end of this week, when the 14 day grace period ends.

Jonathan Gray has published an article on Data Worlds, as a way to better understand and experiment with the consequences of the datafication of our lives. The article appeared in Krisis, an open access journal for contemporary philosophy, in its latest edition, which deals with Data Activism.

Jonathan Gray writes

The notion of data worlds is intended to make space for thinking about data as more than simply a representational resource, and the politics of data as more than a matter of liberation and protection. It is intended to encourage exploration of the performative capacities of data infrastructures: what they do and could do differently, and how they are done and could be done differently. This includes consideration of, as Geoffrey Bowker puts it, “the ways in which our social, cultural and political values are braided into the wires, coded into the applications and built into the databases which are so much a part of our daily lives”

He describes 3 ‘data worlds’, and positions them as an instrument intended for practical usage.

The three aspects of data worlds which I examine below are not intended to be comprehensive, but illustrative of what is involved in data infrastructures, what they do, and how they are put to work. As I shall return to in the conclusion, this outline is intended to open up space for not only thinking about data differently, but also doing things with data differently. The test of these three aspects is therefore not only their analytical purchase, but also their practical utility.

Those 3 worlds mentioned are

  1. Data Worlds as Horizons of Intelligibility, where data plays a role in changing what is sayable, knowable, intelligible and experienceable, and where data allows us to explore new perspectives, arrive at new insights or even a new overall understanding. Hans Rosling’s work with Gapminder falls in this space, as do data visualisations that combine time and geography. To me this feels close to what John Thackara calls Macroscopes, where one finds a way to understand complete systems and one’s own place and role in them, not just one’s own position. (a posting on Macroscopes will be coming)
  2. Data Worlds as Collective Accomplishments, where consequences (political, social, economic) result not just from one or a limited number of actors, but from a wide variety of them. Open data ecosystems and the shifts in how civil society, citizens and governments interact, but also big data efforts by the tech industry, are examples Gray cites. “Looking at data worlds as collective accomplishments includes recognising the role of actors whose contributions may otherwise be under-recognised.”
  3. Data Worlds as Transnational Coordination, in terms of networks, international institutions and norm setting, which aim to “shape the world through coordination of data“. In this context one can think of things like IATI, a civic initiative bringing standardisation and transparency to international aid globally, but also the GDPR, through which the EU sets a new de facto global standard on data protection.

This seems at first reading like a useful thinking tool in exploring the consequences and potential of various values and ethics related design choices.

(Disclosure: Jonathan Gray and I were both active in the early European open data community, and are co-authors of the first edition/iteration of the Open Data Handbook in 2010)