Dutch Provinces publish open data, but it always looks like it is mostly geo-data, and hardly anything else. When talking to provinces I also get the feeling they struggle to think of data that isn’t of a geographic nature. That isn’t very surprising: a lot of the public tasks carried out by provinces have to do with spatial planning, nature and environment, and geographic data is a key tool for them. But now that we are aiding several provinces with extending their data provision, I wanted to find out in more detail.

My colleague Niene took the API of the Dutch national open data portal for a spin, and made a list of all datasets listed as stemming from a province.
I took that list and zoomed in on various aspects.

At first glance there are strong differences between the provinces: some publish a lot, others hardly anything. The Province of Utrecht publishes everything twice to the national data portal, once through the national geo-register, once through their own dataplatform. The graph below has been corrected for it.

What explains those differences? And what is the nature of the published datasets?

Geo-data is dominant
First I made a distinction between data that stems from the national geo-register (NGR) to which all provinces publish, and data that stems from another source (either regional dataplatforms, or for instance direct publication through the national open data portal). The NGR is theoretically the place where all provinces share geo-data with other government entities, part of which is then marked as publicly available. In practice the numbers suggest Provinces roughly publish to the NGR in the same proportions as the graph above (meaning that of what they publish in the NGR they mark about the same percentage as open data).

  • Of the over 3000 datasets that are published by provinces as open data in the national open data portal, only 48 don’t come from the national geo-register. This is about 1.5%.
  • Of the 12 provinces, 4 do not publish anything outside the NGR: Noord-Brabant, Zeeland, Flevoland, Overijssel.
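
The counting behind these numbers can be sketched roughly as follows. This is a minimal illustration, not the script we actually used: the record fields `province`, `title` and `from_ngr` are hypothetical names for whatever the portal listing provides, and removing duplicates by title is only one possible way to correct for a province publishing the same set through two channels.

```python
from collections import Counter

def tally_by_province_and_source(datasets):
    """Count datasets per (province, source), skipping duplicate titles.

    `datasets` is an iterable of dicts with the hypothetical keys
    'province', 'title' and 'from_ngr' (True if the record came in
    through the national geo-register).
    """
    seen = set()
    counts = Counter()
    for record in datasets:
        # Treat the same title within one province as the same dataset,
        # so a set published through two channels is counted once.
        key = (record["province"], record["title"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        source = "NGR" if record["from_ngr"] else "other"
        counts[(record["province"], source)] += 1
    return counts

def provinces_without_non_ngr(counts):
    """Return the provinces that have no dataset outside the NGR."""
    provinces = sorted({province for province, _ in counts})
    return [p for p in provinces if counts[(p, "other")] == 0]
```

Note that the first occurrence of a duplicate wins here, so the order of the input decides whether a doubly published set is counted as NGR or as other.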

Drenthe stands out in terms of numbers of geo-data sets published, over 900. A closer look at their list shows that they publish more historic data, and that they seem to be more complete (more of what they share in the NGR is apparently marked as open data). The typical number is between 200 and 300, with provinces like Zuid-Holland, Noord-Holland, Gelderland, Utrecht, Groningen, and Fryslan in that range. Overijssel, like Drenthe, publishes more, though less than Drenthe at about 500. This seems to be the result of a direct connection to the NGR from their regional geo-portal, and thus publishing by default. Overijssel deliberately does not publish historic data, explaining some of the difference with Drenthe. (When something is updated in Overijssel the previous version is automatically removed. This clashes with open data good practice, but is currently hard to fix in their processes.)

If it isn’t geo, it hardly exists
Of the mere 48 data sets outside the NGR, just 22 (46%) are not geo-related. Overall this means that less than 1% of all open data provinces publish is not geo-data.
Of those 22, exactly half are published by Zuid-Holland alone. They for instance publish several photo-archives, a subsidy register, politician’s expenses, and formal decisions.
Fryslan is the only province publishing an inventory of their data holdings, which is 1 of their only 3 non geo-data sets.
Gelderland stands out as the single province that publishes all their geo data through the NGR, hinting at a neatly organised process. Their non-NGR open data is also all non-geo (as it should be). They publish 27% of all open non-geo data by provinces, and together with Zuid-Holland account for 77% of it.
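
For those who want to check the percentages, here is a small sketch of the arithmetic. Two values are assumptions on my part: the total of 3000 is the rounded lower bound of ‘over 3000’, so the real shares are slightly lower, and the 6 sets for Gelderland are inferred from the 27% and not stated as a count above.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

TOTAL_OPEN = 3000        # assumption: rounded lower bound of "over 3000"
NON_NGR = 48             # datasets not coming from the NGR
NON_GEO = 22             # of those 48, the ones that are not geo-related
ZUID_HOLLAND = 11        # "exactly half" of the 22
GELDERLAND = 6           # assumption: inferred from the stated 27%

print(f"non-NGR share of all open data:  {pct(NON_NGR, TOTAL_OPEN):.1f}%")
print(f"non-geo share of non-NGR data:   {pct(NON_GEO, NON_NGR):.0f}%")
print(f"non-geo share of all open data:  {pct(NON_GEO, TOTAL_OPEN):.1f}%")
print(f"Zuid-Holland share of non-geo:   {pct(ZUID_HOLLAND, NON_GEO):.0f}%")
print(f"Gelderland share of non-geo:     {pct(GELDERLAND, NON_GEO):.0f}%")
print(f"Both provinces together:         {pct(ZUID_HOLLAND + GELDERLAND, NON_GEO):.0f}%")
```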

Taking these numbers and comparing them to inventories like the one Fryslan publishes (which we made for them in 2016), and the one for Noord-Holland (which we did in 2013), the dominance of geo-data is not surprising in itself. Roughly 80% of data provinces hold is geo related. Just about a fifth to a quarter of this geo-data (15%-20% of the total) is on average published at the moment, yet it makes up over 99% of all provincial open data published. This lopsidedness means that hardly anything on the inner workings of a province, the effectiveness of policy implementation etc. is available as open data.

Where the opportunities are
To improve both on the volume and on the breadth of scope of the data provinces publish, two courses of action stand open.
First, extending the availability of geo-data provinces hold. Most provinces will have a clear process for this, and it should therefore be relatively easy to do. Most provinces should be able to get to where Drenthe currently is.
Second, take a much closer look at the in-house data that is not geo-related. About 20% of dataholdings fall in this category, and based on the inventories we did, some 90% of that should be publishable, maybe after some aggregation or other adaptations.
The lack of an inventory is an obstacle here, but existing inventories should at least be able to point the other provinces in the right direction.

Make the provision of provincial open geodata complete, embrace its dominance and automate it with proper data governance. Focus your energy on publishing ‘the rest’ where all the data on the inner workings of the province is. Provinces perpetually complain nobody is aware of what they are doing and their role in Dutch governance. Make it visible, publish your data. Stop making yourself invisible behind a stack of maps only.

We’re currently helping the Province of South-Holland to extend their open data provision. Next to looking at data they hold relevant to key policy domains, we also look at what other data is available elsewhere for those domains, for instance nationwide datasets with a locally granular level of detail. In those cases it can be of interest to take the subset relevant for the Province and republish that through their own channels.

One of the relevant topics is energy transition (to sustainable energy sources). Current and historic household usage is of interest here. The companies that maintain the grid publish yearly data per postcode, or at least some of them do. There are seven of these companies.
Luckily all three companies active in South-Holland do publish that data.


In South-Holland three companies are active (number 3, 5 and 6)
(Source: Energielevernanciers.nl)

Having this subset of data is useful for any organisation in the region that wants to limit the amount of data they have to dig through to get what they need, for the provincial organisation itself, and for individual citizens. Households that have digital meters have access to their daily energy usage readings online. This data allows them to easily compare their personal usage with their neighbours and wider surrounding area. For instance I established that our usage is lower for both electricity and gas than average in our street. It is also easier to map, or otherwise visualise, in a meaningful way for the province and relevant regional stakeholders.

Here’s a brief overview of the steps we’re taking to get to a province-wide data set.

  • Download the data for the years available for Westland, Liander and Stedin (Westland goes back to 2010, the others to 2008)
  • Check the data formats: Westland and Stedin provide CSV, Liander XLSX
  • Check data structure: all use the same structure of fields and conventions
  • To get only the data for South-Holland we use the postcode that is mentioned in the data.
  • The Dutch postcode zones do not conform to provincial boundaries however, so we take the list of four position postcodes and determine the ones that fall within South-Holland:
    • 1428-1429
    • 2159-2164
    • 2170-3381
    • 3465-3466
    • 4126-4129
    • 4140-4146
    • 4163-4169
    • 4200-4209
    • 4213
    • 4220-4249
  • The data contains 6 position postcodes of the structure 1234AB. We need to split them into the four digits and the two letters, to be able to match them with the ranges that fall within the province.
  • For personal data protection purposes, 6 position postcodes with fewer than 10 addresses are aggregated in the data with a neighbouring postcode, until the number of addresses is higher than 9. It is not certain that those aggregations fall within a single province. Each record provides a ‘from’ 6 position postcode and a ‘to’ 6 position postcode. These have the same value where the number of addresses in a postcode is high enough, but can span a wider range.
    • We need to test if the entire postcode range in a single data record falls within one of the ranges of postcodes that belong in South-Holland.
    • For the small number of aggregates that fall into two provinces we can adopt the average usage number, but need to mark that the number of households in that area is unknown,
    • or retrieve the actual number of addresses from the national address and building database, and mark that the average energy usage values are from a larger number of addresses.
    • Alternatively we can keep the entire range, including the part outside the province,
    • or we exclude the entire range and leave a ‘hole in the map’.
    • In any case we need to mark in the data what we did, and why.
  • The result is then a data set in CSV that consolidates the three sources for all those records that fall within the province.
  • This dataset can then be mapped, e.g. in QGIS or other tools in use within the Province of South-Holland.
  • We provide a recipe and/or script from the above steps that can take the future yearly data sets from the three sources and turn them into a consolidated subset for South-Holland, so that the province can automate keeping the data up to date.
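
The postcode matching in the steps above could look something like the sketch below. It is a first approximation under stated assumptions, not the final script: the function names are my own, and it only classifies a record as inside, outside or straddling the provincial boundary. What to do with the straddling records is still the choice described above.

```python
import re

# Four digit postcode ranges within South-Holland, as listed above (inclusive).
SOUTH_HOLLAND_RANGES = [
    (1428, 1429), (2159, 2164), (2170, 3381), (3465, 3466), (4126, 4129),
    (4140, 4146), (4163, 4169), (4200, 4209), (4213, 4213), (4220, 4249),
]

# A 6 position postcode: four digits, optional space, two letters.
POSTCODE_RE = re.compile(r"^\s*(\d{4})\s*([A-Za-z]{2})\s*$")

def split_postcode(postcode):
    """Split '1234AB' into (1234, 'AB')."""
    match = POSTCODE_RE.match(postcode)
    if match is None:
        raise ValueError(f"not a 6 position postcode: {postcode!r}")
    return int(match.group(1)), match.group(2).upper()

def in_south_holland(digits):
    """True if a four digit postcode falls in one of the listed ranges."""
    return any(low <= digits <= high for low, high in SOUTH_HOLLAND_RANGES)

def classify_record(postcode_from, postcode_to):
    """Classify a 'from'/'to' record as 'inside', 'outside' or 'straddles'."""
    start, _ = split_postcode(postcode_from)
    end, _ = split_postcode(postcode_to)
    if start > end:
        start, end = end, start
    # Inside: the whole record fits within a single South-Holland range.
    if any(low <= start and end <= high for low, high in SOUTH_HOLLAND_RANGES):
        return "inside"
    # Outside: the record does not overlap any South-Holland range.
    if not any(start <= high and end >= low for low, high in SOUTH_HOLLAND_RANGES):
        return "outside"
    # Otherwise the aggregate crosses the provincial boundary.
    return "straddles"
```

For example, `classify_record("3381ZZ", "3382AA")` returns `"straddles"`, because 3381 is the last four digit postcode in its range and 3382 is outside it. Those are the records that need to be marked in the consolidated data set.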

Today I contributed to a session of the open data research groups at Delft University. They do this a few times per year to discuss ongoing research and explore emerging questions that can lead to new research. I’ve taken part a few times in the past, and this time they asked me to provide an overview of what I see as current developments.

Some of the things I touched upon are similar to the remarks I made in Serbia during Open Data Week in Belgrade. The new PSI Directive proposal also was on the menu. I ended with the questions I think deserve attention. They are either about how to make sure that abstract norms get translated to the very practical, and to the local level inside government, or how to ensure that critical elements get connected and visibly stay that way (such as links between regular policy goals / teams and information management)

The slides are embedded below.

Iryna Susha and Bastiaan van Loenen in the second part of our afternoon took us through their research into the data protection steps that are in play in data collaboratives. This I found very worthwhile, as data governance issues of collaborative groups (e.g. public and private entities around energy transition) are regularly surfacing in my work, both where it threatens data sovereignty for instance, and where collaboratively pooled data can hardly be shared because it has become impossible to navigate the contractual obligations connected to the data that was pooled.

TL;DR

The European Commission proposed a new PSI Directive, which describes when and how publicly held data can be re-used by anyone (aka open government data). The proposal contains several highly interesting elements: it extends the scope to public undertakings (utilities and transport mostly) and research data, it limits the ways in which government can charge for data, introduces a high value data list which must be freely and openly available, mandates APIs, and makes de-facto exclusive arrangements transparent. It also calls for delegated powers for the EC to change practical details of the Directive in future, which opens interesting possibilities. In the coming months (years) it remains to be seen what the Member States and the European Parliament will do to weaken or strengthen this proposal.

Changes in the PSI Directive announced

On 25 April, the European Commission announced new measures to stimulate the European data economy, said to be building on the GDPR, as well as detailing the European framework for the free flow of non-personal data. The EC announced new guidelines for the sharing of scientific data, and for how businesses exchange data. It announced an action plan that increases safeguards on personal data related to health care and seeks to stimulate European cooperation on using this data. The EC also proposes to change the PSI Directive which governs the re-use of public sector information, commonly known as Open Government Data. In previous months the PSI Directive was evaluated (see an evaluation report here, in which my colleague Marc and I were involved)

This post takes a closer look at what the EC proposes for the PSI Directive. (I did the same thing when the last version was published in 2013)
This is of course a first proposal from the EC, and it may significantly change as a result of discussions with Member States and the European Parliament, before it becomes finalised and enters into law. Taking a look at the proposed new directive is of interest to see what’s new, what from an open data perspective is missing, and to see where debate with MS is most likely. Square bullets indicate the more interesting changes.

The Open Data yardstick

The original PSI Directive was adopted in 2003 and a revised version implemented in 2015. Where the original PSI Directive stems from well before the emergence of the Open Data movement, and was written with mostly ‘traditional’ and existing re-users of government information in mind, the 2015 revision already adopted some elements bringing it closer to the Open Definition. With this new proposal, again the yardstick is how it increases openness and sets minimum requirements that align with the open definition, and how much of it will be mandatory for Member States. So, scope and access rights, redress, charging and licensing, standards and formats are important. There are also some general context elements that stand out from the proposal.

A floor for the data-based society

In the recital for the proposal what jumps out is a small change in wording concerning the necessity of the PSI Directive. Where it used to say “information and knowledge” it now says “the evolution towards a data-based society influences the life of every citizen”. Towards the end of the proposal it describes the Directive as a means to improve the proper functioning of the European data economy, where it used to read ‘content industry’. The proposed directive lists minimum requirements for governments to provide data in ways that enable citizens and economic activity, but suggests Member States can and should do more, and not just stick with the floor this proposal puts in place.

Novel elements: delegated acts, public undertakings, dynamic data, high value data

There are a few novel elements spread out through the proposal that are of interest, because they seem intended to make the PSI Directive more flexible with an eye to the future.

  • The EC proposal adds the ability to create delegated acts. This would allow practical changes without the need to revise the PSI Directive and have it transposed into national law by each Member State. While this delegated power cannot be used to change the principles in the directive, it can be used to tweak it. Concerning charging, scope, licenses and formats this would provide the EC with more elbow room than the existing ability to merely provide guidance. The article is added to be able to maintain a list of ‘high value data sets’, see below.
  • Public undertakings are defined and mentioned in parallel to public sector bodies in each provision. Public undertakings are all those that are (in)directly owned by government bodies, significantly financed by them or controlled by them through regulation or decision making powers. It used to say only public sector, basically allowing governments to withdraw data from the scope of the Directive by putting them at a distance in a private entity under government control. While the scope is enlarged to include public undertakings in specific sectors only, the rest of the proposal refers to public undertakings in general. This is significant I think, given the delegated powers the EC also seeks.
  • Dynamic and real-time data is brought firmly in scope of the Directive. There have been court cases where data provision was refused on the grounds that the data did not exist when the request was made. That will no longer be possible with this proposal.
  • The EC wants to make a list of ‘high value datasets’ for which more things are mandatory (machine readable, API, free of charge, open standard license). It will create the list through the mentioned delegated powers. In my experience deciding on high value data sets is problematic (What value, how high? To whom?) and reinforces a supply-side perspective over a demand-driven approach. The Commission defines high value as “being associated with important socio-economic benefits” due to their suitability for creating services, and “the number of potential beneficiaries” of those services based on these data sets.

Access rights and scope

  • Public undertakings in specific sectors are declared within scope. These sectors are water, gas/heat, electricity, ports and airports, postal services, water transport and air transport. These public undertakings are only within scope in the sense that requests for re-use can be submitted to them. They are under no obligation to release data.
  • Research data from publicly funded research that are already made available e.g. through institution repositories are within scope. Member States shall adopt national policies to make more research data available.
  • A previous scope extension (museums, archives, libraries and university libraries) is maintained. For educational institutions a clarification is added that it only concerns tertiary education.
  • The proposed directive builds as before on existing access regimes, and only deals with the re-use of accessible data. This maintains existing differences between Member States concerning right to information.
  • Public sector bodies, although they retain any database rights they may have, cannot use those database rights to prevent or limit re-use.

Asking for documents to re-use, and redress mechanisms if denied

  • The way in which citizens can ask for data or the way government bodies can respond, has not changed
  • The redress mechanisms haven’t changed, and public undertakings, educational institutes, research organisations and research funding organisations do not need to provide one.

Charging practices

  • The proposal now explicitly mentions free of charge data provision as the first option. Fees are otherwise limited to at most ‘marginal costs’
  • The marginal costs are redefined to include the costs of anonymizing data and protecting commercially confidential material. The full definition now reads “ marginal costs incurred for their reproduction, provision and dissemination and where applicable anonymisation of personal data and measures to protect commercially confidential information.” While this likely helps in making more data available, in contrast to a blanket refusal, it also looks like externalising costs on the re-user of what is essentially badly implemented data governance internally. Data holders already should be able to do this quickly and effectively for internal reporting and democratic control. Marginal costing is an important principle, as in the case of digital material it would normally mean no charges apply, but this addition seems to open up the definition to much wider interpretation.
  • The ‘marginal costs at most’ principle only applies to the public sector. Public undertakings and museums, archives etc. are excepted.
  • As before public sector bodies that are required (by law) to generate revenue to cover the costs of their public task performance are excepted from the marginal costs principle. However a previous exception for other public sector bodies having requirements to charge for the re-use of specific documents is deleted.
  • The total revenue from allowed charges may not exceed the total actual cost of producing and disseminating the data plus a reasonable return on investment. This is unchanged, but the ‘reasonable return on investment’ is now defined as at most 5 percentage points above the ECB fixed interest rate.
  • Re-use of research data and the high value data-sets must be free of charge. In practice various data sets that are currently charged for are also likely high value datasets (cadastral records, business registers for instance). Here the views of Member States are most likely to clash with those of the EC

Licensing

  • The proposal contains no explicit move towards open licenses, and retains the existing rules that standard license should be available, and those should not unnecessarily restrict re-use, nor restrict competition. The only addition is that Member States shall not only encourage public sector bodies but all data holders to use such standard licenses
  • High value data sets must have a license compatible with open standard licenses.

Non-discrimination and Exclusive agreements

  • Non-discrimination rules in how conditions for re-use are applied, including for commercial activities by the public sector itself, are continued
  • Exclusive arrangements are not allowed for public undertakings, as before for the public sector, with the same existing exceptions.
  • Where new exclusive rights are granted the arrangements now need to be made public at least two months before coming into force, and the final terms of the arrangement need to be transparent and public as well.
  • Important is that any agreement or practical arrangement with third parties that in practice results in restricted availability for re-use of data other than for those third parties, also must be published two months in advance, and the final terms also made transparent and public. This concerns data sharing agreements and other collaborations where a few third parties have de facto exclusive access to data. With all the developments around smart cities where companies e.g. have access to sensor data others don’t, this is a very welcome step.

Formats and standards

  • Public undertakings will need to adhere to the same rules as the public sector already does: open standards and machine readable formats should be used for both documents and their metadata, where easily possible, but otherwise any pre-existing format and language is acceptable.
  • Both public sector bodies and public undertakings should provide APIs to dynamic data, either in real time, or if that is too costly within a timeframe that does not unduly impair the re-use potential.
  • High value data sets must be machine readable and available through an API

Let’s see how the EC takes this proposal forward, and what the reactions of the Member States and the European Parliament will be.

The US government is looking at whether to start asking money again for providing satellite imagery and data from Landsat satellites, according to an article in Nature.

Officials at the Department of the Interior, which oversees the USGS, have asked a federal advisory committee to explore how putting a price on Landsat data might affect scientists and other users; the panel’s analysis is due later this year. And the USDA is contemplating a plan to institute fees for its data as early as 2019.

To “explore how putting a price on Landsat data might affect” the users of the data, will result in predictable answers, I feel.

  • Public digital government held data, such as Landsat imagery, is both non-rivalrous and non-exclusionary.
  • The initial production costs of such data may be very high, and surely is in the case of satellite data as it involves space launches. Yet these costs are made in the execution of a public and mandated task, and as such are sunk costs. These costs are not made so others can re-use the data, but made anyway for an internal task (such as national security in this case).
  • The copying costs and distribution costs of additional copies of such digital data is marginal, tending to zero
  • Government held data usually, and certainly in the case of satellite data, constitute a (near) monopoly, with no easily available alternatives. As a consequence price elasticity is above 1: when the price of such data is reduced, the demand for it will rise non-linearly. The inverse is also true: setting a price for government data that currently is free will not mean all current users will pay, it will mean a disproportionate part of current usage will simply evaporate, and the usage will be much less both in terms of numbers of users as well as of volume of usage per user.
  • Data sales from one public entity to another publicly funded one, such as in this case academic institutions, are always a net loss to the public sector, due to administration costs, transaction costs and enforcement costs. It moves money from one pocket to another of the same outfit, but that transfer costs money itself.
  • The (socio-economic) value of re-use of such data is always higher than the possible revenue of selling that data. That value will also accrue to the public sector in the form of additional tax revenue. Loss of revenue from data sales will always over time become smaller than that. Free provision or at most at marginal costs (the true incremental cost of providing the data to one single additional user) is economically the only logical path.
  • Additionally the value of data re-use is not limited to the first order of re-use (in this case e.g. academic research it enables), but knows “downstream” higher order and network effects. E.g. the value that such academic research results create in society, in this case for instance in agriculture, public health and climatic impact mitigation. Also “upstream” value is derived from re-use, e.g. in the form of data quality improvement.

This precisely was why the data was made free in 2008 in the first place:

Since the USGS made the data freely available, the rate at which users download it has jumped 100-fold. The images have enabled groundbreaking studies of changes in forests, surface water, and cities, among other topics. Searching Google Scholar for “Landsat” turns up nearly 100,000 papers published since 2008.

That 100-fold jump in usage? That’s the price elasticity being higher than 1, I mentioned. It is a regularly occurring pattern where fees for data are dropped, whether it concerns statistics, meteo, hydrological, cadastral, business register or indeed satellite data.

The economic benefit of the free Landsat data was estimated by the USGS in 2013 at $2 billion per year, while the programme costs about $80 million per year. That’s an ROI factor for US Government of 25. If the total combined tax burden (payroll, sales/VAT, income, profit, dividend etc) on that economic benefit would only be as low as 4% it still means it’s no loss to the US government.
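
The arithmetic behind those two figures is simple enough to write down. The sketch below only uses the two amounts mentioned above; the variable names are mine.

```python
# Figures as mentioned above (USGS estimate from 2013).
yearly_benefit_usd = 2_000_000_000   # estimated economic benefit per year
yearly_cost_usd = 80_000_000         # programme cost per year

# How many dollars of economic benefit each dollar of cost yields.
roi_factor = yearly_benefit_usd / yearly_cost_usd

# The combined tax rate on that benefit at which the tax revenue
# equals the programme cost, so the government breaks even.
break_even_tax_rate = yearly_cost_usd / yearly_benefit_usd

print(f"ROI factor: {roi_factor:.0f}")
print(f"Break-even tax rate: {break_even_tax_rate:.0%}")
```

Any combined tax burden above that break-even rate means the programme more than pays for itself from the government’s own perspective.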

It’s not surprising then, when previously in 2012 a committee was asked to look into reinstating fees for Landsat data, it concluded

“Landsat benefits far outweigh the cost”. Charging money for the satellite data would waste money, stifle science and innovation, and hamper the government’s ability to monitor national security, the panel added. “It is in the U.S. national interest to fund and distribute Landsat data to the public without cost now and in the future,”

European satellite data open by design

In contrast the European Space Agency’s Copernicus program which is a multiyear effort to launch a range of Sentinel satellites for earth observation, is designed to provide free and open data. In fact my company, together with EARSC, in the past 2 years and in the coming 3 years will document over 25 cases establishing the socio-economic impact of the usage of this data, to show both primary and network effects, such as for instance for ice breakers in Finnish waters, Swedish forestry management, Danish precision farming and Dutch gas mains preventative maintenance and infrastructure subsidence.

(Nature article found via Tuula Packalen)

Which energy data is available as open data in the Netherlands, asked Peter Rukavina. He wrote about postal codes on Prince Edward Island where he lives, and in the comments I mentioned that postal codes can be used to provide granular data on e.g. energy consumption, while still aggregated enough to not disclose personally identifiable data. This as I know he is interested in energy usage and production data.

He then asked:

What kind of energy consumption data do you have at a postal code level in NL? Are your energy utilities public bodies?
Our electricity provider, and our oil and propane companies are all private, and do not release consumption data; our water utility is public, but doesn’t release consumption data and is not subject (yet) to freedom of information laws.

Let’s provide some answers.

Postal codes

Dutch postal codes have the structure ‘1234 AB’, where 12 denotes a region, 1234 denotes a village or neighbourhood, and AB a street or a section of a street. This makes them very useful as geographic references in working with data. Our postal code begins with 3825, which places it in the Vathorst neighbourhood, as shown on this list. In the image below you see the postal code 3825 demarcated on Google maps.
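
To show how that structure can be used as a geographic reference when working with data, here is a minimal sketch that splits a postal code into its three levels. The names are my own, the letters ‘AB’ in the usage line are placeholders, and the pattern only checks the general shape of four digits and two letters, not whether the code actually exists.

```python
import re
from typing import NamedTuple

class DutchPostcode(NamedTuple):
    region: str   # first two digits, e.g. '38'
    area: str     # all four digits: village or neighbourhood, e.g. '3825'
    street: str   # two letters: street or section of a street

# Four digits, an optional space, two letters.
_PATTERN = re.compile(r"^(\d{4})\s?([A-Za-z]{2})$")

def parse_postcode(text):
    """Parse '1234 AB' or '1234AB' into its region, area and street parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a Dutch postal code: {text!r}")
    digits, letters = match.groups()
    return DutchPostcode(region=digits[:2], area=digits, street=letters.upper())

code = parse_postcode("3825 AB")
print(code.region, code.area, code.street)
```

Grouping records on `area` gives neighbourhood level aggregates, while grouping on `region` gives a coarser regional picture from the same data.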

Postal codes are both commercially available as well as open data. Commercially available is a full set. Available as open data are only those postal codes that are connected to addresses tied to physical buildings. This as the base register of all buildings and addresses are open data in the Netherlands, and that register includes postal codes. It means that e.g. postal codes tied to P.O. Boxes are not available as open data. In practice getting at postal codes as open data is still hard, as you need to extract them from the base register, and finding that base register for download is actually hard (or at least used to be, I haven’t checked back recently).

On Energy Utilities

All energy utilities used to be publicly owned, but have since been privatised. Upon privatisation all utilities were separated into energy providers and energy transporters, called network maintainers. The network maintainers are private entities, but are publicly owned. They maintain both electricity mains as well as gas mains. There are 7 such network maintainers of varying sizes in the Netherlands

(Source: Energielevernanciers.nl)

The three biggest are Liander, Enexis and Stedin.
These network maintainers, although publicly owned, are not subject to Freedom of Information requests, nor subject to the law on Re-use of Government Information. Yet they do publish open data, and are open to data requests. Liander was the first one, and Enexis and Stedin both followed. The motivation for this is that they have a key role in the government goal of achieving full energy transition by 2050 (meaning no usage of gas for heating/cooking and fully CO2 neutral), and that they are key stakeholders in this area of high public interest.

Household Energy Usage Data

Open data is published by Liander, Enexis and Stedin, though not all publish the same type of data. All publish household level energy usage data aggregated to the level of 6 position postal codes (1234 AB), in addition to asset data (including sub soil cables etc) by Enexis and Stedin. The service areas of all 7 network maintainers are also open data. The network maintainers are also all open to additional data requests, e.g. for research purposes or for municipalities or housing associations looking for data to plan for energy saving projects. Liander indicated to me in a review for the European Commission (about potential changes to the EU public data re-use regulations), that they currently deny about 2/3 of data requests received, mostly because they are uncertain about which rules and contracts apply (they hold a large pool of data contributed by various stakeholders in the field, as well as all remotely read digital metering data). They are investigating how to improve on that response rate.

Some postal code areas are small and contain only a few addresses. In such cases this may lead to personally identifiable data, which is not allowed. Liander, Stedin and I assume Enexis as well, solve this by aggregating the average energy usage of the small area with an adjacent area until the number of addresses is at least 10.
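A minimal sketch of that aggregation step, in Python. It is an illustration under stated assumptions, not the network maintainers' actual procedure: "adjacent" is approximated here by postal code sort order, and the `Area` class, `combine` and `merge_small_areas` are hypothetical names. Averages are weighted by the number of connections so the combined average stays exact.

```python
from dataclasses import dataclass


@dataclass
class Area:
    postcode: str      # e.g. "1234AB"; merged areas get a "+"-joined label
    connections: int   # number of addresses in the area
    avg_usage: float   # average yearly usage per connection


def combine(a, b):
    """Merge two areas, weighting each average by its connection count."""
    total = a.connections + b.connections
    avg = (a.avg_usage * a.connections + b.avg_usage * b.connections) / total
    return Area(f"{a.postcode}+{b.postcode}", total, avg)


def merge_small_areas(areas, minimum=10):
    """Merge too-small areas with their neighbour until each has `minimum` addresses.

    Assumption: the neighbour is the next area in postal code order.
    """
    published = []
    pending = None  # an area (or merged group) still below the minimum
    for area in sorted(areas, key=lambda a: a.postcode):
        if pending is not None:
            area = combine(pending, area)
            pending = None
        if area.connections < minimum:
            pending = area
        else:
            published.append(area)
    if pending is not None:
        if published:
            # A trailing small area is folded into the last published group.
            published[-1] = combine(published[-1], pending)
        else:
            published.append(pending)  # nothing available to merge with
    return published


# Illustrative numbers only.
example = [Area("1234AA", 4, 3000.0), Area("1234AB", 8, 4500.0), Area("1234AC", 33, 4599.0)]
merged = merge_small_areas(example)
```

In this example the areas with 4 and 8 addresses end up as one group of 12, while the area with 33 addresses is published as is.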

Our address falls in the service area of Stedin. The most recent data is that of January 1st 2018, containing the energy use for all of 2017. Searching for our postal code (which covers the entire street) in their most recent CSV file yields results on lines 151,624 and 151,625:


The first line shows electricity usage (“ELK”), and says there are 33 households in the street, and the average yearly usage is 4599 kWh. (We are below that at around 3700 kWh / year, which is higher than we were used to in our previous home). The next line provides the data for gas usage (heating and cooking), “GAS”, which is 1280 m3 on average for the 33 connections. (We are slightly below that at 1200 m3).
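Such a lookup is easy to script. The sketch below is a hedged illustration: the column names, the semicolon delimiter and the sample rows are assumptions made for the example and not the actual layout of the Stedin file; only the 33 connections, 4599 kWh and 1280 m3 figures come from the lines described above, attached here to the placeholder postal code 1234AB.

```python
import csv
import io

# Illustrative stand-in for the real file; column names are hypothetical.
SAMPLE = """postcode;product;connections;avg_yearly_usage
1234AA;ELK;21;3100
1234AB;ELK;33;4599
1234AB;GAS;33;1280
"""


def find_postcode(csv_file, postcode, delimiter=";"):
    """Return (line number, row) pairs for every row matching the postal code."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    # reader.line_num is the number of the line just read; the header is line 1.
    return [(reader.line_num, row) for row in reader if row["postcode"] == postcode]


hits = find_postcode(io.StringIO(SAMPLE), "1234AB")
for line_number, row in hits:
    print(line_number, row["product"], row["connections"], row["avg_yearly_usage"])
```

For a real download you would pass an opened file instead of the `io.StringIO` sample, and adjust the column name and delimiter to whatever the file actually uses.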

Last week the Danish government further extended the data available through their open data distributor, and announced some impressive resulting impact from already available data.

In 2012 the roadmap Good Basic Data for Everyone was launched, which set out to create an open national data infrastructure of the 5 core data sets used by all layers of government (maps, address, buildings, companies, people, see image). I attended the internal launch at the Ministry, and my colleague Marc contributed to the financial reasoning behind it (PDF 1, PDF 2). The roadmap ran until 2016, and a new plan is now in operation that builds on that first roadmap.


An illustration from the Danish 2012 road map showing how the 5 basic data registers correlate, and how maps are at its base.

Data that increases the usability of those original 5 data sets is steadily being added to them. Last week all administrative geographic divisions were added (these are the geographic boundaries of municipalities, regions, 2200 parishes, jurisdictions, police districts, districts and zip-codes). This comes after last November’s addition of the place name register, and before this coming May’s publication of the Danish address book. (The publication of the address database in 2002 was the original experience that ultimately led to the Basic Data program).

The primary goal of the Basic Data program has always been government efficiency, by ensuring all layers of government use the same core data. However the Danish government has also always recognised the societal and economic potential of that same data for citizens and companies, and therefore opening up the Basic Data registers as much as possible was also a key ingredient from the start. Interestingly the business case for the Basic Data program was only built on projected government savings, and those projections erred on the side of caution. Any additional savings realised by government entities would remain with them, so there was a financial incentive for government agencies to find additional uses for the Basic Data registers. External benefits from re-use were not part of the business case, as they were rightly seen as hard to predict and hard to measure, but were also estimated (again erring on the side of caution). The projected savings for government were about 25 million Euro per year, and the projected external benefits some 65 million per year after completion of the system. Two years ago I transposed these Danish (as well as Dutch and other international) experiences with building an open national data infrastructure this way for the Swiss government, as part of a study with the FH Bern (PDF of some first insights presented at the 2016 Swiss open data conference in Lausanne).

Danish media this week reported new impact numbers from the geodata that has been made available. Geodata became freely available early 2013 as part of the Basic Data program. In 2017 the geodata saw over 6 billion requests for data, a 45% increase from 2016. Government research estimates the total gains in efficiency and productivity from using geodata for 2016 at some 470 million Euro (3.5 billion Danish Kroner). This is about 5 times the total of savings and benefits originally projected annually for the entire system back in 2012 (25 million savings, and 65 million in benefits).
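The "about 5 times" can be checked with a few lines of arithmetic. The figures are the ones quoted above; the variable names are just labels for this illustration, and the implied exchange rate is simply what follows from the two reported amounts.

```python
# Figures as quoted in the text, in Euro per year unless noted otherwise.
projected_savings = 25_000_000
projected_benefits = 65_000_000
estimated_gains_2016 = 470_000_000
estimated_gains_2016_dkk = 3_500_000_000

projected_total = projected_savings + projected_benefits
ratio = estimated_gains_2016 / projected_total
implied_dkk_per_eur = estimated_gains_2016_dkk / estimated_gains_2016

print(f"Projected total: {projected_total / 1e6:.0f} million Euro")
print(f"Estimated gains are {ratio:.1f} times the projection")
print(f"Implied exchange rate: {implied_dkk_per_eur:.2f} DKK per Euro")
```

So the 2016 estimate for geodata alone is a little over five times the 90 million Euro that was projected for the whole system.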

It once again shows how there really is no rational case for selling government data, as the benefits that accrue from removing all access barriers will be much larger. This also means that government revenue will actually grow, as increased tax revenue will outstrip both lost revenue from data sales and costs of providing data. A timely and pertinent example from Denmark, now that I am researching the potential impact of open data for the Serbian government.

Last month 27-year-old Slovak journalist Jan Kuciak was murdered, together with his fiancée Martina Kušnírová. As an investigative journalist, collaborating with the OCCRP, he regularly submitted freedom of information (FOI) requests. His recent work concerned organised crime and corruption, specifically Italian organised crime infiltrating Slovak society. His colleagues now suspect that his name and details of what he was researching have been leaked to those he was researching by way of his FOI requests, and that this made him a target. The murder of Kuciak has led to protests in Slovakia, and the Interior Minister resigned last week because of it, and [update] this afternoon the Slovakian Prime Minister resigned as well. (In late 2016 the PM referred to journalists as ‘dirty anti-Slovak prostitutes‘ in the context of anti-corruption journalism and activism.)

There is no EU, or wider European, standard approach to FOI. The EU regulations for re-use of government information (open data) for instance merely say they build on the local FOI regime. In some countries stating your name and stating your interest (the reason you’re asking) is mandatory, in others one or both aren’t. In the Netherlands it isn’t necessary to state an interest, and not mandatory to disclose who you are (although for obvious reasons you do need to provide contact details to receive an answer). In practice it can be helpful, in order to get a positive decision more quickly, to state your own name and explain why you’re after certain information. That also seems to be what Jan Kuciak did, which may have allowed his investigative targets to find out about him. In various instances, especially where a FOI request concerns someone else, those others may be contacted to get consent for publication. Dutch FOI law contains such a provision, as does e.g. Serbian law concerning the anticorruption agency. Norway has a tit-for-tat mechanism built into their public income and tax database. You can find out the income and tax of any Norwegian, but only by allowing your interest to be disclosed to the person whose tax filings you’re looking at.

I agree with Helen Darbishire, who heads Access Info Europe and says the EU should set a standard that prevents requesters from being forced to disclose their identity, as that potentially undermines a fundamental right, and that ensures requesters’ identities are safeguarded by the governments processing those requests. Access Info called upon the European Parliament to act, in an open letter signed by many other organisations.

This is the presentation I gave at the Open Belgium 2018 Conference in Louvain-la-Neuve this week, titled ‘The role and value of data inventories, a key step towards mature data governance’. The slides are embedded further below, and as PDF download at grnl.eu/in. It’s a long read (some 3000 words), so I’ll start with a summary.

Summary, TL;DR

The quality of information management in local governments is often lacking.
Things like security, openness and privacy are safeguarded by putting separate fences for each around the organisation, but those safeguards lack detailed insight into data structures and effective corresponding processes. As archiving, security, openness and privacy in a digitised environment are basically inseparable, doing ‘everything by design’ is the only option. The only effective way is doing everything at the level of the data itself. Fences are inefficient and ineffective, and the GDPR, due to its obligations, will show how the privacy fence fails, forcing organisations to act. Only doing data governance for privacy is senseless; doing it also for openness, security and archiving at the same time is logical. Having good detailed inventories of your data holdings is a useful instrument to start asking the hard questions, and have meaningful conversations. It additionally allows local government to deploy open or shared data as a policy instrument, and releasing the inventory itself will help articulate civic demand for data. We’ve done a range of these inventories with local government.

1: High time for mature data governance in local and regional government

High time! (clock in Louvain-la-Neuve)

Digitisation changes how we look at things like openness, privacy, security and archiving, as it creates new affordances now that the content and its medium have become decoupled. It creates new forms of usage, and new needs to manage those. As a result archivists, for example, find they now need to be involved at the very start of digital information processes, whereas earlier their work would basically start when the boxes of papers were delivered to them.

The reality is that local and regional governments have barely begun to fully embrace and leverage the affordances that digitisation provides them with. It shows in how most of them deal with information security, openness and privacy: by building three fences.

Security is mostly interpreted as keeping other people out, so a fence is put between the organisation and the outside world. Inside it nothing much is changed. Similarly a second fence is put in place for determining openness. What is open can reach the outside world, and the fence is there to do the filtering. Finally privacy is also dealt with by a fence, either around the entire organisation or a specific system, keeping unwanted eyes out. All fences are a barrier between outside and in, and within the organisation usually no further measures are taken. All three fences exist separately from each other, as stand alone fixes for their singular purpose.

The first fence: security
In the Netherlands a ‘baseline information security’ standard applies to local governments, and it determines what information should be regarded as business critical. Something is business critical if its downtime will stop public service delivery, or if its lack of quality has immediate negative consequences for decision making (e.g. decisions on benefits impacting citizens). Uptime and downtime are mostly about IT infrastructure, dependencies and service level agreements, and those fit the fence tactic quite well. Quality in the context of security is about ensuring data is tamper free, doing audits, input checks, and knowing sources. That requires a data-centric approach, and it doesn’t fit the fence-around-the-organisation tactic.


The second fence: openness
Openness of local government information is mostly on request, or at best a process separate from regular operational routines. Yet the stated end game is that everything should be actively open by design, meaning everything that can be made public will be published the moment it is publishable. We also see that open data is becoming infrastructure in some domains. The implementation of the digitisation of the law on public spaces requires all involved stakeholders to have the same (access to) information. Many public sector bodies, both local ones and central ones like the cadastral office, have concluded that doing that through open data is the most viable way. For both the desired end game and using open data as infrastructure the fence tactic is however very inefficient.
At the same time the data sovereignty of local governments is under threat. They increasingly collaborate in networks or outsource part of their processes. In most contracts there is no attention paid to data, other than in generic terms in the general procurement conditions. We’ve come across a variety of examples where this results in 1) governments not being able to provide data to citizens, even though by law they should be able to, 2) governments not being able to access their own data, only the resulting graphs and reports, or 3) the slowest partner in a network determining the speed of disclosure. In short, the fence tactic is also ineffective. A more data-centric approach is needed.

The third fence: personal data protection
Mostly privacy is being dealt with by identifying privacy sensitive material (but not what, where and when), and locking it down by putting up the third fence. The new EU privacy regulation, the GDPR, which will be enforced from May this year, is seen as a source of uncertainty by local governments. It is also responded to in the accustomed way: reinforcing the fence, by making a ‘better’ list of what personal data is used within the organisation, but still not paying much attention to processes, nor to the shape and form of the personal data.
However in the case of the GDPR, if it is indeed really enforced, this will not be enough.

GDPR an opportunity for ‘everything by design’
The GDPR confers rights to the people described by data, like the right to review, to portability, and to be forgotten. It also demands compliance is done ‘by design’, and ‘state of the art’. This can only be done by design if you are able to turn the rights of the GDPR into queries on your data, and have (automated) processes in place to deal with requests. It cannot be done with a ‘better’ fence. In the case of the GDPR, the first data related law that takes the affordances of digitisation as a given, the fence tactic is set to fail spectacularly. This makes the GDPR a great opportunity to move to a data focus not just for privacy by design, but to do openness, archiving and information security (in terms of quality) by design at the same time, as they are converging aspects of the same thing and can no longer be meaningfully separated. Detailed knowledge about your data structures then is needed.
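To make "turning the rights of the GDPR into queries on your data" a bit more tangible, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is hypothetical: the two tables, the `PERSONAL_DATA` mapping that records where person-related data lives, and the `review` and `forget` functions. It also leaves out what a real process needs, such as verifying the requester's identity and respecting legal retention obligations before deleting anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE permits (permit_id INTEGER PRIMARY KEY,
                          person_id INTEGER REFERENCES persons(person_id),
                          description TEXT);
    INSERT INTO persons VALUES (1, 'A. Example'), (2, 'B. Example');
    INSERT INTO permits VALUES (10, 1, 'parking permit'), (11, 2, 'market stall');
""")

# The detailed knowledge about data structures, written down as metadata:
# which tables hold person-related data, and which column identifies the person.
PERSONAL_DATA = {"persons": "person_id", "permits": "person_id"}


def review(conn, person_id):
    """Right to review / portability: collect everything held on one person."""
    # Table and column names come from the trusted mapping above, never from user input.
    return {
        table: conn.execute(
            f"SELECT * FROM {table} WHERE {column} = ?", (person_id,)
        ).fetchall()
        for table, column in PERSONAL_DATA.items()
    }


def forget(conn, person_id):
    """Right to be forgotten: delete dependent rows first, then the person."""
    for table, column in reversed(list(PERSONAL_DATA.items())):
        conn.execute(f"DELETE FROM {table} WHERE {column} = ?", (person_id,))
    conn.commit()


print(review(conn, 1))
```

The point of the sketch is that the request handling itself is trivial once the mapping of where personal data sits exists. Building and maintaining that mapping is the actual work.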

Local governments inadvertently admit fence-tactic is failing
Governments already clearly, yet indirectly, admit that the fences don’t really work as a tactic.
Local governments have been loudly complaining for years about the feared costs of compliance, concerning both openness and privacy. Drilling down into those complaints reveals that the feared costs concern the time and effort involved in e.g. dealing with requests. Because there’s only a fence, and usually no processes or detailed knowledge of the data they hold, every request becomes an expedition for answers. If local governments had detailed insight into the data structures, data content, and systems in use, the cost of compliance would be zero or at least indistinguishable from the rest of operations. Dealing with a request would be nothing more than running a query against their systems.

Complaints about compliance costs are essentially an admission that governments do not have their house in order when it comes to data.
The interviews I did with various stakeholders as part of the evaluation of the PSI Directive confirm this: the biggest obstacle stakeholders perceive to being more open and to realising impact with open data is the low quality of information systems and processes. It blocks fully leveraging the affordances digitisation brings.

Towards mature data governance, by making inventory
What is needed is changing tactics: doing away with the three fences, and focusing on having detailed knowledge of your data. That means combining what now are separate and disconnected activities (information security, openness, archiving and personal data protection) into ‘everything by design’. Basically it means turning all you know about your data into metadata that becomes part of your data, so that it will be easy to see which parts of a specific data set contain what type of person related data, which data fields are public, which subset is business critical, which records have third party rights attached, or which records need to be deleted after a specific amount of time. Don’t man the fences, where every check is always extra work, but let the data be able to tell exactly what is or isn’t possible, allowed, meant or needed. Getting there starts with making an inventory of what data a local or regional government currently holds, and describing the data in detailed operational, legal and technological terms.
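As a hedged illustration of metadata that travels with the data, the sketch below describes each field of a hypothetical permits data set with a few of the aspects mentioned above. The `FieldMeta` class, the field names and the 7 year retention period are all invented for the example; a real inventory would record many more facets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldMeta:
    name: str
    personal: bool = False             # contains person related data
    public: bool = False               # may be published
    business_critical: bool = False    # needed for public service delivery
    third_party_rights: bool = False   # others hold rights on this content
    retention_years: Optional[int] = None  # delete after this period, if set


# Hypothetical description of a permits data set.
PERMITS = [
    FieldMeta("permit_id", public=True, business_critical=True),
    FieldMeta("location", public=True),
    FieldMeta("applicant_name", personal=True, retention_years=7),
    FieldMeta("site_photo", third_party_rights=True),
]


def publishable(fields):
    """Fields that can go out as open data without any further check."""
    return [f.name for f in fields
            if f.public and not f.personal and not f.third_party_rights]


def personal_fields(fields):
    """Fields that any privacy request or assessment has to look at."""
    return [f.name for f in fields if f.personal]


print(publishable(PERMITS))
print(personal_fields(PERMITS))
```

With descriptions like these in place, questions about openness, privacy, security and archiving become lookups on the same metadata instead of four separate checks at four separate fences.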

Mature digital data governance: all aspects about the data are part of the data, allowing all processes and decisions access to all relevant material in determining what’s possible.

2: Ways local government data inventories are useful

Inventories are a key first step in doing away with the ineffective fences and towards mature data governance. Inventories are also useful as an instrument for several other purposes.

Local is where you are, but not the data pros
There’s a clear reason why local governments don’t have their house in order when it comes to data.
Most of our lives are local. The streets we live on, the shopping center we frequent, the schools we attend, the spaces we park in, the quality of life in our neighbourhood, the parks we walk our dogs in, the public transport we use for our commutes. All those acts are local.
Local governments have a wide variety of tasks, reflecting the variety of our acts. They hold a corresponding variety of data, connected to all those different tasks. Yet local governments are not data professionals. Unlike single-task, data heavy national government bodies, like the Cadastre, the Meteo institute or the department for motor vehicles, local governments usually don’t have the capacity or capability. As a result local governments mostly don’t know their own data, and don’t have established effective processes that build on that data knowledge. Inventories are a first step. Inventories point to where contracts, procurement and collaboration lead to loss of needed data sovereignty. Inventories also allow determining what, from a technology perspective, is a smooth transition path to the actively open by design end-game local governments envision.

Open data as a policy instrument
Where local governments want to use the data they have as a way to enable others to act differently or in support of policy goals, they need to know in detail which data they hold and what can be done with it. Using open data as a policy instrument means creating new connections between stakeholders around a policy issue, by putting the data into play. To be able to see which data could be published to engage certain stakeholders, you first need to know what you have, what it contains, and in which shape you have it.

Better articulated citizen demands for data
Making public a list of what you have is also important here, as it invites new demand for your data. It allows people to be aware of what data exists, and contemplate if they have a use case for it. If a data set hasn’t been published yet, its existence is discoverable, so they can request it. It also enables local government to extend the data they publish based on actual demand, not assumed demand or blindly. This increases the likelihood data will be used, and increases the socio-economic impact.

Emerging data
More and more new data is emerging, from sensor networks in public and private spaces. This way new stakeholders and citizens are becoming agents in the public space, where they meet up with local governments. New relationships, and new choices result. For instance the sensor in my garden measuring temperature and humidity is part of the citizen-initiated Measure your city network, but also an element in the local government’s climate change adaptation policies. For local governments as regulators, as guardians of public space, as data collectors, and as a source of transparency, this is a rebalancing of their position. It again takes knowing what data you own and how it relates to and complements what others collect and own. Only then is a local government able to weave a network with those stakeholders that connects data into valuable agency for all involved. (We’ve built a guidance tool, in Dutch, for the role of local government with regard to sensors in public spaces)

Having detailed data inventories is a way for local governments to start having the right conversations on all these points.

3: Getting to inventories

To create useful and detailed inventories, as I and my colleagues did for half a dozen local governments, some elements are key in my view. We looked at structured data collections only, so disregarded the thousands of individual once-off spreadsheets. They are not irrelevant, but obscure the wood for the trees. Then we scored all those data sets on up to 80(!) different facets, concerning policy domain, internal usage, current availability, technical details, legal aspects, and concerns etc. A key element in doing that is not making any assumptions:

  • don’t assume your list of applications will tell you what data you have. Not all your listed apps will be used, others won’t be on the list, and none of it tells you in detail what data actually is processed in them, just a generic pointer
  • don’t assume information management knows it all, as shadow information processes will exist outside of their view
  • don’t assume people know when you ask them how they do their work, as their description and rationalisation of their acts will not match up with reality; let them also show you
  • don’t assume people know the details of the data they work with, sit down with them and look at it together
  • don’t assume what it says on the tin is correct, as you’ll find things that don’t belong there (we’ve e.g. found domestic abuse data in a data set on litter in public spaces)

Doing an inventory well means

  • diving deeply into which applications are actually used,
  • talking to every unit in the organisation about their actual work and seeing it being done,
  • looking closely at data structures and real data content,
  • looking closely at current metadata and its quality,
  • separately looking at large projects and programs as they tend to have their own information systems,
  • going through external communications as it may refer to internally held data not listed elsewhere,
  • looking at (procurement and collaboration) contracts to determine what claims others might have on data,
  • and then cross-referencing it all, and bringing it together in one giant list, scored on up to 80 facets.

Another essential part, especially to ensure the resulting inventory will be used as an instrument, is ensuring from the start the involvement and buy-in of the various parts of local government that usually are islands (IT, IM, legal, policy departments, archivists, domain experts, data experts). That way the inventory becomes something that is used to ask a variety of detailed questions.

Bring the islands together. (photo Dmitry Teslya CC-BY)

We’ve followed various paths to do inventories, sometimes on our own as an external team, sometimes in close cooperation with a client team, sometimes as a guide for a client team while their operational colleagues do the actual work. All three yield very useful results, but there’s a balance to strike between consistency and accuracy, the amount of feasible buy-in, and the way the hand-over is planned, so that the inventory becomes an instrument in future data discussions.

What comes out as raw numbers is itself often counter-intuitive to local government. Some 98% of data typically held by Dutch Provinces can be public, although usually only some 20% is made public (15% open data, usually geo-data). At local level the numbers are a bit different, as local governments hold much more person related data (concerning social benefits for instance, chronic care, and the persons register). About 67% of local data could be public, but only some 5% usually is. This means there’s still a huge gap between what can be open, and what is actually open. That gap is basically invisible if a local government deploys the three fences, and as a consequence they run on assumptions and overestimate the amount that needs the heaviest protection. The gap becomes visible from looking in-depth at data on all pertinent aspects by doing an inventory.

(Interested in doing an inventory of the data your organisation holds? Do get in touch.)

This week, as part of the Serbian open data week, I participated in a panel discussion, talking about international developments and experiences. A first round of comments was about general open data developments, the second round was focused on how all of that plays out on the level of local governments. This is one part of a multi-posting overview of my speaking notes.

Citizen generated data and sensors in public space

As local governments are responsible for our immediate living environment, they are also the ones most confronted with the rise in citizen generated data, and the increase in the use of sensors in our surroundings.

Where citizens generate data this can be both a clash as well as an addition to professional work with data.
A clash in the sense that citizen measurements may provide a counter argument to government positions. That the handful of sensors a local government might employ show that noise levels are within regulations, does not necessarily mean that people don’t subjectively or objectively experience it quite differently and bring the data to support their arguments.
An addition in the sense that sometimes authorities cannot measure something within accepted professional standards. The Dutch institute for environment and the Dutch meteo-office don’t measure temperatures in cities because there is no way to calibrate them (as too many factors, like heat radiance of buildings, are in play). When citizens measure those temperatures and there’s a large enough number of those sensors, then trends and patterns in those measurements are however of interest to those government institutions. The exact individual measurements are still of uncertain quality, but the relative shifts are a new layer of insight. With the decreasing prices of sensors and hardware needed to collect data there will be more topics for which citizen generated data will come into existence. The Measure Your City project in my home town, for which I have an Arduino-based sensor kit in my garden, is an example.

There’s a lot of potential for valuable usage of sensor data in our immediate living environment, whether citizen generated or by corporations or local government. It does mean though that local governments need to become much more aware than currently of the (unintended) consequences these projects may have. Local government needs to be extremely clear on their own different roles in this context. They are the rule-setter, the one to safeguard our public spaces, the instigator or user, and any or all of those at the same time. It needs an acute awareness of how to translate that into the way local government enters into contracts, sets limits, collaborates, and provides transparency about what exactly is happening in our shared public spaces. A recent article in the Guardian on the ‘living laboratories’ using sensor data in Dutch cities such as Utrecht, Eindhoven, Enschede and Assen shines a clear light on the type of ethical, legal and technical awareness needed. My company has recently created a design and thinking tool (in Dutch) for local governments to balance these various roles and responsibilities. This ties back to my previous point of local governments not being data professionals, and is a lack of expertise that needs to be addressed.

Local open data may need national data coordination

Using local open data effectively may well require that specific types of local data are available for an entire country or at least a region. Where e.g. real time parking data is useful even if it exists just for one city, for other data the interest lies in being able to make comparisons. Local spending data is much more interesting if you can compare with similar sized cities, or across all local communities. Similarly public transport data gains in usefulness if it also shows the connection with regional or national public transport. For other topics like performance metrics, maintenance, and quality of public service this is true as well.

This is why in the Netherlands you see various regional initiatives where local governments join forces to provide data across a wider geographic area. In Fryslan the province, capital city of the province and the regional archive collaborate on providing one data platform, and are inviting other local governments to join. Similarly in Utrecht, North-Holland and Flevoland regional and local authorities have been collaborating in their open data efforts. For certain types of data, e.g. the real estate valuations that are used to decide local taxes, the data is combined into a national platform.

Seen from a developer’s perspective this is often true as well: if I want to build a city app that incorporates many different topics and thus data, local data is fine on its own. If I want to build something that is topic specific, e.g. finding the nearest playground, or the quality of local schools, then being able to scale it to national level may well be needed to make the application a viable proposition, regardless of the fact that the users of such an application are all only interested in one locality.

A different form of this national-local interaction is also visible. Several local governments are providing local subsets of national data sets on their own platforms, so they can be found and adopted more easily by locally interested stakeholders. An example would be for a local government to take the subset of the Dutch national address and buildings database pertaining to their own jurisdiction only. This large data source is already open and contains addresses, and also the exact shapes of all buildings. This is likely to be very useful on a local level, and by providing a ready-to-use local subset local government saves potential local users the effort of finding their way in the enormous national data source. In that way they make local re-use more likely.
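Cutting such a local subset is technically a small step. The sketch below shows the idea on a few made-up records; the field names and values are hypothetical and do not reflect the real structure of the national address and buildings database, which a real extract would have to follow.

```python
# Made-up records standing in for a national data set; field names are hypothetical.
NATIONAL = [
    {"address": "Example Street 1", "municipality": "Municipality A"},
    {"address": "Example Street 2", "municipality": "Municipality A"},
    {"address": "Sample Lane 5", "municipality": "Municipality B"},
]


def local_subset(records, municipality):
    """Keep only the records that fall within one municipality."""
    return [r for r in records if r["municipality"] == municipality]


subset = local_subset(NATIONAL, "Municipality A")
print(len(subset), "of", len(NATIONAL), "records are local")
```

The value for local users is not in the filtering itself, but in not having to find and understand the national source before they can start.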

Local outreach is key: open data as a policy instrument

Outreach to potential users of open data is needed, to see open data being adopted. Open data can help people and groups to change the way they do things or make decisions. It is a source of agency. Only where such agency is realized does open data create the promised value.

When local governments realize you can do this on purpose, then open data becomes a policy instrument. By releasing specific data, and by reaching out to specific stakeholders to influence behavior, open data is just as much a policy instrument as is setting regulations or providing subsidies and financing. This also means the effort and cost of open data initiatives are no longer seen as a non-crucial addition to the IT budget, but get to be compared to the costs of other interventions in the policy domain where they are used. Then you e.g. compare the effort of publishing real time parking data with measures like blocking specific roads, setting delivery windows, or placing traffic lights, as they are all part of a purposeful effort to reduce inner city traffic. In these comparisons it becomes clear how cheap open data efforts really are.

To deploy open data as a policy instrument, the starting point is to choose specific policy tasks, and around that reach out to external stakeholders to figure out what these stakeholders need to collaboratively change behaviours and outcomes.
E.g. providing digital data on all the different scenarios for the redesign of a roundabout or busy crossing allows well informed discussions with people living near that crossing, and allows the comparison of different perspectives. In the end this reduces the number of complaints in the planning phase, increases public support for projects and can cut planning and execution time by months.

These types of interventions result in public savings and better public service outcomes, as well as in increased trust between local stakeholders and government.