The Netherlands has the lushest and tastiest grass in the world according to discerning geese, and millions flock to Dutch fields because of it. Farmers would rather use the grass for their dairy cows, and don’t like the damage the geese cause to their fields. To reduce the damage, geese are scared away, their nests are spiked, and they are hunted. Each year some 80,000 geese are shot in the Province South-Holland alone. The issue is that the Dutch don’t eat much wild goose, and hunters don’t like to hunt if they know the game won’t be eaten. The role of the provincial government in the case of these geese is to compensate farmers for damage to their fields.

“All your base belong to us…”, Canada geese in a Dutch field (photo Jac Janssen, CC-BY)

In our open data work with the Province South-Holland we’re looking for opportunities where data can be used to increase the agency of both the province itself and external stakeholders. Part of that is talking to those stakeholders to better understand their work, the things they struggle with, and how that relates to the policy aims of the province.

So a few days ago, my colleague Rik and I met up on a farm outside Leiden, in the midst of those grass fields that the geese love, with several hunters, a civil servant, and the CEO of Hollands Wild, which sells game meat to both restaurants and retail. We discussed the particular issues of hunting geese (and inspected some recently shot ones), the effort of dressing game, and the difficulties of cultivating demand for geese. Although a goose fetches a hunter just 25 cents, butchering geese is very labour intensive and not automated, which means that consumable meat is very expensive. Too expensive for low end use (e.g. in pet food), and even for high end use, where it needs to compete with much more popular types of game, such as hare, venison and wild duck. We tasted some marinated raw goose meat and goose carpaccio. Data isn’t needed to improve communication between stakeholders on the production side (unless a market emerges for fresh game, in contrast to the current distribution of only frozen products), but might play a role in the distribution part of the supply chain.

Today with the little one I sought out a local shop that carries Hollands Wild’s products. I bought some goose meat, and tonight we enjoyed some cold smoked goose. One goose down, 79,999 to go.


As part of my work for a Dutch regional government, I was asked to compare the open data offerings of the 12 provinces. I wanted to use something that levels the playing field for all parties compared and prevents me comparing apples to oranges, so opted for the Dutch national data portal as a source of data. An additional benefit of this is that the Dutch national portal (a CKAN instance) has a well defined API, and uses standardised vocabularies for the different government entities and functions of government.
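To give an idea of what querying such a CKAN instance involves, here is a minimal sketch that pages through the standard CKAN `package_search` action. The base URL and the example filter query are my assumptions for illustration, not something taken from the portal's documentation, so check both before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: location of the portal's CKAN action API. Verify before use.
BASE_URL = "https://data.overheid.nl/data/api/3/action/package_search"


def search_url(filter_query: str, start: int = 0, rows: int = 100) -> str:
    """Build the URL for one page of a CKAN package_search request."""
    params = {"fq": filter_query, "start": start, "rows": rows}
    return f"{BASE_URL}?{urlencode(params)}"


def page_starts(total: int, rows: int = 100) -> list[int]:
    """Offsets needed to page through `total` search results."""
    return list(range(0, total, rows))


def fetch_all(filter_query: str, rows: int = 100) -> list[dict]:
    """Collect the metadata records of all datasets matching the filter."""
    with urlopen(search_url(filter_query, 0, rows)) as response:
        first = json.load(response)["result"]
    datasets = list(first["results"])
    # The first page is already in hand, so skip offset 0.
    for start in page_starts(first["count"], rows)[1:]:
        with urlopen(search_url(filter_query, start, rows)) as response:
            datasets.extend(json.load(response)["result"]["results"])
    return datasets
```

A call could look like `fetch_all("organization:example-province")`, where the field name and value in the filter are hypothetical; the portal's own vocabularies determine what to filter on.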

I am interested in openness, findability, completeness, re-usability, and timeliness. For each of those I tried to pick something available through the API, that can be a proxy for one or more of those factors.

The following aspects seemed most useful:

  • openness: use of open licenses
  • findability: are datasets categorised consistently and accurately so they can be found through the policy domains they pertain to
  • completeness: does a province publish across the entire spectrum of a) national government’s list of policy domains, and b) across all 7 core tasks as listed by the association of provincial governments
  • completeness: does a province publish more than just geographic data (most of their tasks are geo-related, but definitely not all)
  • re-usability: in which formats do provinces publish, and are these a) open standards, b) machine readable, c) structured data

I could not establish a useful proxy for timeliness, as all the timestamps available through the API of the national data portal actually represent processes (when the last automatic update ran), and contain breaks (the platform was updated late last year, and all timestamps were from after that update).

Provinces publish data in three ways, and the API of the national portal makes the source of a dataset visible:

  1. they publish geographic data to the Dutch national geographic register (NGR), from which metadata is harvested into the Dutch open data portal. It used to be that only openly licensed data was harvested, but since November last year data under closed licenses is harvested into the national portal as well. It seems this is done by design, but this major shift has not been communicated at all.
  2. they publish non-geographic data to dataplatform.nl, a CKAN platform provided as a commercial service to government entities to host open data (as the national portal only registers metadata, and isn’t storing data). Metadata is automatically harvested into the national portal.
  3. they upload metadata directly to the national portal by hand, pointing to specific data sources online elsewhere (e.g. the API of an image library)

Most provinces only publish through the National Geo Register (NGR). Last summer I blogged about that in more detail, and nothing has really changed since then.

I measured the mentioned aspects as follows:

  • openness: a straight count of openly licensed data sets. It is national policy to use public domain, CC0 or CC-BY, and this is reflected in what provinces do. So no need to distinguish between open licenses, just between open and not-openly licensed material
  • findability: it is mandatory to categorise datasets, but voluntary to add more than one category, with a maximum of 3. I looked at the average number of categories per dataset for each province. One only categorises with one term, some consistently provide more complete categorisation, where most end up in between those two.
  • completeness: looking at those same categories, a total of 22 different ones were used. I also looked at how many of those 22 each province uses. As all their tasks are similar, the extent to which they cover all used categories is a measure for how well they publish across their spectrum of tasks. Additionally provinces have self-defined 7 core tasks, to which those categories can be mapped. So I also looked at how many of those 7 are covered. There are big differences in the breadth of scope of what provinces publish.
  • completeness: while some 80% of all provincial data is geo-data and 20% non-geographic, less than 1% of open data is non-geographic data. Looking at which provinces publish non-geographic data, I used the source of it (i.e. not from the NGR), and did a quick manual check on the nature of what was published (as it was just 22 data sets out of over 3000, this was still easily done by hand).
  • re-usability: for all provinces I polled the formats in which data sets are published. Data sets can be published in multiple formats. I judged all formats used on being a) open standards, b) machine readable, c) structured data. Formats that matched all 3 got 3 points, those that were machine readable and structured but not open standards got 1 point, and those that were not structured or not machine readable got no points. I then divided the number of points by the total number of data formats they used. This way you get a score of at most 3, and the closer you get to 3, the more of your data matches the open definition.
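The re-usability score in the last bullet can be sketched as follows. The judgments for the three example formats are illustrative only and are not the ones I used in the comparison, and I assume here that every format a province uses counts once.

```python
from typing import NamedTuple


class FormatTraits(NamedTuple):
    open_standard: bool
    machine_readable: bool
    structured: bool


def format_points(traits: FormatTraits) -> int:
    """3 points if all three traits match, 1 if only the open standard
    is missing, 0 if the format is not machine readable or not structured."""
    if not (traits.machine_readable and traits.structured):
        return 0
    return 3 if traits.open_standard else 1


def reusability_score(formats: list[FormatTraits]) -> float:
    """Total points divided by the number of formats used (0.0 to 3.0)."""
    if not formats:
        return 0.0
    return sum(format_points(t) for t in formats) / len(formats)


# Illustrative judgments, not the ones from the actual comparison.
example_formats = {
    "CSV": FormatTraits(open_standard=True, machine_readable=True, structured=True),
    "XLS": FormatTraits(open_standard=False, machine_readable=True, structured=True),
    "PDF": FormatTraits(open_standard=True, machine_readable=False, structured=False),
}

score = reusability_score(list(example_formats.values()))
print(round(score, 2))  # (3 + 1 + 0) / 3 = 1.33
```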

As all this is based on the national portal’s API, getting the data and calculating scores can be automated as an ongoing measurement, to build a time series of e.g. monthly checks to track development. My process contained only one manual action (concerning non-geo data), and that too could be automated, followed at most by a quick manual inspection.
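As an illustration of what one step of such an automated run might look like, the two category-based measures described above (average number of categories per dataset, and the share of all used categories a province covers) can be computed from a simple list of records. The input shape and the sample values are hypothetical; real records would first have to be reduced to this form.

```python
from collections import defaultdict


def category_metrics(datasets, all_categories):
    """Per province: average categories per dataset and category coverage.

    `datasets` is an iterable of (province, [categories]) pairs, a
    hypothetical shape chosen for this sketch.
    """
    all_categories = set(all_categories)
    dataset_count = defaultdict(int)
    category_total = defaultdict(int)
    categories_used = defaultdict(set)
    for province, categories in datasets:
        dataset_count[province] += 1
        category_total[province] += len(categories)
        categories_used[province].update(categories)
    return {
        province: {
            "avg_categories": category_total[province] / dataset_count[province],
            "coverage": len(categories_used[province] & all_categories)
            / len(all_categories),
        }
        for province in dataset_count
    }


# Made-up sample data, only to show the calculation.
sample = [
    ("Province A", ["nature"]),
    ("Province A", ["nature", "traffic"]),
    ("Province B", ["economy"]),
]
metrics = category_metrics(sample, ["nature", "traffic", "economy", "culture"])
print(metrics["Province A"])  # {'avg_categories': 1.5, 'coverage': 0.5}
```

Running this once a month and appending the output to a file would give the time series mentioned above.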

In terms of results (which have now first been communicated to our client), what becomes visible is that some provinces score high on a single measure, and it is easy to spot who has (automated) processes in place for one or more of the aspects looked at. Also interesting is that the overall best scoring province is not the best scoring on any of the aspects, but high enough on all to have the highest average. It’s also a province that spent quite a lot of work on all steps (internally and publication) of the chain that leads to open data.

Dutch Provinces publish open data, but it always looks like it is mostly geo-data, and hardly anything else. When talking to provinces I also get the feeling they struggle to think of data that isn’t of a geographic nature. That isn’t very surprising: a lot of the public tasks carried out by provinces have to do with spatial planning, nature and environment, and geographic data is a key tool for them. But now that we are aiding several provinces with extending their data provision, I wanted to find out in more detail.

My colleague Niene took the API of the Dutch national open data portal for a spin, and made a list of all datasets listed as stemming from a province.
I took that list and zoomed in on various aspects.

At first glance there are strong differences between the provinces: some publish a lot, others hardly anything. The Province of Utrecht publishes everything twice to the national data portal, once through the national geo-register, once through their own dataplatform. The graph below has been corrected for it.
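A correction like that can be approximated with a simple de-duplication pass. The sketch below matches on province plus a normalised title, which is my assumption for illustration and not necessarily how the counts were actually corrected; the `province` and `title` keys are hypothetical too.

```python
def dedupe(datasets):
    """Keep the first dataset for each (province, normalised title) pair."""
    seen = set()
    unique = []
    for dataset in datasets:
        # Lower-case and collapse whitespace so trivial differences still match.
        title = " ".join(dataset["title"].lower().split())
        key = (dataset["province"], title)
        if key not in seen:
            seen.add(key)
            unique.append(dataset)
    return unique


# Made-up records: the same dataset harvested through two routes.
records = [
    {"province": "Utrecht", "title": "Fietsroutes", "source": "ngr"},
    {"province": "Utrecht", "title": "fietsroutes ", "source": "dataplatform"},
    {"province": "Drenthe", "title": "Fietsroutes", "source": "ngr"},
]
print(len(dedupe(records)))  # 2
```

Titles that differ between the two publication routes would slip through, so a quick manual check of the result remains useful.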

What explains those differences? And what is the nature of the published datasets?

Geo-data is dominant
First I made a distinction between data that stems from the national geo-register to which all provinces publish, and data that stems from another source (either regional dataplatforms, or for instance direct publication through the national open data portal). The NGR is theoretically the place where all provinces share geo-data with other government entities, part of which is then marked as publicly available. In practice the numbers suggest Provinces publish to the NGR in roughly the same proportions as the graph above (meaning that of what they publish in the NGR they mark about the same percentage as open data).

  • Of the over 3000 datasets that are published by provinces as open data in the national open data portal, only 48 don’t come from the national geo-register. This is about 1.5%.
  • Of the 12 provinces, 4 do not publish anything outside the NGR: Noord-Brabant, Zeeland, Flevoland, Overijssel.

Drenthe stands out in terms of numbers of geo-data sets published, over 900. A closer look at their list shows that they publish more historic data, and that they seem to be more complete (more of what they share in the NGR is apparently marked for open data). The average is between 200 and 300, with provinces like Zuid-Holland, Noord-Holland, Gelderland, Utrecht, Groningen, and Fryslan in that range. Overijssel, like Drenthe, publishes more, though at about 500 less than Drenthe. This seems to be the result of a direct connection to the NGR from their regional geo-portal, and thus publishing by default. Overijssel deliberately does not publish historic data, explaining some of the difference with Drenthe. (When something is updated in Overijssel the previous version is automatically removed. This clashes with open data good practice, but is currently hard to fix in their processes.)

If it isn’t geo, it hardly exists
Of the mere 48 data sets outside the NGR, just 22 (46%) are not geo-related. Overall this means that less than 1% of all open data provinces publish is not geo-data.
Of those 22, exactly half are published by Zuid-Holland alone. They for instance publish several photo-archives, a subsidy register, politician’s expenses, and formal decisions.
Fryslan is the only province publishing an inventory of their data holdings, which is 1 of their only 3 non geo-data sets.
Gelderland stands out as the single province that publishes all their geo data through the NGR, hinting at a neatly organised process. Their non-NGR open data is also all non-geo (as it should be). They publish 27% of all open non-geo data by provinces, and together with Zuid-Holland account for 77% of it all.
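For those who want to retrace the percentages, here is the arithmetic as a small sketch. The total of 3000 is a lower bound ("over 3000"), so the shares of the total are upper bounds, and the count of 6 for Gelderland is inferred from the 27% mentioned above rather than stated directly.

```python
def pct(part: float, whole: float) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100 * part / whole


total_open = 3000            # "over 3000", so a lower bound
non_ngr = 48                 # datasets not coming from the NGR
non_geo = 22                 # of those, not geo-related
zuid_holland = non_geo // 2  # "exactly half"
gelderland = 6               # inferred: 27% of 22, rounded

print(f"{pct(non_ngr, total_open):.1f}%")  # 1.6%
print(f"{pct(non_geo, non_ngr):.1f}%")     # 45.8%
print(f"{pct(non_geo, total_open):.1f}%")  # 0.7%
print(f"{pct(gelderland, non_geo):.1f}%")  # 27.3%
print(f"{pct(zuid_holland + gelderland, non_geo):.1f}%")  # 77.3%
```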

Taking these numbers and comparing them to inventories like the one Fryslan publishes (which we made for them in 2016), and the one for Noord-Holland (which we did in 2013), the dominance of geo-data is not surprising in itself. Roughly 80% of data provinces hold is geo related. Just about a fifth to a quarter of this geo-data (15%-20% of the total) is on average published at the moment, yet it makes up over 99% of all provincial open data published. This lopsidedness means that hardly anything on the inner workings of a province, the effectiveness of policy implementation etc. is available as open data.

Where the opportunities are
To improve both on the volume and on the breadth of scope of the data provinces publish, two courses of action stand open.
First, extending the availability of geo-data provinces hold. Most provinces will have a clear process for this, so it should be relatively easy to do, and it should be possible for most provinces to get to where Drenthe currently is.
Second, take a much closer look at the in-house data that is not geo-related. About 20% of data holdings fall in this category, and based on the inventories we did, some 90% of that should be publishable, maybe after some aggregation or other adaptations.
The lack of an inventory is an obstacle here, but existing inventories should at least be able to point the other provinces in the right direction.

Make the provision of provincial open geodata complete, embrace its dominance and automate it with proper data governance. Focus your energy on publishing ‘the rest’ where all the data on the inner workings of the province is. Provinces perpetually complain nobody is aware of what they are doing and their role in Dutch governance. Make it visible, publish your data. Stop making yourself invisible behind a stack of maps only.

(A Dutch version of this blog post is available at The Green Land.)

Last week ten of the twelve Dutch Provinces met at the South-Holland Provincial government to discuss open data, and exchange experiences, seeking to inspire each other to do more on open government data. I participated as part of my roles as open data project lead for both the Province of Overijssel, and the Province Fryslân.

There were several topics of discussion.

  • The National Open Government Action Plan (part of the OGP effort), a new version of which is due next spring, and for which input is currently sought by the Dutch government.
  • A proposal by the team behind the national open data platform to form a ‘high value data list’ for provincial data sets.
  • Several examples were discussed of (open) data being used to enhance public interaction.

I want to briefly show those examples (and might blog about the other two later).

Make it usable, connect to what is really of significance to people
Basically the three examples that were presented during the session present two lessons:

1) Make data usable, by presenting it better and allowing for more interaction. That way you more or less take up a position half-way between what is/was common (presenting only abstracted information), and open data (the raw detailed data): presenting data in a much more detailed way, and making it possible for others to interact with the data and explore.

2) Connect to what people really care about. It is easy to assume what others would want to know or would need in terms of data. It is less easy to actually go outside and first listen to people and entrepreneurs about what type of data they need around specific topics. However, it does provide lots of vital clues as to what data will actually find usage, and what type of questions people want to be able to solve for themselves.

That second point is something we always stress in our work with governments, so I was glad to hear it presented at the session.

There were three examples presented.

South-Holland put subsidies on a map
The Province of South-Holland made a map that shows where subsidies are provided and for what. It was made to better present to the public the data that exists about subsidies, also in order to stimulate people to dive deeper into the data. The map links to where the actual underlying data should be found (but as far as I can tell, the data isn’t actually provided there). A key part of the presentation was about the steps they took to make the data presentable in the first place, and how they created a path for doing that which can be re-used for other types of data they are seeking to house in their newly created data warehouse. This way presenting other data sources in similar ways will be less work.


The subsidy map

Gelderland provides insight into their audit-work
Provinces have a task in auditing municipal finances. The Province of Gelderland has used an existing tool (normally used for presenting statistical data) to provide more detail about the municipal finances they audited. Key point here again was to show how to present data better to the public, how that plays a role in communicating with municipalities as well, and how it provides stepping stones to entice people to dive deeper. The tool they use provides download links for the underlying data (although the way that is done can still be significantly improved, as it currently only allows downloads of selections you made, so you’d have to stitch them back together to reconstruct the full data set).



Screenshot of the Gelderland audit data tool

Flevoland listens first, then publishes data
The last example presented was much less about the data, and much more about the ability to really engage with citizens, civil society and businesses and to stimulate the usage of open data that way. The Province Flevoland is planning major renovation work on bridges and water locks in the coming years, and their aim is to reduce hindrance. Therefore they are already now, before work starts, having conversations with various people that live near or regularly pass by the objects that will be renovated, to hear what type of data might help them keep the disruption of their normal routines to a minimum. Resulting insights are that where currently plans are published in a generic way, much more specific localized data is needed, as well as much more detailed data about what is going to happen in a few days’ time. This allows people to be flexible, such as a farmer deciding to harvest a day later, or to move the harvest away over water and not the road. Detailed data also means communicating small changes and delays in the plans. Choosing the right channels is important too. Currently e.g. the Province announces construction works on Twitter, but no local farmer goes there for information. They do use a specific platform for farmers where they also get detailed data about weather, water etc., and distributing localized data on construction works there would be much more useful. So now they will collaborate with that platform to reach farmers better. (My company The Green Land is supporting the Province, 2 municipalities and the water board in the province in this project.)


Overview of the 16 bridges and waterlocks that will be renovated in the coming years


Various stakeholders around each bridge or waterlock are being approached