The Netherlands has the lushest and tastiest grass in the world according to discerning geese, and millions flock to Dutch fields because of it. Farmers would rather use the grass for their dairy cows, and don’t like the damage the geese cause to their fields. To reduce the damage, geese are scared away, their nests are spiked, and they are hunted. Each year some 80.000 geese are shot in the Province South-Holland alone. The issue is that the Dutch don’t eat much wild goose, and hunters don’t like to hunt if they know the game won’t be eaten. The role of the provincial government in the case of these geese is that it compensates farmers for damage to their fields.

“All your base belong to us…”, Canada geese in a Dutch field (photo Jac Janssen, CC-BY)

In our open data work with the Province South-Holland we’re looking for opportunities where data can be used to increase the agency of both the province itself and external stakeholders. Part of that is talking to those stakeholders to better understand their work, the things they struggle with, and how that relates to the policy aims of the province.

So a few days ago, my colleague Rik and I met up on a farm outside Leiden, in the midst of those grass fields that the geese love, with several hunters, a civil servant, and the CEO of Hollands Wild, a company that sells game meat to both restaurants and retail. We discussed the particular issues of hunting geese (and inspected some recently shot ones), the effort of dressing game, and the difficulties of cultivating demand for geese. Although a goose fetches a hunter just 25 cents, butchering geese is very labour-intensive and not automated, which means that consumable meat is very expensive. It is too expensive for low end use (e.g. in pet food), and even for high end use, where it needs to compete with much more popular types of game, such as hare, venison and wild duck. We tasted some marinated raw goose meat and goose carpaccio. Data isn’t needed to improve communication between stakeholders on the production side (unless a market emerges for fresh game, in contrast to the current distribution of only frozen products), but might play a role in the distribution part of the supply chain.

Today with the little one I sought out a local shop that carries Hollands Wild’s products. I bought some goose meat, and tonight we enjoyed some cold smoked goose. One goose down, 79.999 to go.


Open Nederland has produced its first podcast. Sebastiaan ter Burg is the host, and Maarten Brinkerink handled production and music.

The Open Nederland podcast features people who share knowledge and creativity to build a fair, accessible and innovative world. This first episode is about openness in different domains, such as open government and open education, and how these connect to each other.

The guests in this episode are:

  • Wilma Haan, managing director of the Open State Foundation,
  • Jan-Bart de Vreede, domain manager for learning materials and metadata at Kennisnet, and
  • Maarten Zeinstra of Vereniging Open Nederland and Chapter Lead of Creative Commons Nederland.

(full disclosure: I am both a board member of Open Nederland and chair of the board of Open State Foundation, whose CEO Wilma Haan takes part in this podcast.)

Two years ago a colleague let their dog swim in a lake without paying attention to the information signs. It turned out the water was infested with a type of algae that caused the dog irritation. Since then my colleague has thought it would be great if you could somehow subscribe to notifications for when the quality or status of some nearby surface water changes.

Recently this colleague took a look at the provincial external communications concerning swimming waters. A provincial government has specific public tasks in designating swimming waters and monitoring their quality. It turns out there are six (6) public information or data sources concerning swimming waters in the particular province my colleague lives in.

My colleague compared those 6 datasets on a number of criteria: factual correctness, comparability based on an administrative index or key, and comparability on spatial / geographic aspects. Factual correctness here means whether the right objects have been represented in the data sets. Are the names, geographic location, status (safe, caution, unsafe) correct? Are details such as available amenities represented correctly everywhere?

A lake (photo by facemepls, license CC-BY)

As it turns out each of the 6 public data sets contains a different number of objects. The 6 data sets cannot be connected based on a unique key or ID. Slightly more than half of the swimming waters can be correlated across the 6 data sets by name, but a spatial/geographic connection isn’t always possible. 30% of swimming waters have the wrong status (safe/caution/unsafe) on the provincial website! And 13% of swimming waters are wrongly represented geometrically, meaning they end up in completely wrong locations and even municipalities on the map.
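To give an idea of what such a comparison by name looks like in practice, here is a minimal sketch. It is not the method my colleague used. The dataset labels and names are invented, and the only normalisation is lower-casing and collapsing whitespace, so real data would need more careful matching.

```python
from collections import Counter


def normalise(name: str) -> str:
    """Lower-case a name and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def name_overlap(datasets: dict[str, list[str]]) -> dict[str, object]:
    """Count objects per dataset and find the names that occur in every dataset."""
    name_sets = {
        label: {normalise(name) for name in names}
        for label, names in datasets.items()
    }
    # For every distinct name, count in how many datasets it appears.
    seen_in = Counter(name for names in name_sets.values() for name in names)
    in_every = {name for name, count in seen_in.items() if count == len(name_sets)}
    return {
        "objects_per_dataset": {label: len(names) for label, names in name_sets.items()},
        "distinct_names": len(seen_in),
        "in_every_dataset": sorted(in_every),
        "share_in_every_dataset": len(in_every) / len(seen_in) if seen_in else 0.0,
    }


# Invented example records, not the province's actual data.
example = {
    "designation": ["Zwemplas Noord", "Strandje Oost", "Meer West"],
    "monitoring": ["zwemplas  noord", "Strandje Oost"],
    "website": ["Zwemplas Noord", "Strandje Oost", "Recreatieplas Zuid"],
}
report = name_overlap(example)
```

A shared unique ID across the data sets would make this kind of fuzzy name matching unnecessary, which is exactly what is missing here.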

Every year at the start of the year the provincial government takes a decision which designates the public swimming waters. Yet the decision from this province cannot be found online (even though it was taken last February, and publication is mandatory). Only a draft decision can be found on the website of one of the municipalities concerned.

The differences in the 6 data sets are more or less reflective of the internal division of tasks of the province. Every department keeps its own files and its own dataset. One is responsible for designating public swimming waters, another for monitoring swimming water quality. Yet another makes sure those swimming waters are represented in overall public planning / environmental plans. Another handles the placement and location of information signs about the water quality, and still another places that same information on the website of the province. Every unit has its own task and keeps its own data set for it.

Which ultimately means large inconsistencies internally, and a confusing mix of information being presented to the public.

As part of my work for a Dutch regional government, I was asked to compare the open data offerings of the 12 provinces. I wanted to use something that levels the playing field for all parties compared and prevents me comparing apples to oranges, so opted for the Dutch national data portal as a source of data. An additional benefit of this is that the Dutch national portal (a CKAN instance) has a well defined API, and uses standardised vocabularies for the different government entities and functions of government.

I am interested in openness, findability, completeness, re-usability, and timeliness. For each of those I tried to pick something available through the API, that can be a proxy for one or more of those factors.

The following aspects seemed most useful:

  • openness: use of open licenses
  • findability: are datasets categorised consistently and accurately so they can be found through the policy domains they pertain to
  • completeness: does a province publish across the entire spectrum of a) national government’s list of policy domains, and b) across all 7 core tasks as listed by the association of provincial governments
  • completeness: does a province publish more than just geographic data (most of their tasks are geo-related, but definitely not all)
  • re-usability: in which formats do provinces publish, and are these a) open standards, b) machine readable, c) structured data

I could not establish a useful proxy for timeliness, as all the timestamps available through the API of the national data portal actually represent processes (when the last automatic update ran), and contain breaks (the platform was updated late last year, and all timestamps were from after that update).

Provinces publish data in three ways, and the API of the national portal makes the source of a dataset visible:

  1. they publish geographic data to the Dutch national geographic register (NGR), from which metadata is harvested into the Dutch open data portal. It used to be that only openly licensed data was harvested, but since November last year data under closed licenses is also being harvested into the national portal. It seems this is by design, but this major shift has not been communicated at all.
  2. they publish non-geographic data to dataplatform.nl, a CKAN platform provided as a commercial service to government entities to host open data (as the national portal only registers metadata, and isn’t storing data). Metadata is automatically harvested into the national portal.
  3. they upload metadata directly to the national portal by hand, pointing to specific data sources online elsewhere (e.g. the API of an image library)

Most provinces only publish through the National Geo Register (NGR). Last summer I blogged about that in more detail, and nothing has changed really since then.

I measured the mentioned aspects as follows:

  • openness: a straight count of openly licensed data sets. It is national policy to use public domain, CC0 or CC-BY, and this is reflected in what provinces do. So no need to distinguish between open licenses, just between open and not-openly licensed material
  • findability: it is mandatory to categorise datasets, but voluntary to add more than one category, with a maximum of 3. I looked at the average number of categories per dataset for each province. One only categorises with one term, some consistently provide more complete categorisation, where most end up in between those two.
  • completeness: looking at those same categories, a total of 22 different ones were used. I also looked at how many of those 22 each province uses. As all their tasks are similar, the extent to which they cover all used categories is a measure for how well they publish across their spectrum of tasks. Additionally provinces have self-defined 7 core tasks, to which those categories can be mapped. So I also looked at how many of those 7 are covered. There are big differences in the breadth of scope of what provinces publish.
  • completeness: while some 80% of all provincial data is geo-data and 20% non-geographic, less than 1% of open data is non-geographic data. Looking at which provinces publish non-geographic data, I used the source of it (i.e. not from the NGR), and did a quick manual check on the nature of what was published (as it was just 22 data sets out of over 3000, this was still easily done by hand).
  • re-usability: for all provinces I polled the formats in which data sets are published. Data sets can be published in multiple formats. I judged every format used on being a) an open standard, b) machine readable, c) structured data. Formats that matched all 3 got 3 points, formats that were machine readable and structured but not an open standard got 1 point, and formats that weren’t structured or machine readable got no points. I then divided the number of points by the total number of data formats a province used. This way you get a score of at most 3, and the closer you get to 3, the more of your data matches the open definition.
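The re-usability score is the most formula-like of these measures, so here is a minimal sketch of it. The ratings in `TRAITS` are illustrative assumptions of mine for this example, not the full list of judgements I made, and I assume here that each distinct format counts once per province.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatTraits:
    open_standard: bool
    machine_readable: bool
    structured: bool


# Illustrative ratings only, given as an assumption for this sketch.
TRAITS = {
    "CSV": FormatTraits(True, True, True),
    "JSON": FormatTraits(True, True, True),
    "XLS": FormatTraits(False, True, True),
    "PDF": FormatTraits(True, False, False),
}


def format_points(traits: FormatTraits) -> int:
    """3 points for all three traits, 1 for machine readable and structured only, else 0."""
    if traits.machine_readable and traits.structured:
        return 3 if traits.open_standard else 1
    return 0


def reusability_score(formats_used: list[str]) -> float:
    """Total points divided by the number of distinct formats a province uses."""
    distinct = {fmt.upper() for fmt in formats_used}
    if not distinct:
        return 0.0
    unknown = FormatTraits(False, False, False)  # unrated formats score 0
    total = sum(format_points(TRAITS.get(fmt, unknown)) for fmt in distinct)
    return total / len(distinct)
```

A province publishing in CSV, XLS and PDF would get (3 + 1 + 0) / 3, so about 1.33, while one publishing only CSV and JSON would get the maximum of 3.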

As all this is based on the national portal’s API, getting the data and calculating scores can be automated as an ongoing measurement, building a time series of e.g. monthly checks to track development. My process contained only one manual action (concerning non-geo data), and that too could be automated, followed at most by a quick manual inspection.
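A rough sketch of what such an automated monthly check could look like follows. CKAN’s `package_search` action with its `fq`, `rows` and `start` parameters is part of the standard CKAN API, but everything specific to the Dutch portal here is an assumption to verify against its documentation: the base URL, the licence identifiers in `OPEN_LICENSES`, and the `theme` field holding a dataset’s categories. The filter value you pass in is hypothetical too.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the portal's CKAN action API.
BASE_URL = "https://data.overheid.nl/data/api/3/action/package_search"
# Assumed identifiers for open licences as they appear in the metadata.
OPEN_LICENSES = {"cc-zero", "cc-by", "publicdomain"}


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_datasets(
    filter_query: str,
    fetch: Callable[[str], dict] = fetch_json,
    rows: int = 100,
) -> Iterator[dict]:
    """Page through package_search results for one filter query."""
    start = 0
    while True:
        params = urlencode({"fq": filter_query, "rows": rows, "start": start})
        result = fetch(f"{BASE_URL}?{params}")["result"]
        yield from result["results"]
        start += rows
        if start >= result["count"] or not result["results"]:
            break


def snapshot(datasets: list[dict], month: str) -> dict:
    """Compute one row of the time series for one province."""
    total = len(datasets)
    open_count = sum(1 for d in datasets if d.get("license_id") in OPEN_LICENSES)
    # "theme" as the category field is an assumption about the portal's schema.
    categories = [d.get("theme") or [] for d in datasets]
    distinct = {category for cats in categories for category in cats}
    return {
        "month": month,
        "datasets": total,
        "open_share": open_count / total if total else 0.0,
        "avg_categories": sum(len(c) for c in categories) / total if total else 0.0,
        "categories_used": len(distinct),
    }
```

Passing the `fetch` function in as a parameter keeps the calculation testable without touching the network. Run once a month per province and append each `snapshot` to a file, and you have the time series.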

In terms of results (which now have been first communicated to our client), what becomes visible is that some provinces score high on a single measure, and it is easy to spot who has (automated) processes in place for one or more of the aspects looked at. Also interesting is that the overall best scoring province is not the best scoring on any of the aspects but high enough on all to have the highest average. It’s also a province that spent quite a lot of work on all steps (internally and publication) of the chain that leads to open data.

Granularity (photo by Emily, license: CC-BY-NC)

A client, after their previous goal of increasing the volume of open data provided, is now looking to improve data quality. One element in this is increasing the level of detail of the already published data. They asked for input on how one can approach and define granularity. I formulated some thoughts for them as input, which I am now posting here as well.

Data granularity in general is the level of detail a data set provides. This granularity can be thought of in two dimensions:
a) whether a combination of data elements in the set is presented in one field or split out into multiple fields: atomisation
b) the relative level of detail the data in a set represents: resolution

On Atomisation
Improving this type of granularity can be done by looking at the structure of a data set itself. Are there fields within a data set that can be reliably separated into two or more fields? Common examples are separating first and last names, zipcodes and cities, streets and house numbers, organisations and departments, or keyword collections (tags, themes) into single keywords. This allows for more sophisticated queries on the data, as well as more ways it can potentially be related to or combined with other data sets.

For currently published data sets improving this type of granularity can be done by looking at the existing data structure directly, or by asking the provider of the data set if they have combined any fields into a single field when they created the dataset for publication.

This type of granularity increase changes the structure of the data but not the data itself. It improves the usability of the data, without improving the use value of the data. The data in terms of information content stays the same, but does become easier to work with.
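As an illustration of atomisation, here is a small sketch that splits a combined postcode-and-city field and a keyword collection into separate fields. The field names `postcode_city` and `keywords`, the semicolon separator, and the example record are all hypothetical. The pattern assumes Dutch postcodes (four digits, two letters).

```python
import re

# Dutch postcode (four digits, optional space, two letters) followed by a city name.
POSTCODE_CITY = re.compile(r"^\s*(\d{4})\s?([A-Za-z]{2})\s+(.+?)\s*$")


def split_postcode_city(value: str):
    """Return separate postcode and city, or None if the value can't be split reliably."""
    match = POSTCODE_CITY.match(value)
    if match is None:
        return None
    digits, letters, city = match.groups()
    return {"postcode": f"{digits} {letters.upper()}", "city": city}


def split_keywords(value: str, separator: str = ";") -> list[str]:
    """Split a keyword collection into single keywords, dropping empty entries."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def atomise(record: dict) -> dict:
    """Return a copy of the record with the combined fields split out."""
    out = {k: v for k, v in record.items() if k not in ("postcode_city", "keywords")}
    parts = split_postcode_city(record["postcode_city"])
    if parts is None:
        out["postcode_city"] = record["postcode_city"]  # keep the original if unsure
    else:
        out.update(parts)
    out["keywords"] = split_keywords(record["keywords"])
    return out
```

Note that the split keeps the original value whenever the pattern doesn’t match: atomisation is only an improvement where fields can be separated reliably.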

On Resolution
Resolution can have multiple components such as: frequency of renewal, time frames represented, geographic resolution, or splitting categories into sub-categories or multilevel taxonomies. An example is how one can publish average daily temperature in a region. Let’s assume it is currently published monthly with one single value per day. Resolution of such a single value can be increased in multiple ways: publish the average daily temperature daily, not monthly; split up the average daily temperature for the region into average daily temperature per sensor in that region (geographic resolution); or split up the daily average per sensor into hourly actual readings, or even more frequent ones. The highest resolution would be publishing real-time individual sensor readings continuously.
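The temperature example can be made concrete by running it in the opposite direction: start from the detailed readings and see what each aggregation step throws away. This is only a sketch with invented sensor IDs and readings.

```python
from collections import defaultdict
from statistics import mean

# Invented readings: (sensor id, date, hour, temperature in degrees Celsius).
readings = [
    ("s1", "2019-05-01", 0, 8.0),
    ("s1", "2019-05-01", 12, 14.0),
    ("s2", "2019-05-01", 0, 10.0),
    ("s2", "2019-05-01", 12, 16.0),
    ("s1", "2019-05-02", 0, 9.0),
    ("s1", "2019-05-02", 12, 15.0),
]


def daily_per_sensor(rows):
    """Average the hourly readings into one value per sensor per day."""
    grouped = defaultdict(list)
    for sensor, day, _hour, temperature in rows:
        grouped[(sensor, day)].append(temperature)
    return {key: mean(values) for key, values in grouped.items()}


def daily_for_region(per_sensor):
    """Average the per-sensor values into one value per day for the whole region."""
    grouped = defaultdict(list)
    for (_sensor, day), temperature in per_sensor.items():
        grouped[day].append(temperature)
    return {day: mean(values) for day, values in grouped.items()}


per_sensor = daily_per_sensor(readings)
per_region = daily_for_region(per_sensor)
```

Each step is easy to compute from the one before it, but not the other way around. A re-user can always aggregate detailed data, but can never recover the hourly readings from a regional average.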

Improving resolution can only be done in collaboration with the holder of the actual source of the data. What level of improvement can be attained is determined by:

  1. The level of granularity and frequency at which the data is currently collected by the data holder
  2. The level of granularity or aggregation at which the data is used by the data holder for their public tasks
  3. The level of granularity or aggregation at which the data meets professional standards.

Item 1 provides an absolute limit to what can be done: what isn’t collected cannot be published. Usually, however, data is not used internally in the exact form it was collected either. In terms of access to information the practical limit to what can be published is usually the way that data is available internally for the data holder’s public tasks. Internal systems and IT choices are usually shaped accordingly. Generally data holders can reliably provide data at the level of Item 2, because that is what they work with themselves.

However, there are reasons why data sometimes cannot be publicly provided the same way it is available to the data holder internally. These can be reasons of privacy or common professional standards. For instance energy companies have data on energy usage per household, but in the Netherlands such data is aggregated to groups of at least 10 households before publication because of privacy concerns. National statistics agencies comply with international standards concerning how data is published for external use. Census data for instance will never be published in the way it was collected, but only at various levels of aggregation.
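To show what a minimum group size means for the published data, here is a sketch that aggregates household values into groups of at least 10. It is not the procedure energy companies actually follow. It assumes the households are already in a meaningful order (e.g. along a street), and the usage figures are whatever you pass in.

```python
def aggregate_min_group(values: list[float], min_size: int = 10) -> list[dict]:
    """Aggregate consecutive households into groups of at least min_size.

    A remainder smaller than min_size is merged into the last full group.
    If there are fewer than min_size households in total, nothing is published.
    """
    if len(values) < min_size:
        return []
    groups = [values[i:i + min_size] for i in range(0, len(values), min_size)]
    if len(groups[-1]) < min_size:
        remainder = groups.pop()
        groups[-1].extend(remainder)
    return [
        {
            "households": len(group),
            "total_kwh": sum(group),
            "mean_kwh": sum(group) / len(group),
        }
        for group in groups
    ]
```

With 25 households this yields one group of 10 and one of 15, so no published figure ever describes fewer than 10 households.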

Discussions on the desired level of resolution need to be in collaboration with potential re-users of the data, not just the data holders. At what point does data become useful for different or novel types of usage? When is it meeting needs adequately?

Together with data holders and potential data re-users the balance needs to be struck between re-use value and considerations of e.g. privacy and professional standards.

This type of granularity increase changes the content of the data. It improves the usage value of the data as it allows new types of queries on the data, and enables more nuanced contextualisation in combination with other datasets.

This article is a good description of the Freedom of Information (#foia #opengov #opendata) situation in the Balkans. I recognise lots of what is described here from my work in the region, such as in Serbia, where I have encountered various institutions willing to take evasive action to prevent the release of information.

In essence this is not all that different from what (decentral) government entities in other European countries do as well. Many of them still see increased transparency and access as a distraction absorbing work and time they’d rather spend elsewhere. Yet, there’s a qualitative difference in the level of obstruction. It’s the difference between acknowledging there is a duty to be transparent but being hesitant, and not believing that there’s such a duty in governance at all.

Secrecy, sometimes in combination with corruption, has a long and deep history. In Central Asia for instance I encountered an example that the number of agricultural machines wasn’t released, as a 1950’s Soviet law still on the books marked it as a state secret (because tractors could be mobilised in case of war). More disturbingly such state secrecy laws are abused to tackle political opponents in Central Asia as well. When a government official releases information based on a transparency regulation, or as part of policy implementation, political opponents might denounce them for giving away state secrets and take them to court risking jail time even.

There is a strong effort to increase transparency visible in the Balkan region as well. Both inside government, as well as in civil society. Excellent examples exist. But it’s an ongoing struggle between those seeing power as its own purpose and those seeking high quality governance. We’ll see steps forward, backwards, rear guard skirmishes and a mixed bag of results for a long time. Especially there where there are high levels of distrust amongst the wider population, not just towards government but towards each other.

One such excellent example is the work of the Serbian information commissioner Sabic. Clearly seeing his role as an ombudsman for the general population, he and his office led by example during the open data work I contributed to in the past years, by publishing statistics on information requests, complaints and answer times, and by publishing a full list of all Serbian institutions that fall under the remit of the Commission for Information of Public Importance and Personal Data Protection. This last point is key, as some institutions will simply stall requests by stating transparency rules do not apply to them. Mr. Sabic’s term ended at the end of last year. A replacement for his position hasn’t been announced yet, which is both a testament to Mr. Sabic’s independent role as information commissioner, and to the risk of forces less inclined towards transparency trying to get a much less independent successor.

Bookmarked Right to Know: A Beginner’s Guide to State Secrecy / Balkan Insight by Dusica Pavlovic (Balkan Insight)

Governments in the Balkans are chipping away at transparency laws to make it harder for journalists and activists to hold power to account.