Open Nederland has produced its first podcast. Sebastiaan ter Burg is the host, and Maarten Brinkerink did the production and music.

The Open Nederland podcast features people who share knowledge and creativity to build a fair, accessible and innovative world. This first episode is about openness in different domains, such as open government and open education, and how these connect to each other.

The guests in this episode are:

  • Wilma Haan, general director of the Open State Foundation,
  • Jan-Bart de Vreede, domain manager learning materials and metadata at Kennisnet, and
  • Maarten Zeinstra of Vereniging Open Nederland and Chapter Lead of Creative Commons Nederland.

(Full disclosure: I am both a board member of Open Nederland and chairman of the board of the Open State Foundation, whose CEO Wilma Haan takes part in this podcast.)

Where German Easter fires burn on Saturday evening, Dutch Easter fires burn on Easter Sunday. So this Easter Monday morning it’s time to look at the second spike of PM10 pollution in the air. The smell in the garden is as strong as yesterday.

The sensor grid shows a much more muted picture this morning. First the same sensors as I looked at yesterday.

Ter Apel (on the German border, with its own fire on Sunday evening, and an extreme reading after the German fires) shows twice the norm. Still a high outlier, but it pales in comparison to the reading of five times the norm a day earlier. The peak also dissipates more quickly.

Upwind from us, in the Flevo polders, it is a similar picture, a less distinct peak than yesterday but still well above twice the norm.

And near us in Utrecht the readings are actually about the same as yesterday. That matches my perception that the smell around our house is about the same as yesterday. It also implies that though yesterday's fires were much closer, they were perhaps fewer in number (some were cancelled due to drought) or lower in intensity, or they weren't actually as neatly upwind from us as the German fires and passed to the south of us.

The latter seems to be borne out by readings from some of the other sensors.
First Eibergen, on the border between the Twente and Achterhoek regions, an area with lots of Easter fires.

Eibergen shows a higher peak due to the Sunday fires than the day before, yet both peaks are in the same range at 2 to 2.5 times the norm.

South and east of the region we see similar patterns.
In Nijmegen, further south, the peak is higher than the day before, because it was not downwind of many German fires.

On the Veluwe, which is further east and closer to us, the peak is again lower than the day before yet still distinct.

Overall the pollution of Sunday’s fires is less visible across the Netherlands. Where Saturday’s fires made sensors go into the red from the north-eastern border, southwesterly across the country to Amsterdam, for Sunday’s fires such a clear corridor doesn’t show.

It’s only morning on Easter Sunday, but apparently in Germany, over 160 kilometers away, Easter fires have been burning on Saturday evening. This morning we woke up to a distinct smell of burning outside (and not just of the wood burning type of smell, also plastics). Dutch Easter fires usually burn on Easter Sunday, not the evening before. So we looked up if there had been a nearby fire, but no, it’s Easter fires from far away.

The national air quality sensor grid documents the spike in airborne particles clearly.
First a sensor near where E’s parents live, on the border with Germany.

A clear PM10 spike starts on Saturday evening, and keeps going throughout the night. It tops out at well over 200 microgram per cubic meter of air at 6 am this morning, or over 5 times the annual average norm deemed acceptable.

The second graph below is on a busy road in Utrecht, about 20 mins from here, and 180 kilometers from the previous sensor. The spike starts during the night, when the wind has finally blown the smoke here, and is at just over 80 microgram per cubic meter of air at 8 am, or double the annual average norm deemed acceptable.
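
For context: the annual average norm referred to here is the EU limit of 40 microgram per cubic meter for PM10. A minimal sketch of how one could express raw readings as multiples of that norm (the readings below are illustrative, not the actual sensor values):

```python
# Express PM10 readings as multiples of the EU annual average norm
# of 40 microgram per cubic meter. Values below are illustrative.
PM10_ANNUAL_NORM = 40.0  # microgram per cubic meter

hourly_readings = {  # hour -> reading in microgram per cubic meter
    "04:00": 120.5,
    "05:00": 180.2,
    "06:00": 215.8,  # a peak like the one described above: over 5x the norm
    "07:00": 150.3,
}

for hour, value in hourly_readings.items():
    factor = value / PM10_ANNUAL_NORM
    print(f"{hour}: {value:6.1f} µg/m³ = {factor:.1f}x the annual norm")
```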

This likely isn’t the peak value yet, as a sensor reading upwind from us shows readings still rising at 9 am:

On a map the sensor points show how the smoke is coming from the north east. The red dot at the top right is Ter Apel, the first sensor reading shown above, the other red points moving west and south have their peaks later or are still showing a rise in PM10 values.

The German website luftdaten.info also shows nicely how the smoke from the part of Germany between Oldenburg and the Dutch border, to the north-east of us, is moving across the Netherlands.

The wind isn’t going to change much, so tomorrow the smell will likely be worse, as by then all the Easter fires from Twente will have burnt as well, adding their emissions to the mix.

Two years ago a colleague let their dog swim in a lake without paying attention to the information signs. It turned out the water was infested with a type of algae that caused the dog irritation. Since then my colleague thought it would be great if you could somehow subscribe to notifications for when the quality or status of nearby surface water changes.

Recently this colleague took a look at the provincial external communications concerning swimming waters. A provincial government has specific public tasks in designating swimming waters and monitoring their quality. It turns out there are six (6) public information or data sources concerning swimming waters from the particular province my colleague lives in.

My colleague compared those 6 datasets on a number of criteria: factual correctness, comparability based on an administrative index or key, and comparability on spatial / geographic aspects. Factual correctness here means whether the right objects have been represented in the data sets. Are the names, geographic location, status (safe, caution, unsafe) correct? Are details such as available amenities represented correctly everywhere?

Als ze me missen, ben ik vissen ('If they miss me, I've gone fishing')
A lake (photo by facemepls, license CC-BY)

As it turns out each of the 6 public data sets contains a different number of objects. The 6 data sets cannot be connected based on a unique key or ID. Slightly more than half of the swimming waters can be correlated across the 6 data sets by name, but a spatial/geographic connection isn’t always possible. 30% of swimming waters have the wrong status (safe/caution/unsafe) on the provincial website! And 13% of swimming waters are wrongly represented geometrically, meaning they end up in completely wrong locations and even municipalities on the map.
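
For illustration, correlating such datasets without a shared key typically comes down to matching on normalised names and falling back to spatial checks. A minimal sketch under that assumption, with hypothetical example records:

```python
# Correlate two swimming water datasets by normalised name, since no
# shared unique key or ID exists. Example records are hypothetical.

def normalise(name: str) -> str:
    """Lowercase and drop punctuation/whitespace, so small spelling
    differences do not prevent a match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

website_status = {"Grote Plas": "safe", "Zwemmeer Oost": "caution"}
monitoring_status = {"grote plas": "caution", "Zwemmeer-Oost": "caution"}

web_index = {normalise(k): (k, v) for k, v in website_status.items()}
for name, status in monitoring_status.items():
    match = web_index.get(normalise(name))
    if match is None:
        print(f"{name}: no match by name, needs a spatial check")
    elif match[1] != status:
        print(f"{name}: status differs (website: {match[1]}, monitoring: {status})")
    else:
        print(f"{name}: consistent")
```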

Every year at the start of the year the provincial government takes a decision which designates the public swimming waters. Yet the decision from this province cannot be found online (even though it was taken last February, and publication is mandatory). Only a draft decision can be found on the website of one of the municipalities concerned.

The differences in the 6 data sets are more or less reflective of the internal division of tasks within the province. Every department keeps its own files and its own dataset. One is responsible for designating public swimming waters, another for monitoring swimming water quality. Yet another for making sure those swimming waters are represented in overall public planning / environmental plans. Another for the placement and location of information signs about the water quality, and still another for placing that same information on the website of the province. Every unit has its own task and keeps its own data set for it.

Which ultimately means large inconsistencies internally, and a confusing mix of information being presented to the public.

As part of my work for a Dutch regional government, I was asked to compare the open data offerings of the 12 provinces. I wanted to use something that levels the playing field for all parties compared and prevents me comparing apples to oranges, so I opted for the Dutch national data portal as the data source. An additional benefit is that the Dutch national portal (a CKAN instance) has a well defined API, and uses standardised vocabularies for the different government entities and functions of government.

I am interested in openness, findability, completeness, re-usability, and timeliness. For each of those I tried to pick something available through the API that can serve as a proxy for one or more of those factors.

The following aspects seemed most useful:

  • openness: use of open licenses
  • findability: are datasets categorised consistently and accurately so they can be found through the policy domains they pertain to
  • completeness: does a province publish across the entire spectrum of a) national government’s list of policy domains, and b) across all 7 core tasks as listed by the association of provincial governments
  • completeness: does a province publish more than just geographic data (most of their tasks are geo-related, but definitely not all)
  • re-usability: in which formats do provinces publish, and are these a) open standards, b) machine readable, c) structured data

I could not establish a useful proxy for timeliness, as all the timestamps available through the API of the national data portal actually represent processes (when the last automatic update ran), and contain breaks (the platform was updated late last year, and all timestamps were from after that update).

Provinces publish data in three ways, and the API of the national portal makes the source of a dataset visible:

  1. they publish geographic data to the Dutch national geographic register (NGR), from which metadata is harvested into the Dutch open data portal. It used to be that only openly licensed data was harvested, but since November last year closed-licensed data is also being harvested into the national portal. This seems to be by design, but such a major shift has not been communicated at all.
  2. they publish non-geographic data to dataplatform.nl, a CKAN platform provided as a commercial service to government entities to host open data (as the national portal only registers metadata, and isn’t storing data). Metadata is automatically harvested into the national portal.
  3. they upload metadata directly to the national portal by hand, pointing to specific data sources online elsewhere (e.g. the API of an image library)

Most provinces only publish through the National Geo Register (NGR). Last summer I blogged about that in more detail, and nothing has changed really since then.
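
As an aside, CKAN's action API makes such comparisons straightforward to automate. A minimal sketch of counting datasets per publishing organisation; the base URL and the organisation slug are assumptions for illustration, and the exact organisation names depend on the portal's configuration:

```python
# Count datasets per organisation on a CKAN portal via its action API.
# Base URL and organisation slug below are illustrative assumptions.
import json
import urllib.parse
import urllib.request

BASE = "https://data.overheid.nl/data"  # assumed CKAN base URL

def dataset_count(organization: str) -> int:
    query = urllib.parse.urlencode({
        "fq": f"organization:{organization}",  # filter on publisher
        "rows": 0,  # we only need the count, not the dataset records
    })
    url = f"{BASE}/api/3/action/package_search?{query}"
    with urllib.request.urlopen(url) as response:
        result = json.load(response)
    return result["result"]["count"]

# Example usage (slug is hypothetical):
# print(dataset_count("provincie-utrecht"))
```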

I measured the mentioned aspects as follows:

  • openness: a straight count of openly licensed data sets. It is national policy to use public domain, CC0 or CC-BY, and this is reflected in what provinces do. So no need to distinguish between open licenses, just between open and not-openly licensed material
  • findability: it is mandatory to categorise datasets, but voluntary to add more than one category, with a maximum of 3. I looked at the average number of categories per dataset for each province. One province only categorises with a single term, some consistently provide more complete categorisation, and most end up in between those two.
  • completeness: looking at those same categories, a total of 22 different ones were used. I also looked at how many of those 22 each province uses. As all their tasks are similar, the extent to which they cover all used categories is a measure of how well they publish across their spectrum of tasks. Additionally provinces have self-defined 7 core tasks, to which those categories can be mapped. So I also looked at how many of those 7 are covered. There are big differences in the breadth of scope of what provinces publish.
  • completeness: while some 80% of all provincial data is geo-data and 20% non-geographic, less than 1% of their open data is non-geographic. To see which provinces publish non-geographic data, I used the source of a dataset (i.e. not from the NGR), and did a quick manual check on the nature of what was published (as it was just 22 data sets out of over 3000, this was still easily done by hand).
  • re-usability: for all provinces I polled the formats in which data sets are published. Data sets can be published in multiple formats. All used formats I judged on being a) open standards, b) machine readable, c) structured data. Formats that matched all 3 got 3 points, formats that matched machine readable and structured but not open standards got 1 point, and formats that matched neither structure nor machine readability got no points. I then divided the number of points by the total number of data formats used (a sketch follows below). This way you get a score of at most 3, and the closer you get to 3, the more of your data matches the open definition.
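
A minimal sketch of that scoring rule; the classification of individual formats here is illustrative, not the list used in the actual comparison:

```python
# Score data formats as described above: open standard + machine
# readable + structured = 3 points; machine readable and structured
# but not an open standard = 1 point; otherwise 0 points. The final
# score is total points divided by the number of formats used.

# (open standard, machine readable, structured) per format; the
# classifications below are illustrative.
FORMAT_CRITERIA = {
    "csv": (True, True, True),
    "json": (True, True, True),
    "xlsx": (False, True, True),
    "pdf": (False, False, False),
}

def reusability_score(formats_used: list[str]) -> float:
    points = 0
    for fmt in formats_used:
        open_std, machine, structured = FORMAT_CRITERIA[fmt]
        if open_std and machine and structured:
            points += 3
        elif machine and structured:
            points += 1
    return points / len(formats_used)  # 3.0 is the maximum

print(reusability_score(["csv", "json", "xlsx", "pdf"]))  # -> 1.75
```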

As all this is based on the national portal's API, getting the data and calculating scores can be automated as an ongoing measurement, to build a time series of e.g. monthly checks to track development. My process only contained one manual action (concerning non-geo data), but even that could be automated, followed up at most with a quick manual inspection.

In terms of results (which have now first been communicated to our client), what becomes visible is that some provinces score high on a single measure, and it is easy to spot which ones have (automated) processes in place for one or more of the aspects looked at. Also interesting is that the overall best scoring province is not the best scoring on any single aspect, but high enough on all of them to have the highest average. It's also a province that has spent quite a lot of work on all steps (internal and publication) of the chain that leads to open data.

Granularity - legos, crayons, and more (photo by Emily, license: CC-BY-NC)

A client, after their previous goal of increasing the volume of open data provided, is now looking to improve data quality. One element in this is increasing the level of detail of the already published data. They asked for input on how one can approach and define granularity. I formulated some thoughts for them as input, which I am now posting here as well.

Data granularity in general is the level of detail a data set provides. This granularity can be thought of in two dimensions:
a) whether a combination of data elements in the set is presented in one field or split out into multiple fields: atomisation
b) the relative level of detail the data in a set represents: resolution

On Atomisation
Improving this type of granularity can be done by looking at the structure of a data set itself. Are there fields within a data set that can be reliably separated into two or more fields? Common examples are separating first and last names, zipcodes and cities, streets and house numbers, organisations and departments, or keyword collections (tags, themes) into single keywords. This allows for more sophisticated queries on the data, as well as more ways it can potentially be related to or combined with other data sets.

For currently published data sets improving this type of granularity can be done by looking at the existing data structure directly, or by asking the provider of the data set if they have combined any fields into a single field when they created the dataset for publication.

This type of granularity increase changes the structure of the data but not the data itself. It improves the usability of the data, without improving the use value of the data. The data in terms of information content stays the same, but does become easier to work with.
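
A minimal sketch of what atomisation looks like in practice; the record layout is hypothetical, and real data would need more robust parsing:

```python
# Atomisation: split combined fields into separate ones, enabling more
# precise queries and joins. Record layout is hypothetical, and real
# names and addresses are messier than this simple parsing assumes.
record = {"name": "Jan de Vries", "address": "Stationsweg 12, 3511 EG Utrecht"}

first, _, last = record["name"].partition(" ")
street_part, _, city_part = record["address"].partition(", ")
street, _, house_number = street_part.rpartition(" ")
zipcode, city = city_part[:7], city_part[8:]

atomised = {
    "first_name": first,           # 'Jan'
    "last_name": last,             # 'de Vries'
    "street": street,              # 'Stationsweg'
    "house_number": house_number,  # '12'
    "zipcode": zipcode,            # '3511 EG'
    "city": city,                  # 'Utrecht'
}
print(atomised)
```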

On Resolution
Resolution can have multiple components, such as frequency of renewal, time frames represented, geographic resolution, or splitting categories into sub-categories or multilevel taxonomies. An example is how one can publish average daily temperature in a region. Let's assume it is currently published monthly, with one single value per day. Resolution of such a single value can be increased in multiple ways: publish the average daily temperature daily instead of monthly; split up the average daily temperature for the region into the average daily temperature per sensor in that region (geographic resolution); split up the average single sensor reading into hourly actual readings, or even more frequent ones. The highest resolution would be publishing real-time individual sensor readings continuously.
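
The temperature example in code form, to make the aggregation levels concrete; the readings are illustrative:

```python
# Resolution: the same underlying readings published at different levels
# of aggregation. Illustrative data: (sensor_id, hour, temperature).
from statistics import mean

readings = [
    ("sensor-a", 9, 11.2), ("sensor-a", 12, 14.8), ("sensor-a", 15, 13.1),
    ("sensor-b", 9, 10.4), ("sensor-b", 12, 15.6), ("sensor-b", 15, 12.9),
]

# Lowest resolution: one average value for the whole region for the day.
regional_daily = mean(t for _, _, t in readings)

# Higher geographic resolution: one daily average per sensor.
per_sensor = {
    sid: round(mean(t for s, _, t in readings if s == sid), 1)
    for sid in {s for s, _, _ in readings}
}

# Highest resolution shown here: the raw individual readings themselves.
print(f"regional daily average: {regional_daily:.1f}")
print(f"per sensor daily average: {per_sensor}")
```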

Improving resolution can only be done in collaboration with the holder of the actual source of the data. What level of improvement can be attained is determined by:

  1. The level of granularity and frequency at which the data is currently collected by the data holder
  2. The level of granularity or aggregation at which the data is used by the data holder for their public tasks
  3. The level of granularity or aggregation at which the data meets professional standards.

Item 1 provides an absolute limit to what can be done: what isn't collected cannot be published. Usually however data is not used internally in the exact form it was collected either. In terms of access to information, the practical limit to what can be published is usually the way that data is available internally for the data holder's public tasks. Internal systems and IT choices are usually shaped accordingly. Generally data holders can reliably provide data at the level of Item 2, because that is what they work with themselves.

However, there are reasons why data sometimes cannot be publicly provided the same way it is available to the data holder internally. These can be reasons of privacy or common professional standards. For instance energy companies have data on energy usage per household, but in the Netherlands such data is aggregated to groups of at least 10 households before publication because of privacy concerns. National statistics agencies comply with international standards concerning how data is published for external use. Census data for instance will never be published in the way it was collected, but only at various levels of aggregation.

Discussions on the desired level of resolution need to take place in collaboration with potential re-users of the data, not just the data holders. At what point does data become useful for different or novel types of usage? When does it meet needs adequately?

Together with data holders and potential data re-users the balance needs to be struck between re-use value and considerations of e.g. privacy and professional standards.

This type of granularity increase changes the content of the data. It improves the usage value of the data as it allows new types of queries on the data, and enables more nuanced contextualisation in combination with other datasets.

This article is a good description of the Freedom of Information (#foia #opengov #opendata) situation in the Balkans. Due to my work in the region, I recognise lots of what is described here. That work, such as in Serbia, has brought me into contact with various institutions willing to take evasive action to prevent the release of information.

In essence this is not all that different from what (decentralised) government entities in other European countries do as well. Many of them still see increased transparency and access as a distraction absorbing work and time they'd rather spend elsewhere. Yet, there's a qualitative difference in the level of obstruction. It's the difference between acknowledging there is a duty to be transparent but being hesitant about it, and not believing that there's such a duty in governance at all.

Secrecy, sometimes in combination with corruption, has a long and deep history. In Central Asia for instance I encountered an example where the number of agricultural machines wasn't released, as a 1950s Soviet law still on the books marked it as a state secret (because tractors could be mobilised in case of war). More disturbingly, such state secrecy laws are abused to tackle political opponents in Central Asia as well. When a government official releases information based on a transparency regulation, or as part of policy implementation, political opponents might denounce them for giving away state secrets and take them to court, the official even risking jail time.

There is a strong effort to increase transparency visible in the Balkan region as well. Both inside government, as well as in civil society. Excellent examples exist. But it’s an ongoing struggle between those seeing power as its own purpose and those seeking high quality governance. We’ll see steps forward, backwards, rear guard skirmishes and a mixed bag of results for a long time. Especially there where there are high levels of distrust amongst the wider population, not just towards government but towards each other.

One such excellent example is the work of the Serbian information commissioner Sabic. Clearly seeing his role as an ombudsman for the general population, he and his office led by example during the open data work I contributed to in the past years. By publishing statistics on information requests, complaints and answer times, and by publishing a full list of all Serbian institutions that fall under the remit of the Commission for Information of Public Importance and Personal Data Protection. This last thing is key, as some institutions will simply stall requests by stating transparency rules do not apply to them. Mr. Sabic’s term ended at the end of last year. A replacement for his position hasn’t been announced yet, which is both a testament to Mr Sabic’s independent role as information commissioner, and to the risk of less transparency inclined forces trying to get a much less independent successor.

Bookmarked Right to Know: A Beginner’s Guide to State Secrecy / Balkan Insight by Dusica Pavlovic (Balkan Insight)
Governments in the Balkans are chipping away at transparency laws to make it harder for journalists and activists to hold power to account.

There were several points made in the conversation after my presentation yesterday at Open Belgium 2019. This is a brief overview to capture them here.

1) One remark was about the balance between privacy and openness, and asking about (negative) privacy impacts.

The framework assumes government as the party interested in measurement (given that that was the assignment for which it was created). Government-held open data is by default not personal data, as re-use rules are based on access regimes which in turn all exclude personal data (with a few separately regulated exceptions). What I took away from the remark is that, since we know new privacy and other ethical issues may arise from working with data combinations, it might be of interest to formulate indicators that try to track negative outcomes or spot unintended consequences, in the same way as we try to track positive signals.

2) One question was about whether I had included all economic modelling work from academia etc.

I didn’t. This isn’t academic research either. It seeks to apply lessons already learned. What was included were existing documented cases, studies and research papers looking at various aspects of open data impact. Some of those are academic publications, some aren’t. What I took from those studies is two things: what exactly did they look at (and what did they find), and how did they assess a specific impact? The ‘what’ was used as potential indicator, the ‘how’ as the method. It is of interest to keep tracking new research as it gets published, to augment the framework.

3) Is this academic research?

No, its primary aim is to be a practical instrument for data holders as well as national open data policy makers. It's not meant to establish scientific truth, or to completely quantify impact once and for all. It's meant to establish whether there are signs the right steps are being taken, and whether that results in visible impact. The aim, and this connects to the previous question as well, is to avoid extensive modelling techniques, and favour indicators we know work, where the methods are straightforward. This is to ensure that government data holders are capable of doing these measurements themselves, and of using the framework actively as an instrument.

4) Does it include citizen science (open data) efforts?

This is an interesting one (asked by Lukas of Luftdaten.info). The framework currently does include, in a way, the existence and emergence of citizen science projects, as those would come up in any stakeholder mapping attempts and in any emerging ecosystem tracking, and as examples of using government open data (as context and background for citizen science measurements). But the framework doesn't look at the impact of such efforts, not in terms of socio-economic impact and not in terms of government being a potential user of citizen science data. Again, the framework is meant to make visible the impact of government opening up data. I think it's not very difficult to adapt the framework to track a citizen science project's impact. Adding citizen science projects in a more direct way, as indicators for the framework itself, is harder I think, as it needs more clarification of how they tie into the impact of open government data.

5) Is this based only on papers, or also on approaching groups, and people ‘feeling’ the impact?

This was connected to the citizen science bit. Yes, the framework is based on existing documented material only. And although a range of those documents base themselves on interviewing or surveying various stakeholders, that is not a default or deliberate part of how the framework was created. I do however recognise the value of for instance participatory narrative inquiry, which makes the real experiences of people visible, along with the patterns across those experiences. Including that sort of measurement would be especially useful for the social and societal impacts of open data. But currently none of the studies that were re-used in the framework took that approach. It does make me think about how one could set up something like that to monitor the impact of e.g. local government open data initiatives.

Today I gave a brief presentation of the framework for measuring open data impact I created for UNDP Serbia last year, at the Open Belgium 2019 Conference.

The framework is meant to be relatable and usable for individual organisations by themselves, and based on how existing cases, papers and research in the past have tried to establish such impact.

Here are the slides.

This is the full transcript of my presentation:

Last Friday, when Pieter Colpaert tweeted the talks he intended to visit (Hi Pieter!), he said two things. First he said after the coffee it starts to get difficult, and that’s true. Measuring impact is a difficult topic. And he asked about measuring impact: How can you possibly do that? He’s right to be cautious.

Because our everyday perception of impact and how to detect it is often too simplistic. Where's the next Google, the EC asked years ago. But it's the wrong question. We will only know in 20 years, when it is the new tech giant. Today it is likely a small start-up of four people with laptops and one idea, somewhere in Lithuania or Bulgaria, and framed this way we are by definition not able to recognise it. Asking for the killer app for open data is a similarly wrong question.

When it comes to impact, we seem to want one straightforward big thing. Hundreds of billions of euro impact in the EU as a whole, made up of a handful of wildly successful things. But what does that actually mean for you, a local government? And while you’re looking for that big impact you are missing all the smaller craters in this same picture, and also the bigger ones if they don’t translate easily into money.

Over the years however, there have been a range of studies, cases and research papers documenting specific impacts and effects. My colleagues and I started collecting those a long time ago. And I used them to help contextualise potential impacts, first for the Flemish government, and last year for the Serbian government. To show what observed impact in for instance a Spanish sector would mean in the corresponding Belgian context. How a global prediction correlates to the Serbian economy and government strategies.

The UNDP in Serbia, asked me to extend that with a proposal for indicators to measure impact as they move forward with new open data action plans in follow up of the national readiness assessment I did for them earlier. I took the existing studies and looked at what they had tried to measure, what the common patterns are, and what they had looked at precisely. I turned that into a framework for impact measurement.

In the following minutes I will address three things. First, what makes measuring impact so hard. Second, what the common patterns across existing research are. Third, how, avoiding the pitfalls and using the commonalities, we can build a framework that then in itself is an indicator. Let's first talk about the things that make measuring impact hard.

Judging by the available studies and cases there are several issues that make any easy answers to the question of open data impact impossible. There are a range of reasons measurement is hard. I'll highlight a few.
Number 3, context is key. If you don't know what you're looking at, or why, no measurement makes much sense. And you can only know that in specific contexts. But specifying contexts takes effort. It asks the question: where do you WANT impact?

Another issue is showing the impact of many small increments. Like how every Dutch person looks at this most used open data app every morning, the rain radar. How often has it changed a decision from taking the car to taking a bike? What does it mean in terms of congestion reduction, or emission reduction? Can you meaningfully quantify that at all?

Also important is who is asking for measurement. In one of my first jobs, my employer didn’t have email for all yet, so I asked for it. In response the MD asked me to put together the business case for email. This is a classic response when you don’t want to change anything. Often asking for measurement is meant to block change. Because they know you cannot predict the future. Motives shape measurements. The contextualisation of impact elsewhere to Flanders and Serbia in part took place because of this. Use existing answers against such a tactic.

Maturity and completeness of both the provision side, government, as well as the demand side, re-users, determine in equal measures what is possible at all, in terms of open data impact. If there is no mature provision side, in the end nothing will happen. If provision is perfect but demand side isn’t mature, it still doesn’t matter. Impact demands similar levels of maturity on both sides. It demands acknowledging interdependencies. And where that maturity is lacking, tracking impact means looking at different sets of indicators.

Measurements often motivate people to game the system. Especially single measurements. When number of datasets was still a metric for national portals the French opened with over 350k datasets. But really it was just a few dozen, which they had split according to departments and municipalities. So a balance is needed, with multiple indicators that point in different directions.

Open data, especially open core government registers, can be seen as infrastructure. But we actually don’t know how infrastructure creates impact. We know that building roads usually has a certain impact (investment correlates to a certain % rise in GDP), but we don’t know how it does so. Seeing open data as infrastructure is a logical approach (the consensus seems that the potential impact is about 2% of GDP), but it doesn’t help us much to measure impact or see how it creates that.

Network effects exist, but they are very costly to track. First order, second order, third order, higher order effects. We're doing case studies for ESA on how satellite data gets used. We can establish network effects, for instance how ice breakers in the Gulf of Bothnia use satellite data in ways that ultimately reduce supermarket prices, but doing 24 such cases is a multi-year effort.

E pur si muove! said Galileo: and yet it moves. The same is true for open data. Most measurements are proxies. They show something moving, without necessarily showing the thing that is doing the moving. Open data often is a silent actor, or a long range one. Yet still it moves.

Yet still it moves. And if we look at the patterns of established studies, that is what we indeed see. There are commonalities in the movement we see. In the list on the slide the last point, that open data is a policy instrument, is key. We know publishing data enables other stakeholders to act. When you do that on purpose you turn open data into a policy instrument. The cheapest one you have, next to regulation and financing.

We all know the story of the drunk who lost his keys. He was searching under the light of a street lamp. Someone who came to help asked if he had lost the keys there. No, the drunk said, but at least there is light here. The same is true for open data. If you know what you published it for, at least you will be able to recognise relevant impact, if not all the impact it creates. Using it as a policy instrument is like switching on the lights.

Dealing with lack of maturity means having different indicators for every step of the way. Not just seeing if impact occurs, but also if the right things are being done to make impact possible: Lead and lag indicators

The framework then is built from what has been used to establish impact in the past, and what we see in our projects as useful approaches. The point here is that we are not overly simplifying measurement, but adapting it to whatever the context of a data provider or user is. Also there's never just one measurement, so a balanced approach is possible and you can't game the system. It covers various levels of maturity, from your first open dataset all the way to network effects. And you see that indicators that by themselves are too simple can still be used.

Additionally the framework itself is a large scale sensor. If one indicator moves, you should see movement in other indicators over time as well. If you throw a stone in the pond, you should see ripples propagate. This means that if you start with data provision indicators only, you should see measurements in other phases pick up. This allows you to both use a set of indicators across all phases, as well as move to more progressive ones when you outgrow the initial ones. Finally, some recommendations.

Some final thoughts. If you publish by default as an integral part of your processes, measuring impact, or building a business case, is not needed as such. But measurement is very helpful in the transition to that end game. Core data and core policy elements, and their stakeholders, are key. Measurement needs to be designed up front. Using open data as a policy instrument lets you define, at the least, the impact you are looking for. The framework is the measurement: only micro-economic studies really establish specific economic impact, but they only work in mature situations and cost a lot of effort, so you need to know when you are ready for them. Measurement however can start wherever you are, with indicators that reflect the overall open data maturity level you are at, while looking both back and forward. And because measurement can be done, as a data holder you should be doing it.

Just before leaving for Christmas, US Congress voted to approve a new law that mandates two key elements: public information is open by default and needs to be made actively available in machine readable format, and policy making should be evidence based. In order to comply, agencies will need to appoint a Chief Data Officer.

I think that while of those two the first (open data) is the more immediately visible, the second, evidence based policy making, is much more significant in the long term. Government, especially politics, is often willingly disinterested in policy impact evaluation. It's much more status enhancing to announce new plans than to admit previous plans didn't come to anything. Evidence based policy will help save money. Additionally government agencies will soon realise that doing evidence based policy making is made a lot easier if you already do open data well. The evidence you need is in that open data, and it being open allows all of us to go look for that evidence or its absence.

There’s one caveat to evidence based policy making: it runs the risk of killing any will to experiment. After all, by definition there’s no evidence for something new. So a way is needed in which new policies can be tried out as probes. To see if there’s emerging evidence of impact. Again, that evidence should become visible in existing open data streams. If evidence is found the experimental policy can be rolled out more widely. Evidence based policies need experiments to help create an evidence base, not just of what works but also of what doesn’t.

A great result for the USA’s open government activists. This basically codifies the initiatives of the Obama Presidency, which were the trigger for much of the global open data effort these last 10 years, into US federal law.

Recently I have been named the new chairman of the board of the Open State Foundation. This is a new role I am tremendously looking forward to taking up. The Open State Foundation is the leading Dutch NGO concerning government transparency. Over the past years they've both persistently and in a very principled way pursued open data and government transparency, and constructively worked with government bodies to help them do better. Stef van Grieken, the chairman stepping down, has led the Open State Foundation board since it came into existence. The Open State Foundation is the merger of two earlier NGOs: The New Voting (Het Nieuwe Stemmen) foundation, of which Stef was the founder, and the Hack the Government (Hack de Overheid) collective.

Hack de Overheid emerged from the very first Dutch open government barcamp James Burke, Peter Robinett and I organised in the spring of 2008. The second edition in 2009 was the first Hack de Overheid event. My first open data project that same spring was together with James Burke and Alper Çuğun, both part of Hack de Overheid then and providing the tech savvy, and me being the interlocutor with the Ministry for the Interior, to guide the process and interpret the civil servant speak to the tech guys and vice versa. At the time Elsevier (a conservative weekly) published an article naming me one of the founders of Hack de Overheid, which was true in spirit, if technically incorrect.

In the past year and a half I had more direct involvement with the Open State Foundation than in the years between. Last year I did an in-depth evaluation of the effectiveness and lasting impact of the Open State Foundation in the period 2013-2017 and facilitated a discussion about their future, at the request of their director and one of their major funders. That made me appreciate their work in much richer detail than before. My company The Green Land and Open State Foundation also encounter each other on various client projects, giving me a perspective on the quality of their work and their team.

When Stef, who has been working in the USA for the past years, indicated he thought it time to leave the board, it coincided with me having signalled to the Open State Foundation that, if there ever was a need, I'd be happy to volunteer for the board. That moment thus came sooner than I expected. A few weeks ago Stef and I met up to discuss it, and the most recent board meeting made it official.

Day to day the Open State Foundation is run by a very capable team and director. The board is an all volunteer ‘hands-off’ board, that helps the Open State Foundation guard its mission and maintain its status as a recognised charity in the Netherlands. I’m happy that I can help the Open State Foundation to stay committed to their goals of increasing government transparency and as a consequence the agency of citizens. I’m grateful to Stef, and the others that in the past decade have helped Open State Foundation become what it is now, from its humble beginnings at that barcamp in the run-down pseudo-squat of the former Volkskrant offices, now the hipster Volkshotel. I’m also thankful that I now have the renewed opportunity to meaningfully contribute to something I in a tiny way helped start a decade ago.

Last week I presented to a provincial procurement team about how to better support open data efforts. Below is what I presented and discussed.

Open data as policy instrument and the legal framework demands better procurement

Publishing open data creates new activity. It does so in two ways. It allows existing stakeholders to do more themselves or do things differently. It also allows people who could not participate before to become active. We've seen for instance how opening up provincial and national geographic data increases the independent usage of that data by local governments. We've also seen how the Dutch hiking association started using national geographic data to create and better document routes. To the surprise of the Cadastre a whole new area of usage appeared as well, by cultural organisations that had never requested such data before. So open data is an enabler for agency.

If as a government data holder you know this effect takes place, you can also try to achieve it deliberately. For policy domains and groups of stakeholders where you would like to see more activity, publishing data is then an instrument for achieving, for instance, your own policy goals. Next to regulation and financing, publishing open data is a new, third policy instrument. It also happens to be the cheapest of the three to deploy.

Open data in the EU has a legal framework in which over time more things are mandated. There is a right to re-use. Upon request, data holders must be able to provide machine readable data formats. In the Netherlands open standards have been compulsory for government entities since 2008. Exclusive access to government data for re-use is, except for a few very strictly regulated situations, illegal.

To be able to comply with the legal framework, and to be able to actively use open data as a policy instrument, public sector bodies must pay more attention to how they acquire data, and as a consequence to what happens during procurement processes. If you don't, the government entity's data sovereignty is strongly diminished, which carries costs.

Procurement awareness needed on multiple levels

The goal is to ensure full data sovereignty. This means paying real attention to various things on different levels of abstraction around procurement.

  • Ensuring data is received in open standards and regular domain specific standards
  • Ensure when reports are received that the data used, such as for graphs and tables, are also received
  • Ensure when information products are received (maps, visualisations) the data used for them are also received
  • Ensure procurement and collaboration contracts do not preclude sharing data with third parties, apart from on grounds already mentioned as exceptions in the law on freedom of information and re-use
  • Ensure that when raw data is provided to service providers, that data is still available to the government entity
  • Ensure that when data is collected by external entities who in turn outsource the collection, all parties involved know the data falls under the decision making power of the government entity
  • Ensure in collaborations you do not sign away decision power over the data you contribute, you have rights to the data you collectively create, and have as little restriction as possible on the data others contribute.

What could go wrong?

Unless you always pay attention to these points, you run the risk of losing your data sovereignty. This can lead to situations where a government entity is no longer able to comply with its own legal obligations concerning data provision and transparency.

A few existing examples of what can go wrong:

  • A province is counting bicycle traffic through a network of sensors they deployed themselves. The data is directly transmitted to a service provider in a different country. The province can see dashboards and download reports, but has no access to the sensor data itself, and cannot download the sensor data. While any citizen requesting the data could not be provided with that data, the service provider itself does base commercial services on that and other data it receives, having de facto exclusive access to it.
  • Another province is outsourcing bird inventory counting to nature preservation organisations, who in turn rely on volunteers to do the bird watching. The province pays for the effort. When it comes to sharing the data publicly, the nature preservation organisations say their volunteers actually own the data, so nothing can be publicly shared. This is untrue for multiple reasons (database rights do not apply, and as it is a paid-for effort the procurement terms unequivocally transfer such rights, should they exist, to the province, etc.), but as the province doesn't want to waste time on this, nor wants to get into a fight, it leaves it be, resulting in the data not being made available.
  • An energy network provider pools a lot of different data sources concerning energy usage in their service area from a network of collaborating entities, both private and public. They also publish a lot of open data already. As part of the national effort towards energy transition they receive many data requests from local governments, housing associations and other entities. They would like to provide data, as they see it as a way of contributing to an essential public task (energy transition), but still say no to data requests in 60% of all cases. Because they can’t figure out which contractual obligations apply to which parts of the data, or cannot reconcile conflicting or ambiguous contract clauses concerning the data.
  • All provinces pool data concerning economic activity and the labor market in a private foundation in which also private entities participate. That foundation sells data subscriptions. Currently they also publish some open data, but if any of the provinces would like to do more, they would have to wait for full agreement. The slowest in the group would determine the actual level of transparency.
  • A province has outsourced the creation of a ‘heat transition atlas’, in which the potential for moving away from natural-gas-fired heating systems in homes, using various alternatives, is mapped. The resulting interactive website contains different data layers, but those data layers are themselves unavailable. Although there is a general list of which data sources have been used, the site does not precisely state its sources nor provide details on how the data has been transformed for the website.

In all cases the public sector data holder has put itself in a position that could have been prevented had they paid more attention at the time of procurement or at the time of entering into collaboration. All these situations can be fixed later on, but they require additional effort, time and costs to arrange, which are unnecessary if dealt with during procurement.

But we have procurement regulations already!

What about procurement regulations? We have those, so don't they cover all this? Mostly not, it turns out.

Terms of procurement talk about rights transfer for all deliverables, but in many cases the data involved isn't listed as a deliverable, and so isn't covered by those terms.
The terms talk about transfer of database rights, but those hardly ever apply, as usually the scale of data collection and structuring into a database is limited.
Concerning research there is some mention of also transferring the data concerned, but a lot of reports aren't research but consultancy services.

In the general regulations that apply to provincial procurement, the word data is only used in the context of personal data protection, as the Dutch plural for date, and in the context of data carriers (hard drives etc.). The word standards never occurs, nor are there references to data formats (even though legal obligations exist for government entities concerning standards and data formats).

The procurement terms are neither broad enough, nor detailed enough.

How to improve the situation

So what needs to be put in place to ensure government entities arrange their data needs correctly during procurement? How to plug the holes? A few things at the very least:

  • Likely, when it comes to standards and formats (which may differ per domain), the only viable place is in the mandatory technical requirements in a call for tender / request for proposals.
  • To get the data behind graphs, tables, info products and reports, including a list of resources and transformations applied, it needs to be specified in the list of deliverables.
  • Collaboration contracts entered into should always have articles on sharing the data you contribute, being able to share the data resulting from the collaboration, and rules about data that others contribute.

It is important to realise that you cannot through contracts do away with any mandatory transparency, open data, or data governance aspects. Any resulting issues will mean time consuming and likely costly repair activities.

Who needs to be involved

In order to prevent the costs of repair or mitigation of consequences, there are a number of questions concerning who should be doing what, inside a government entity.

  • What needs to be arranged at the point of tender, who will check it?
  • What needs to be part of all project starts (e.g. checklists, data paragraphs), is the project manager aware of this, and who will check it?
  • Who at the writing and signing of any contract will check data aspects?
  • Who at the time of delivery will check if data requirements are met?
  • What part of this is more about awareness and operations, and what needs to be done through regulation?

Our work in the next steps

We intend to assist the province involved in making sure procurement better enables data sharing from now on. Steps we are currently taking to move this forward are:

  • We’ve put data sovereignty into the organisation’s strategy document, and tied it into overall data governance improvement.
  • With the information management department we’ll visit all main procurers to discuss and propose actions
  • We’ll likely build one or more checklists for different aspects
  • We’ll work with a 3 person team from the procurement department to more deeply embed data awareness and amend procurement processes

All this is basically a preventative step to ensure the province has its house in order concerning data.

During his keynote at the Partos Innovation Festival Kenyan designer Mark Kamau mentioned that “45% of Kenya’s GDP was mobile.” That is an impressive statistic, so I wondered if I could verify it. With some public and open data, it was easy to follow up.

World Bank data pegs Kenya’s GDP in 2016 at some 72 billion USD.
Kenya’s central bank publishes monthly figures on the volume of transactions through mobile; for September 2018 it reports 327 billion KSh, while the lowest monthly figure is February’s at 300 billion. With 100 KSh being roughly equivalent to 1 USD, this means the monthly transaction volume exceeds 3 billion USD every month. For a year this means at least 3*12=36 billion USD, or about half of the 2016 GDP figure. An amazing volume.
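
The same back-of-the-envelope check in code form, using the figures cited above:

```python
# Back-of-the-envelope check of the 'mobile share of GDP' claim,
# using the publicly available figures cited above.
GDP_2016_USD = 72e9           # World Bank: Kenya GDP, 2016
KSH_PER_USD = 100             # rough exchange rate used above
monthly_volume_ksh = 300e9    # lowest monthly figure (February)

monthly_volume_usd = monthly_volume_ksh / KSH_PER_USD  # 3 billion USD
annual_volume_usd = 12 * monthly_volume_usd            # 36 billion USD

print(f"annual mobile transaction volume: {annual_volume_usd / 1e9:.0f}B USD")
print(f"share of 2016 GDP: {annual_volume_usd / GDP_2016_USD:.0%}")  # 50%
```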

For the UNDP in Serbia, I made an overview of existing studies into the impact of open data. I’ve done something similar for the Flemish government a few years ago, so I had a good list of studies to start from. I updated that first list with more recent publications, resulting in a list of 45 studies from the past 10 years. The UNDP also asked me to suggest a measurement framework. Here’s a summary overview of some of the things I formulated in the report. I’ll start with 10 things that make measuring impact hard, and in a later post zoom in on what makes measuring impact doable.

While it is tempting to ask for a ‘killer app’ or ‘the next tech giant’ as proof of impact of open data, establishing the socio-economic impact of open data cannot depend on that. Both because answering such a question is only possible with long-term hindsight, which doesn’t help make decisions in the here and now, and because it would ignore the diversity of types of impacts of varying sizes known to be possible with open data. Judging by the available studies and cases there are several issues that make any easy answers to the question of open data impact impossible.

1 Dealing with variety and aggregating small increments

There are different varieties of impact, in all shapes and sizes. If an individual stakeholder, such as a citizen, does a very small thing based on open data, like making a different decision on some day, how do we express that value? Can it be expressed at all? E.g. in the Netherlands the open data based rain radar is used daily by most cyclists, to see if they can get to the railway station dry, better wait ten minutes, or rather take the car. The impact of a decision to cycle can mean lower individual costs (no car usage), personal health benefits, economic benefits (lower traffic congestion), environmental benefits (lower emissions) etc., but is nearly impossible to quantify meaningfully as a single act. Only where such decisions are stimulated, e.g. by providing open data that allows much smarter, multi-modal route planning, may aggregate effects become visible, such as a reduction of traffic congestion hours in a year, general health benefits for the population, or a reduction in traffic fatalities, which can be much better expressed as a monetary value to the economy.

2 Spotting new entrants, and tracking SMEs

The existing research shows that previously inactive stakeholders, and small to medium sized enterprises are better positioned to create benefits with open data. Smaller absolute improvements are of bigger value to them relatively, compared to e.g. larger corporations. Such large corporations usually overcome data access barriers with their size and capital. To them open data may even mean creating new competitive vulnerabilities at the lower end of their markets. (As a result larger corporations are more likely to say they have no problem with paying for data, as that protects market incumbents with the price of data as a barrier to entry.) This also means that establishing impacts requires simultaneously mapping new emerging stakeholders and aggregating that range of smaller impacts, which both can be hard to do (see point 1).

3 Network effects are costly to track

The research shows the presence of network effects, meaning that the impact of open data is not contained in, or even mostly specific to, the first order of re-use of that data. Cascading effects as well as second and higher order forms of re-use regularly occur and quickly become, certainly in aggregate, much larger than the value of the original form of re-use. For instance the European Space Agency (ESA) commissioned my company for a study into the impact of open satellite data for ice breakers in the Gulf of Bothnia. The direct impact for ice breakers is saving costs on helicopters and fuel, as the satellite data makes determining where the ice is thinnest much easier. But the aggregate value of the consequences of that is much higher: it creates a much higher predictability of ships and the (food) products they carry arriving in Finnish harbours, which means lower stocks are needed to ensure supply of these goods. This reverberates across the entire supply chain, saving costs in logistics and allowing lower retail prices across Finland. When mapping such higher order and network effects, every step further down the chain of causality shows that while the bandwidth of value created increases, the certainty that open data is the primary contributing factor decreases. Such studies also are time consuming and costly. It is often unlikely and unrealistic to expect data holders to go through such lengths to establish impact. The mentioned ESA example is part of a series of over 20 such case studies ESA commissioned over the course of 5 years, at considerable cost for instance.

4 Comparison needs context

Without context, of a specific domain or a specific issue, it is hard to assess benefits, and compare them to their associated costs, which is often the underlying question concerning the impact of open data: does it weigh up against the costs of the open data effort? Even though in general open data efforts shouldn’t be costly, how does some type of open data benefit compare to the costs and benefits of other actions? Such comparisons can be made in a specific context (e.g. comparing the cost and benefit of open data for route planning with other measures to fight traffic congestion, such as increasing the number of lanes on a motorway, or increasing the availability of public transport).

5 Open data maturity determines impact and type of measurement possible

Because open data provisioning is a prerequisite for it having any impact, the availability of data and the maturity of open data efforts determine not only how much impact can be expected, but also what can be measured: mature impact might be measured as e.g. a reduction of traffic congestion hours in a year, while early impact might be measured as the number of re-users of a data set steadily growing year over year.
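As a sketch of what such an early-stage measurement could look like, the snippet below computes year-over-year growth in re-users; the yearly counts are hypothetical:

```python
# Hypothetical yearly counts of distinct re-users of a data set,
# used as an early-stage proxy measurement before mature impact
# (e.g. congestion hours) can be observed. Counts are made up.
reusers_per_year = {2019: 12, 2020: 20, 2021: 33, 2022: 51}

years = sorted(reusers_per_year)
for previous, current in zip(years, years[1:]):
    growth = (reusers_per_year[current] / reusers_per_year[previous] - 1) * 100
    print(f"{previous} -> {current}: {growth:+.0f}% re-users")
```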

6 Demand side maturity determines impact and type of measurement possible

Whether open data creates much impact is not only dependent on the availability of open data and the maturity of the supply side, even if those are, as mentioned, prerequisites. Judging by the existing research, impact is certain to emerge, but the size and timing of that impact depend on a wide range of factors on the demand side as well, including things such as the skills and capabilities of stakeholders, time to market, location and timing. An idea for open data re-use that finds no traction in France, because the initiators can’t bring it to fruition or because the potential French demand is too low, may well find its way to success in Bulgaria or Spain, because local circumstances and markets differ. In the Serbian national open data readiness assessment I performed for the World Bank and the UNDP in 2015, this is reflected in the various dimensions assessed, which cover both supply and demand, as well as general aspects of Serbian infrastructure and society.

7 We don’t understand how infrastructure creates impact

The notion of broad open data provision as public infrastructure (such as the UK, Netherlands, Denmark and Belgium are already implementing, and Switzerland is starting to do) further underlines the difficulty of establishing the general impact of open data on e.g. growth. That infrastructure (such as roads, telecoms, electricity) is important to growth is broadly acknowledged, and accepted within policy making accordingly. This acceptance that the quantity and quality of infrastructure increase human and physical capital does not, however, mean that it is clear how much a given type of infrastructure contributes to economic production and growth at a given time. Public capital is often used as a proxy to ascertain the impact of infrastructure on growth. The consensus is that there is a positive elasticity, meaning that an increase in public capital results in an increase in GDP, averaging around 0.08, but varying across studies and types of infrastructure. Assuming such positive elasticity extends to open data provision as infrastructure (and we have very good reasons to do so), it will result in GDP growth, but without a clear view overall as to how much.
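Spelled out, an elasticity of 0.08 means a 10% increase in public capital is on average associated with a 0.8% increase in GDP. A minimal calculation, using an invented GDP figure, shows the orders of magnitude involved:

```python
# Worked example of an output elasticity of public capital of 0.08:
# % change in GDP ~ elasticity * % change in public capital.
elasticity = 0.08
capital_increase_pct = 10.0   # a 10% increase in public capital
gdp = 800e9                   # invented GDP of EUR 800 billion, for scale

gdp_growth_pct = elasticity * capital_increase_pct   # -> 0.8%
gdp_growth_abs = gdp * gdp_growth_pct / 100

print(f"GDP growth: {gdp_growth_pct:.1f}% "
      f"(~EUR {gdp_growth_abs / 1e9:.1f} billion)")
```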

8 E pur si muove

Most measurements concerning open data impact need to be understood as proxies. They are not measuring how open data is creating impact directly, but from measuring a certain movement it can be surmised that something is doing the moving. Where opening data can be assumed to be doing the moving, and where opening data was a deliberate effort to create such movement, impact can then be assessed. We may not be able to easily see it, but still it moves.

9 Motives often shape measurements

Apart from the difficulty of measuring impact and the effort involved in doing so, there is also the question of why such impact assessments are needed. Is an impact assessment needed to create support for ongoing open data efforts, or to make existing efforts sustainable? Is an impact measurement needed for comparison with specific costs for a specific data holder? Is it to be used for evaluation of open data policies in general? In other words, in whose perception should an impact measurement be meaningful?
The purpose of impact assessments for open data further determines and/or limits the way such assessments can be shaped.

10 Measurements get gamed, become targets

Finally, with any type of measurement, there needs to be awareness that those with a stake in a measurement are likely to try and game the system, especially where measurements determine funding for further projects or the continuation of an effort. This calls for caution when determining indicators, because measurements easily become targets in themselves. For instance in the early days of national open data portals being launched worldwide, a simple metric often reported was the number of datasets a portal contained. This is an example of a ‘point’ measurement that can easily be gamed, for instance by subdividing a dataset into several subsets. The first version of the national portal of a major EU member did precisely that and boasted several hundred thousand data sets at launch, which were mostly small subsets of a bigger whole. It briefly made for good headlines, but did not make for impact.
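A toy illustration of how such a ‘point’ metric inflates without any new content being published (the dataset names are hypothetical):

```python
# Toy illustration: the 'number of datasets' metric is gamed by
# subdividing one dataset, while the underlying content is unchanged.
portal = ["national budget 2014"]          # one real dataset
print(len(portal))                         # metric reads: 1

# Split the same content per ministry and per quarter (names invented):
gamed = [f"budget 2014 {ministry} Q{quarter}"
         for ministry in ("finance", "health", "education")
         for quarter in (1, 2, 3, 4)]
print(len(gamed))                          # metric reads: 12, content: same
```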

In a second part I will take a closer look at what these 10 points mean for designing a measurement framework to track open data impact.

This week I am in Novi Sad for the plenary of the Assembly of European Regions. Novi Sad is the capital of the Vojvodina, a member region, and the host of the plenary meetings of the AER.

I took part in a panel to discuss the opportunities of open data at regional level. The other panelists were my Serbian UNDP colleague Slobodan Markovic, Brigitte Lutz of the Vienna open data portal (whom I hadn’t met in years), Margreet Nieuwenhuis of the European open data portal, and Geert-Jan Waasdorp who uses open data about the European labour market commercially.

Below are the notes I used for my panel contributions:

Open data is a key building block for any policy plan. The Serbian government certainly treats it as such, judging by the PM’s message we just heard, and the same should be true for regional governments.

From an organisational standpoint, open data is only sustainable if it is directly connected to primary policy processes, and not just an additional step or effort after the ‘real’ work has been done. It’s only sustainable if it means something for your own work as a regional administration.

We know that open data allows people and organisations to take new actions. These, by themselves or in aggregate, have an impact on policy domains. E.g. parents choosing schools for their children or finding housing, multimodal route planning, etc.

So if you know this effect exists, you can use it on purpose. Publish data to enable external stakeholders. You need to ask yourself: around which policy issues do you want to enable more activity? Which stakeholders do you want to enable or nudge? Which data will be helpful for that, if put into the hands of those stakeholders?

This makes open data a policy instrument. Next to funding and regulation, publishing open data for others to use is a way to influence stakeholder behaviour. By enabling them and partnering with them.
It is actually your cheapest policy instrument, as the cost of data collection is a sunk cost, already incurred as part of your public task.

Positioning open data this way, as a policy instrument, requires building connections between your policy issues, external stakeholders and their issues, and the data relevant in that context.

This requires going outside, listening to stakeholders, and understanding the issues they want to solve and the things they care about. You need to avoid making assumptions.

We worked with various regional governments in the Netherlands, including the two Dutch AER members Flevoland and Gelderland. With them we learned that having those outside conversations is maybe the hardest part: creating conversations between a policy domain expert, an internal data expert, and the external stakeholders. There’s often a certain apprehension about reaching out like that and having an open ended conversation on an equal footing. From those conversations you learn different things: that your counterparts are also professionals interested in achieving results and using the available data responsibly, and that the ways in which others have shaped their routines and processes are usually invisible to you, and may surprise you.
In Flevoland there’s a programme of large scale maintenance on bridges and water locks in the coming 4 years. One of the provincial aims was to reduce hindrance, but an open question was what constitutes hindrance to different stakeholders. Only by talking to e.g. farmers did it become clear that the maintenance plans themselves were less relevant than changes to those plans: a farmer rents equipment a week before some work needs to be done in the fields. If within that week a bridge unexpectedly becomes blocked, he can’t reach his fields with the rented equipment and damage is done. Also relevant is exploring which channels are useful to stakeholders for data dissemination. Finding channels that stakeholders already use, or channels that connect to those, is key. You can’t assume people will use whatever special channel you may think of building.

Whether it is about bridge maintenance, archeology, nitrate deposition, better usage of Interreg subsidies, or flash flooding after rainfall, talking about open data in terms of innovation and job creation is hollow and meaningless if it is not connected to one of those real issues. Only real issues motivate action.

Complex issues rarely have simple solutions. That is true for mobility, energy transition, demographic pressure on public services, emission reduction, and everything else regional governments are dealing with. None of this can be fixed by an administration on its own, so you benefit from enabling others to do their part. This includes local governments as a stakeholder group. Your own public sector data is one of the most easily available enablers in your arsenal.