It’s only morning on Easter Sunday, but in Germany, over 160 kilometers away, Easter fires were already burning on Saturday evening. This morning we woke up to a distinct smell of burning outside (and not just the wood-smoke kind, but plastics too). Dutch Easter fires usually burn on Easter Sunday, not the evening before, so we looked up whether there had been a nearby fire. There hadn’t: it’s Easter fires from far away.

The national air quality sensor grid documents the spike in airborne particles clearly.
First, a sensor near where E’s parents live, on the border with Germany.

A clear PM10 spike starts on Saturday evening and keeps going throughout the night. It tops out at well over 200 micrograms per cubic meter of air at 6 am this morning, more than five times the annual average norm deemed acceptable (40 micrograms per cubic meter).

The second graph below is from a sensor on a busy road in Utrecht, about 20 minutes from here and 180 kilometers from the previous sensor. The spike starts during the night, when the wind has finally blown the smoke here, and stands at just over 80 micrograms per cubic meter of air at 8 am, double the annual average norm deemed acceptable.

This likely isn’t the peak value yet, as a sensor reading upwind from us shows readings still rising at 9 am:

On a map the sensor points show how the smoke is coming from the north-east. The red dot at the top right is Ter Apel, the location of the first sensor reading shown above; the other red points, moving west and south, have their peaks later or are still showing rising PM10 values.

The German website luftdaten.info also shows nicely how the smoke from the north-western part of Germany, between Oldenburg and the Dutch border, is moving across the Netherlands.
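For anyone who wants to reproduce these graphs, the readings can also be pulled programmatically. Below is a minimal sketch that fetches recent PM10 values for one station from the national grid’s open API (presumably Luchtmeetnet) and flags hours above the annual average norm of 40 micrograms per cubic meter. The endpoint, parameter names, response structure and station number are assumptions, so check the API documentation before relying on them.

```python
# Sketch: fetch recent PM10 readings for one station and flag exceedances.
# The endpoint, parameters and response fields below are assumptions about
# the Luchtmeetnet open API; verify them against the official documentation.
import requests

API = "https://api.luchtmeetnet.nl/open_api/measurements"  # assumed endpoint
ANNUAL_NORM = 40  # EU annual average limit value for PM10, in µg/m³

def pm10_exceedances(station_number: str):
    resp = requests.get(
        API,
        params={"station_number": station_number, "formula": "PM10"},  # assumed params
        timeout=30,
    )
    resp.raise_for_status()
    readings = resp.json().get("data", [])
    # Each reading is assumed to carry a timestamp and a value in µg/m³.
    return [
        (r["timestamp_measured"], r["value"])
        for r in readings
        if r["value"] > ANNUAL_NORM
    ]

if __name__ == "__main__":
    # Hypothetical station number; the Ter Apel sensor has its own ID on the portal.
    for ts, value in pm10_exceedances("NL10107"):
        print(f"{ts}: {value:.0f} µg/m³ ({value / ANNUAL_NORM:.1f}x the annual norm)")
```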

The wind isn’t going to change much, so tomorrow the smell will likely be worse, as by then all the Easter fires from Twente will have burnt as well, adding their emissions to the mix.

Two years ago a colleague let their dog swim in a lake without paying attention to the information signs. It turned out the water was infested with a type of algae that irritated the dog. Since then my colleague has thought it would be great if you could somehow subscribe to notifications for when the quality status of nearby surface water changes.

Recently this colleague took a look at the province’s external communications concerning swimming waters. A provincial government has specific public tasks in designating swimming waters and monitoring their quality. It turns out there are six (6) public information or data sources concerning swimming waters from the particular province my colleague lives in.

My colleague compared those 6 datasets on a number of criteria: factual correctness, comparability based on an administrative index or key, and comparability on spatial / geographic aspects. Factual correctness here means whether the right objects are represented in the data sets. Are the names, geographic locations, and statuses (safe, caution, unsafe) correct? Are details such as available amenities represented correctly everywhere?

Als ze me missen, ben ik vissen (“If they miss me, I’ve gone fishing”)
A lake (photo by facemepls, license CC-BY)

As it turns out, each of the 6 public data sets contains a different number of objects. The 6 data sets cannot be connected based on a unique key or ID. Slightly more than half of the swimming waters can be correlated across the 6 data sets by name, but a spatial/geographic connection isn’t always possible. 30% of swimming waters have the wrong status (safe/caution/unsafe) on the provincial website! And 13% of swimming waters are wrongly represented geometrically, meaning they end up in completely wrong locations, and even the wrong municipalities, on the map.
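To make the matching problem concrete, here is a minimal sketch of pairing records from two such datasets by normalised name or by geographic proximity. The field names and the distance threshold are hypothetical, not taken from the provincial sources.

```python
# Sketch: cross-match two swimming water datasets by normalised name and by
# geographic proximity. Field names ("name", "lat", "lon") are hypothetical;
# the actual provincial sources each use their own schema.
from math import radians, sin, cos, asin, sqrt

def normalise(name: str) -> str:
    return " ".join(name.lower().replace("-", " ").split())

def distance_m(a, b) -> float:
    # Haversine distance between two (lat, lon) points, in meters.
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(h))

def match(dataset_a, dataset_b, max_distance_m=250):
    """Pair records that share a normalised name or lie close together."""
    matches, unmatched = [], []
    for rec_a in dataset_a:
        candidates = [
            rec_b for rec_b in dataset_b
            if normalise(rec_a["name"]) == normalise(rec_b["name"])
            or distance_m((rec_a["lat"], rec_a["lon"]), (rec_b["lat"], rec_b["lon"])) <= max_distance_m
        ]
        (matches if candidates else unmatched).append((rec_a, candidates))
    return matches, unmatched
```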

At the start of every year the provincial government takes a decision designating the public swimming waters. Yet this province’s decision cannot be found online (even though it was taken last February, and publication is mandatory). Only a draft decision can be found, on the website of one of the municipalities concerned.

The differences between the 6 data sets more or less reflect the internal division of tasks within the province. Every department keeps its own files and its own dataset. One is responsible for designating public swimming waters, another for monitoring swimming water quality. Yet another for making sure those swimming waters are represented in overall public planning / environmental plans. Another for the placement and location of information signs about the water quality, and still another for placing that same information on the website of the province. Every unit has its own task and keeps its own data set for it.

Which ultimately means large inconsistencies internally, and a confusing mix of information being presented to the public.

As of today it is final: the new EU copyright directive has been adopted (ht Julia Reda). I am pleased to see my government voted against, as it did in earlier stages, and as my MEPs did. Sadly it hasn’t been enough to cut Articles 11 and 13, despite the mountain of evidence and protests against both. It is interesting and odd to see both Spain and Germany vote in favour, given the failure of their respective laws on which Article 11 is based, and the German government coalition parties’ stated position against content filters (i.e. Article 13).

Over the next two years it will be important to track the legislative efforts in Member States implementing this Directive. Countries that voted against or abstained will, I suspect, try to find the most meaningless implementation of both Articles 11 and 13 and emphasise the useful bits in other parts of the Directive, while being subjected to intense lobbying efforts both for and against. The resulting differences in interpretation across Member States will be of interest. I am also looking forward to following the court challenges that will undoubtedly result.

In the meantime, you as an internet citizen have two more years to build and extend your path away from the silos where Articles 11 and 13 will be an obstacle to you. Run your own stuff, decentralise and federate. Walk away from the big platforms. But most of all, interact with creators and makers directly, both when it comes to re-using or building on their creations and when it comes to supporting them. Articles 11 and 13 will not bring any creator new revenue; the dominant entertainment industry mediators are the ones set to profit from rent seeking. Vote with your feet and wallet.

As part of my work for a Dutch regional government, I was asked to compare the open data offerings of the 12 provinces. I wanted to use something that levels the playing field for all parties compared and prevents me from comparing apples to oranges, so I opted for the Dutch national data portal as the data source. An additional benefit is that the Dutch national portal (a CKAN instance) has a well-defined API, and uses standardised vocabularies for the different government entities and functions of government.

I am interested in openness, findability, completeness, re-usability, and timeliness. For each of those I tried to pick something available through the API that can serve as a proxy for one or more of those factors.

The following aspects seemed most useful:

  • openness: use of open licenses
  • findability: are datasets categorised consistently and accurately so they can be found through the policy domains they pertain to
  • completeness: does a province publish across the entire spectrum of a) the national government’s list of policy domains, and b) all 7 core tasks as listed by the association of provincial governments
  • completeness: does a province publish more than just geographic data (most of their tasks are geo-related, but definitely not all)
  • re-usability: in which formats do provinces publish, and are these a) open standards, b) machine readable, c) structured data

I could not establish a useful proxy for timeliness, as all the timestamps available through the API of the national data portal actually represent processes (when the last automatic update ran), and contain breaks (the platform was updated late last year, and all timestamps were from after that update).
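As an indication of how such measurements can be pulled from the portal, below is a minimal sketch that counts datasets and openly licensed datasets per province via CKAN’s standard package_search action. The base URL, the organisation slug and the set of license_id values are assumptions about the portal’s configuration, not verified against it.

```python
# Sketch: count datasets and open licenses per province via the national
# portal's CKAN API. The base URL, organisation identifier and license_id
# values are assumptions; adjust them to the portal's actual vocabulary.
import requests

CKAN = "https://data.overheid.nl/data/api/3/action/package_search"  # assumed base URL
OPEN_LICENSES = {"cc-zero", "cc-by", "publiek-domein"}  # assumed license_id values

def province_stats(organization: str):
    resp = requests.get(
        CKAN,
        params={"fq": f"organization:{organization}", "rows": 1000},
        timeout=60,
    )
    resp.raise_for_status()
    datasets = resp.json()["result"]["results"]
    open_count = sum(1 for d in datasets if d.get("license_id") in OPEN_LICENSES)
    return {"total": len(datasets), "open": open_count}

if __name__ == "__main__":
    # Hypothetical organisation slug; each province has its own identifier in the portal.
    print(province_stats("provincie-utrecht"))
```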

Provinces publish data in three ways, and the API of the national portal makes the source of a dataset visible:

  1. they publish geographic data to the Dutch national geographic register (NGR), from which metadata is harvested into the Dutch open data portal. It used to be that only openly licensed data was harvested, but since November last year closed-licensed data is also being harvested into the national portal. This seems to be by design, but the major shift has not been communicated at all.
  2. they publish non-geographic data to dataplatform.nl, a CKAN platform provided as a commercial service to government entities for hosting open data (the national portal only registers metadata and doesn’t store data). Metadata is automatically harvested into the national portal.
  3. they upload metadata directly to the national portal by hand, pointing to specific data sources online elsewhere (e.g. the API of an image library)

Most provinces only publish through the National Geo Register (NGR). Last summer I blogged about that in more detail, and nothing has changed really since then.

I measured the mentioned aspects as follows:

  • openness: a straight count of openly licensed data sets. It is national policy to use public domain, CC0 or CC-BY, and this is reflected in what provinces do. So no need to distinguish between open licenses, just between open and not-openly licensed material
  • findability: it is mandatory to categorise datasets, but voluntary to add more than one category, with a maximum of 3. I looked at the average number of categories per dataset for each province. One province only ever uses a single term, some consistently provide more complete categorisation, and most end up in between those two.
  • completeness: looking at those same categories, a total of 22 different ones were used. I also looked at how many of those 22 each province uses. As all their tasks are similar, the extent to which they cover all used categories is a measure of how well they publish across their spectrum of tasks. Additionally, provinces have defined 7 core tasks for themselves, to which those categories can be mapped, so I also looked at how many of those 7 are covered. There are big differences in the breadth of scope of what provinces publish.
  • completeness: while some 80% of all provincial data is geo-data and 20% non-geographic, less than 1% of open data is non-geographic data. To see which provinces publish non-geographic data, I used the source of a dataset (i.e. not harvested from the NGR), and did a quick manual check on the nature of what was published (as it was just 22 data sets out of over 3000, this was still easily done by hand).
  • re-usability: for all provinces I polled the formats in which data sets are published. Data sets can be published in multiple formats. All used formats I judged on being a) open standards, b) machine readable, c) structured data. Formats that match all three get 3 points, formats that are machine readable and structured but not an open standard get 1 point, and formats that are not machine readable or not structured get no points. I then divided the total number of points by the total number of data formats used. This yields a score of at most 3: the closer a province gets to 3, the more of its data matches the open definition (a minimal sketch of this scoring follows below this list).
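Here is a minimal sketch of that re-usability scoring rule. Which formats count as open standard, machine readable and structured is an assumed classification, for illustration only.

```python
# Sketch of the re-usability score described above. The format classification
# is an assumption and would need to reflect the one actually used.
FORMAT_PROPERTIES = {
    # format: (open standard, machine readable, structured) -- assumed classification
    "CSV":  (True,  True,  True),
    "JSON": (True,  True,  True),
    "WMS":  (True,  True,  True),
    "XLSX": (False, True,  True),
    "SHP":  (False, True,  True),
    "PDF":  (False, False, False),
}

def reusability_score(formats):
    """Average points per format: 3, 1 or 0 according to the scoring rule above."""
    points = 0
    for fmt in formats:
        open_std, machine, structured = FORMAT_PROPERTIES.get(fmt, (False, False, False))
        if open_std and machine and structured:
            points += 3
        elif machine and structured:
            points += 1
    return points / len(formats) if formats else 0.0

# Example: a province publishing in WMS, SHP and PDF scores (3 + 1 + 0) / 3 = 1.33
print(reusability_score(["WMS", "SHP", "PDF"]))
```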

As all this is based on the national portal’s API, getting the data and calculating scores can be automated as an ongoing measurement, building a time series of e.g. monthly checks to track development. My process only contained one manual action (concerning non-geo data), but that too could be automated, followed up at most with a quick manual inspection.

In terms of results (which have now first been communicated to our client), what becomes visible is that some provinces score high on a single measure, and it is easy to spot who has (automated) processes in place for one or more of the aspects looked at. Also interesting is that the overall best-scoring province is not the best-scoring on any single aspect, but high enough on all of them to have the highest average. It’s also a province that put quite a lot of work into all steps of the chain that leads to open data, both internally and in publication.

Harold Jarche looked at his most visited blog postings over the years, and concludes his blog conforms to Sturgeon’s Revelation that 90% of everything is crap.

I recognise much of what Harold writes. I suspect this is also what feeds impostor syndrome. You see the very mixed bag of results from your own efforts, and how most of it is ‘crap’. The few ‘hits’ for which you get positive feedback are then either ‘luck’ or should be normal, not sparse. Others of course forget most if not all of your less stellar products and remember mostly the ones that stood out. Only you are in a position to compare what others respond to with your internal perspective.

At the same time, like Harold, I’ve realised that it is important to do things, to keep blogging and writing in this space. Not because of its sheer brilliance, but because most of it will be crap, and brilliance will only occur once in a while. You need to produce lots of stuff to increase the likelihood of hitting on something worthwhile. Of course that very much feeds the impostor cycle, but it’s the only way. Getting back into a more intensive blogging habit 18 months ago has helped me explore more and better. Most of what I blog here isn’t very meaningful in itself, but it needs to be gotten out of the way, or serves as scaffolding towards something with more meaning.

It’s why I always love to see (photographs of) artists’ studios. The huge mess and mountains of crap. The two hundred attempts at getting a single thing to feel right for once. Often we see masterpieces only nicely presented and lit on a gallery wall. But the artist never saw it like that: s/he inhabits that studio where whatever ends up on a museum wall someday is just one thing in a mountain of other things, between aborted efforts, multiple works in progress, random objects and yesterday’s newspaper.

Frank Meeuwsen and I have picked a date and found a venue. IndieWebCamp is going ahead, on 18 and 19 May in Utrecht, in the space of Shoppagina.nl at Kanaalweg 14-L. A rudimentary site with the announcement can be found at IndieWebCamp, and the same info is in the IndieWeb wiki.

There is room for at most 35 people. So if you really want to attend, send me or Frank an e-mail now. Registration opens soon. We are still looking for some support, such as a sponsor for lunch on Saturday and Sunday. If you can or want to contribute something, let us know!

The Mozilla foundation has launched a new service that looks promising, which is why I am bookmarking it here. Firefox Send allows you to send files of up to 1GB (or 2.5GB if logged in) to someone else. This is the same as what services like the Dutch WeTransfer do, except Send does it with end-to-end encryption.

Files are encrypted in your browser before being sent to Mozilla’s server, where they are kept until downloaded. The decryption key is contained in the download URL. That download URL is not sent to the receiver by Mozilla; you do that yourself. Files can additionally be locked with a password, which the sender also needs to convey to the receiver through other means. Files are kept for 5 minutes, 1 or 24 hours, or 7 days, depending on your choice, and for 1 up to 100 downloads. This makes it suitable for quick shares during conference calls, for instance. Apart from the encrypted file, Mozilla only knows the IP addresses of the uploader and the downloader(s). This is unlike services like WeTransfer, where the service also has the e-mail addresses of both uploader and intended downloader, and where you depend on the service sending the receivers a confirmation with the download link first.
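Conceptually the pattern is simple: encrypt locally, hand the server only ciphertext, and keep the key in the part of the URL that browsers never send to servers (the fragment). The sketch below illustrates that idea; it is not Mozilla’s actual protocol or file format, and the URL is made up.

```python
# Conceptual sketch of the Firefox Send pattern: encrypt locally, upload only
# the ciphertext, and put the decryption key in the URL fragment, which the
# receiving server never sees. An illustration of the idea, not Mozilla's
# actual protocol.
from cryptography.fernet import Fernet

def encrypt_for_sharing(path: str, upload_id: str) -> str:
    key = Fernet.generate_key()              # stays on the client
    with open(path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    # upload(ciphertext) would go here -- the server stores ciphertext only.
    # The key travels in the URL fragment, which browsers do not send to servers.
    return f"https://send.example/download/{upload_id}/#{key.decode()}"

def decrypt_download(ciphertext: bytes, share_url: str) -> bytes:
    key = share_url.split("#", 1)[1].encode()  # recipient extracts key from the fragment
    return Fernet(key).decrypt(ciphertext)
```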


Firefox Send doesn’t send the download link to the recipient, you do

This is an improvement in terms of data protection, even if not fully watertight (nothing ever really is, especially not if you are singled out as a target by a state actor). It does satisfy the need of some of my government clients who are currently not allowed to use services like WeTransfer.

Granularity: legos, crayons, and more (photo by Emily, license: CC-BY-NC)

A client, after their previous goal of increasing the volume of open data provided, is now looking to improve data quality. One element in this is increasing the level of detail of the already published data. They asked for input on how one can approach and define granularity. I formulated some thoughts for them as input, which I am now posting here as well.

Data granularity in general is the level of detail a data set provides. This granularity can be thought of in two dimensions:
a) whether a combination of data elements in the set is presented in one field or split out into multiple fields: atomisation
b) the relative level of detail the data in a set represents: resolution

On Atomisation
Improving this type of granularity can be done by looking at the structure of a data set itself. Are there fields within a data set that can be reliably separated into two or more fields? Common examples are separating first and last names, zipcodes and cities, streets and house numbers, organisations and departments, or keyword collections (tags, themes) into single keywords. This allows for more sophisticated queries on the data, as well as more ways it can potentially be related to or combined with other data sets.
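As a small illustration of atomisation, the sketch below splits a single combined address field into separate street, house number, zipcode and city fields. The input layout and field names are hypothetical; real records need their own parsing rules and validation.

```python
# Sketch of atomisation: split a single combined address field into separate
# fields. The input layout ("street housenumber, zipcode city") is hypothetical.
import re

ADDRESS = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\S*),\s*(?P<zipcode>\d{4}\s?[A-Z]{2})\s+(?P<city>.+)$"
)

def atomise(record: dict) -> dict:
    match = ADDRESS.match(record["address"])
    if not match:
        return record  # leave records we cannot parse untouched
    atomised = {k: v for k, v in record.items() if k != "address"}
    atomised.update(match.groupdict())
    return atomised

print(atomise({"name": "Zwembad Oost", "address": "Kanaalweg 14, 3526 KL Utrecht"}))
# -> {'name': 'Zwembad Oost', 'street': 'Kanaalweg', 'number': '14',
#     'zipcode': '3526 KL', 'city': 'Utrecht'}
```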

For currently published data sets improving this type of granularity can be done by looking at the existing data structure directly, or by asking the provider of the data set if they have combined any fields into a single field when they created the dataset for publication.

This type of granularity increase changes the structure of the data but not the data itself. It improves the usability of the data, without improving the use value of the data. The data in terms of information content stays the same, but does become easier to work with.

On Resolution
Resolution can have multiple components, such as frequency of renewal, time frames represented, geographic resolution, or splitting categories into sub-categories or multilevel taxonomies. An example is how one can publish the average daily temperature in a region. Let’s assume it is currently published monthly, with one single value per day. The resolution of such a value can be increased in multiple ways: publish the average daily temperature daily instead of monthly; split the regional average into an average per sensor in that region (geographic resolution); split the single daily sensor average into hourly actual readings, or even more frequent ones. The highest resolution would be publishing individual sensor readings continuously in real time.
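The temperature example can be made concrete with a few lines of pandas, showing the same raw readings aggregated at three different resolutions. The column names are hypothetical.

```python
# Sketch of resolution levels: the same raw sensor readings aggregated three
# ways, from individual readings to a single regional daily average.
import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-03-01 10:00", "2019-03-01 11:00",
        "2019-03-01 10:00", "2019-03-01 11:00",
    ]),
    "sensor_id": ["A", "A", "B", "B"],
    "temperature": [8.1, 9.3, 7.6, 8.8],
}).set_index("timestamp")

# Highest resolution shown here: hourly values per sensor (the raw data).
per_sensor_hourly = readings

# Lower resolution: daily average per sensor (geographic detail kept).
per_sensor_daily = readings.groupby("sensor_id").resample("D")["temperature"].mean()

# Lowest resolution: one daily average for the whole region.
regional_daily = readings.resample("D")["temperature"].mean()
```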

Improving resolution can only be done in collaboration with the holder of the actual source of the data. What level of improvement can be attained is determined by:

  1. The level of granularity and frequency at which the data is currently collected by the data holder
  2. The level of granularity or aggregation at which the data is used by the data holder for their public tasks
  3. The level of granularity or aggregation at which the data meets professional standards.

Item 1 provides an absolute limit to what can be done: what isn’t collected cannot be published. Usually, however, data is not used internally in the exact form it was collected either. In terms of access to information, the practical limit to what can be published is usually the way the data is available internally for the data holder’s public tasks; internal systems and IT choices are usually shaped accordingly. Generally data holders can reliably provide data at the level of item 2, because that is what they work with themselves.

However, there are reasons why data sometimes cannot be publicly provided the same way it is available to the data holder internally. These can be reasons of privacy or common professional standards. For instance energy companies have data on energy usage per household, but in the Netherlands such data is aggregated to groups of at least 10 households before publication because of privacy concerns. National statistics agencies comply with international standards concerning how data is published for external use. Census data for instance will never be published in the way it was collected, but only at various levels of aggregation.

Discussions on the desired level of resolution need to happen in collaboration with potential re-users of the data, not just the data holders. At what point does data become useful for different or novel types of usage? When does it meet needs adequately?

Together with data holders and potential data re-users the balance needs to be struck between re-use value and considerations of e.g. privacy and professional standards.

This type of granularity increase changes the content of the data. It improves the usage value of the data as it allows new types of queries on the data, and enables more nuanced contextualisation in combination with other datasets.

This week NBC published an article exploring the source of training data sets for facial recognition. It makes the claim that we ourselves are providing, without consent, the data that may well be used to put us under surveillance.

In January IBM made a database available for research into facial recognition algorithms. The database contains some 1 million face descriptions that can be used as a training set. Called “Diversity in Faces”, its stated aim is to reduce bias in current facial recognition capabilities. Such bias is rampant, often because the data sets used in training are too small and too homogeneous compared to the global population. That stated goal seems ethically sound, but the means used to get there raise a few questions for me. Specifically, whether the means live up to the same ethical standards that IBM says it seeks to attain with the result of their work. This and the next post explore the origins of the DiF data, my presence in it, and the questions that raises for me.

What did IBM collect in “Diversity in Faces”?
Let’s look at what the data is first. Flickr is a photo sharing site, launched in 2004, that started supporting publishing photos under a Creative Commons license early on. In 2014 a team led by Bart Thomee at Yahoo, which then owned Flickr, created a database of 100 million photos and videos with any type of Creative Commons license, published in previous years on Flickr. This database is available for research purposes and is known as the ‘YFCC-100M’ dataset. It does not contain the actual photos or videos, but the static metadata for those photos and videos (URLs to the images, user IDs, geolocations, descriptions, tags, etc.) and the Creative Commons license each was released under. See the video below, published at the time:

YFCC100M: The New Data in Multimedia Research from CACM on Vimeo.

IBM used this YFCC-100M data set as a basis and selected 1 million of the photos in it to build a large collection of human faces. It does not contain the actual photos, but the metadata of each photo, plus a large range of some 200 additional attributes describing the faces in those photos, including measurements and skin tones. Where YFCC-100M was meant to train more or less any image recognition algorithm, IBM’s derivative subset focuses on faces. IBM describes the dataset in their Terms of Service as:

a list of links (URLs) of Flickr images that are publicly available under certain Creative Commons Licenses (CCLs) and that are listed on the YFCC100M dataset (List of URLs together with coding schemes aimed to provide objective measures of human faces, such as cranio-facial features, as well as subjective annotations, such as human-labeled annotation predictions of age and gender(“Coding Schemes Annotations”). The Coding Schemes Annotations are attached to each URL entry.

My photos are in IBM’s DiF
NBC, in their above-mentioned reporting on IBM’s DiF database, provide a little tool to determine whether photos you published on Flickr are in the database. I have been an intensive user of Flickr since early 2005, and have published over 25,000 photos there. A large number of those carry a Creative Commons license, BY-NC-SA, meaning that as long as you attribute me, don’t use an image commercially, and share your result under the same license, you are allowed to use my photos. As the YFCC-100M covers the years 2004-2014 and I published images in most of those years, it was likely that my photos were in it, and by extension in IBM’s DiF. Using NBC’s tool, based on my user name, it turns out 68 of my photos are in IBM’s DiF data set.

One set of photos that apparently is in IBM’s DiF covers the BlogTalk Reloaded conference in Vienna in 2006. There I took various photos of participants and speakers. The NBC tool I mentioned provides one photo from that set as an example:

Thomas Burg

My face is likely in IBM’s DiF
Although IBM doesn’t allow a public check of who is in their database, it is very likely that my face is in it. There is a half-way functional way to explore the YFCC-100M database, and DiF is derived from the YFCC-100M, so it is reasonable to assume that faces found in YFCC-100M are also to be found in IBM’s DiF. The German University of Kaiserslautern created a browser for the YFCC-100M database at the time. Judging by some tests it is far from complete in the results it shows (for instance, searching for my Flickr user name returns results that don’t contain the example image above, and the total number of results is lower than the number of my photos in IBM’s DiF). Using that same browser to search for my name, and for Flickr user names likely to have taken pictures of me during the mentioned BlogTalk conference and other conferences, shows that there is indeed a number of pictures of my face in YFCC-100M. Although the limited search of IBM’s DiF possible with NBC’s tool doesn’t return any telling results for those Flickr user names, it is therefore very likely that my face is in IBM’s DiF. I do find a number of pictures of friends and peers in IBM’s DiF that way, taken at the same time as pictures of myself.
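Since that browser is incomplete, a more reliable (if slower) check is to scan the YFCC-100M metadata files directly for a Flickr user name. The sketch below assumes the files are tab-separated and guesses at the column positions; check the dataset documentation for the actual field order.

```python
# Sketch: scan the tab-separated YFCC-100M metadata files for entries by a
# given Flickr user. The column positions used here are assumptions about the
# file layout; check the dataset documentation for the actual field order.
import csv
import sys

USER_NICKNAME_COL = 3   # assumed position of the Flickr user nickname
PAGE_URL_COL = 13       # assumed position of the photo page URL

def photos_by_user(metadata_path: str, nickname: str):
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) > PAGE_URL_COL and row[USER_NICKNAME_COL] == nickname:
                yield row[PAGE_URL_COL]

if __name__ == "__main__":
    for url in photos_by_user(sys.argv[1], sys.argv[2]):
        print(url)
```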


Photos of me in YFCC-100M

But IBM won’t tell you
IBM is disingenuous when it comes to being transparent about what is in their DiF data. Their TOS allows anyone whose Flickr images have been incorporated to request to be excluded from now on, but only if you can provide the exact URLs of the images you want excluded. That is only possible if you can verify what is in their data, yet there is no public way to do so, and only university-affiliated researchers can request access to the data by stating their research interest. Requests can be denied. Their TOS says:

3.2.4. Upon request from IBM or from any person who has rights to or is the subject of certain images, Licensee shall delete and cease use of images specified in such request.

Time to explore the questions this raises
Now that the context of this data set is clear, in a next posting we can take a closer look at the practical, legal and ethical questions this raises.

In January I wrote, pleasantly surprised, about the Province of Overijssel publishing the icon set from their house style under a Creative Commons license. I sent the Province a complimentary e-mail about it, and asked which Creative Commons license exactly was meant, as that wasn’t clear on the website. For instance, it wasn’t clear whether attribution was required, whether commercial re-use was allowed, and whether derivative works had to be licensed under the same conditions. I received a reply announcing they would make an adjustment.

To my surprise the adjustment wasn’t a clarification but a complete reversal. The Creative Commons license has disappeared, and the site now only allows use of the icons for and by the Province and their suppliers.

I sent a disappointed e-mail asking how the new choice came about. Such a mail quickly gets long, because in matters like this it’s all about the details, and every looser formulation immediately raises new questions. It was therefore nice that one of the communications team members called me this afternoon to provide some context.

Adding CC to the icons was an experiment by one employee, based on experience with earlier icons that had been available under CC. The intention was to give CC a bit more use, which for instance also makes it easier for other government bodies to re-use each other’s material. Everyone benefits from that. But with creative expressions in particular (unlike, say, data, for which national policy on CC use exists) there are more copyright aspects to take into account. Commercial re-use of someone else’s creative expressions is then, both practically and emotionally, a different step. We are talking about the Province’s house style, so do you really want those same icons to be able to pop up ‘everywhere’? The point is that other people’s expressions shouldn’t become associated with yours.

Progressive insight based on those considerations is why they walked back the original good intention. That is fine, even if the result is that, unfortunately, there is no CC license on the icon set after all. An experiment is exactly that, an experiment, and that means you may also conclude it didn’t work out.

There are of course more open, less open, and more closed forms of CC licenses. That is the whole point of CC: you selectively grant permission in advance for certain forms of re-use, so that not everyone has to ask the rights holder. From all rights reserved to some rights reserved.

It remains laudable that the communications team had, and has, the intention to work with CC. And it is very pleasant that they got in touch; talking makes things easier. Hopefully it means that at the next opportunity a CC license can be used after all.

More generally, it would help if the Ministry of the Interior (BZK), as holder of the open government and open data portfolio, and the boards of decentralised governments such as a province, gave stronger direction here. Then experiments wouldn’t be necessary, and no fear or worry would arise on the work floor about possibly unintended consequences, which is what leads to cautious retractions like this one. That caution is a normal, predictable human reaction, but you can make it unnecessary in your organisation. BZK already has a policy line that CC0 and CC-BY must be used for data publications. Open standards have been mandatory for 11 years (though few government bodies comply in practice). A single practical interpretation of copyright law for creative expressions of government bodies as well, with the logical license choices that follow from it, set by BZK and endorsed by the boards of decentralised governments, would help here. There is enough experience by now for BZK to take a norm-setting role.

After California, now the Washington State senate has adopted a data protection and privacy act that takes the EU General Data Protection Regulation (GDPR) as an example to emulate.

This is definitely a hoped-for effect of the GDPR when it was launched. European environmental and food safety standards have had a similar global norm-setting impact. This is because, for businesses, it is generally more expensive to comply with multiple standards than to comply only with the strictest one. We saw it earlier in companies taking GDPR demands and applying them to themselves across the board. That the GDPR might have this impact is an intentional part of how the EC is developing a third proposition in data geopolitics, between the surveillance capitalism of the US data lakes and the data-driven authoritarianism of China.

To me the GDPR is a quality assurance instrument, with its demands increasing over time. So it is encouraging to see other government entities outside the EU taking a cue from the GDPR. California and Washington State now have adopted similar laws. Five other States in the USA have introduced similar laws for debate in the past 2 months: Hawaii, Massachusetts, New Mexico, Rhode Island, and Maryland.

This article is a good description of the Freedom of Information (#foia #opengov #opendata) situation in the Balkans. Due to my work in the region, such as in Serbia, I recognise lots of what is described here, and I have encountered various institutions willing to use evasive action to prevent the release of information.

In essence this is not all that different from what (decentral) government entities in other European countries do as well. Many of them still see increased transparency and access as a distraction absorbing work and time they’d rather spend elsewhere. Yet there’s a qualitative difference in the level of obstruction: it’s the difference between acknowledging there is a duty to be transparent but being hesitant about it, and not believing there is such a duty in governance at all.

Secrecy, sometimes in combination with corruption, has a long and deep history. In Central Asia, for instance, I encountered a case where the number of agricultural machines wasn’t released, because a 1950s Soviet law still on the books marked it as a state secret (tractors could be mobilised in case of war). More disturbingly, such state secrecy laws are also abused to tackle political opponents in Central Asia. When a government official releases information based on a transparency regulation, or as part of policy implementation, political opponents might denounce them for giving away state secrets and take them to court, with even jail time as a risk.

There is a strong effort to increase transparency visible in the Balkan region as well, both inside government and in civil society. Excellent examples exist. But it’s an ongoing struggle between those who see power as its own purpose and those seeking high-quality governance. We’ll see steps forward, steps backward, rear-guard skirmishes and a mixed bag of results for a long time, especially where there are high levels of distrust amongst the wider population, not just towards government but towards each other.

One such excellent example is the work of the Serbian information commissioner Sabic. Clearly seeing his role as an ombudsman for the general population, he and his office led by example during the open data work I contributed to in the past years: by publishing statistics on information requests, complaints and answer times, and by publishing a full list of all Serbian institutions that fall under the remit of the Commission for Information of Public Importance and Personal Data Protection. That last point is key, as some institutions will simply stall requests by stating that transparency rules do not apply to them. Mr. Sabic’s term ended at the end of last year. A replacement hasn’t been announced yet, which is both a testament to Mr. Sabic’s independent role as information commissioner, and a sign of the risk that less transparency-inclined forces will try to install a much less independent successor.

Bookmarked Right to Know: A Beginner’s Guide to State Secrecy / Balkan Insight by Dusica Pavlovic (Balkan Insight)
Governments in the Balkans are chipping away at transparency laws to make it harder for journalists and activists to hold power to account.

SimCity 2000 (adapted from image by m01229, CC-BY)

Came across an interesting article, and by extension the techzine it was published in: Logic.
The article was about the problematic biases and assumptions in the model of urban development used in the popular game SimCity (one of those time sinks where my 10,000 hours brought me nothing 😉 ), and how that may have unintentionally (the SimCity creator just wanted a fun game) influenced how people look at the evolution of cityscapes in real life, in ways the original 1960s work the game is based on never did. The article is a fine example of cyber history / archeology.

The magazine it was published in, Logic (twitter), started in the spring of 2017 and is now reaching issue 7. Each issue has a specific theme, around which contributions are centered. Intelligence, Tech against Trump, Sex, Justice, Scale, Failure, Play, and soon China, have been the topics until now.

The zine is run by Moira Weigel, Christa Hartsock, Ben Tarnoff, and Jim Fingal.

I’ve ordered the back issues, and subscribed (though technically it is cheaper to keep ordering back-issues). They pay their contributors, which is good.


Cover for the upcoming edition on tech in China. Design (like all design for Logic) by Xiaowei R. Wang.

It obviously makes no sense to block the mail system if you disagree with some of the letters sent. The deceptive method of blocking used here, targeting the back-end servers so that mail traffic simply gets ignored, while Russian Protonmail users still seemingly can access the service, is another sign that they’d rather not let you know blocking goes on at all. This is an action against end-to-end encryption.

The obvious answer is to use more end-to-end encryption, and so increase the cost of surveillance and repression. Use my ProtonMail address as listed on the right, or contact me with PGP using my public key on the right. Other means of reaching me with end-to-end encryption are the messaging apps Signal and Threema, as well as Keybase (also listed on the right).

Bookmarked Russia blocks encrypted email provider ProtonMail (TechCrunch)
Russia has told internet providers to enforce a block against encrypted email provider ProtonMail, the company’s chief has confirmed. The block was ordered by the state Federal Security Service, formerly the KGB, according to a Russian-language blog, which obtained and published the order aft…

Aral Balkan talks about how to design tools and find ways around the big social media platforms. He calls for the design and implementation of Small Tech. I fully agree. Technology to provide us with agency needs to be not just small, but smaller than us, i.e. within the scope of control of the group of people deploying a technology or method.

My original fascination with social media, back in the ’00s when it was mostly blogs and wikis, was precisely that it was smaller than us: it put publication and sharing in the hands of all of us, allowing distributed conversations. The concentration of our interaction in the big tech platforms made social media ‘bigger than us’ again. We don’t decide what FB shows us, and breaking out of your own bubble (vital in healthy networks) becomes harder because sharing is based on pre-existing ‘friendships’ and discoverability has been removed. The erosion has been slow, but very visible. Networked Agency, to me, is only possible with small tech and small methods. It’s why I find most ‘digital transformation’ efforts disappointing, and feel we need to focus much more on human digital networks, on distributed digital transformation, based on federated small tech, networks of small tech instances, where our tools are useful on their own, and more useful in concert with others.

Aral’s posting (and blog in general) is worth a read, and as he is a coder and designer, he acts on those notions too.