This week NBC published an article exploring the source of training data sets for facial recognition. It makes the claim that we ourselves are providing, without consent, the data that may well be used to put us under surveillance.

In January IBM made a database available for research into facial recognition algorithms. The database contains descriptions of some 1 million faces that can be used as a training set. Called “Diversity in Faces” (DiF), its stated aim is to reduce bias in current facial recognition capabilities. Such bias is rampant, often because the data sets used in training are too small and too homogeneous compared to the global population. That stated goal seems ethically sound, but the means used to get there raise a few questions for me. Specifically, whether those means live up to the same ethical standards IBM says it seeks to attain with the results of its work. This and the next post explore the origins of the DiF data, my presence in it, and the questions it raises for me.

What did IBM collect in “Diversity in Faces”?
Let’s look at what the data is first. Flickr is a photo sharing site, launched in 2004, that supported publishing photos under a Creative Commons license from early on. In 2014 a team led by Bart Thomee at Yahoo, which then owned Flickr, created a database of 100 million photos and videos published on Flickr in the preceding years under any type of Creative Commons license. This database, known as the ‘YFCC-100M’ dataset, is available for research purposes. It does not contain the actual photos or videos, but the static metadata for them (URLs to the images, user IDs, geolocations, descriptions, tags, etc.) and the Creative Commons license each was released under. See the video below published at the time:

YFCC100M: The New Data in Multimedia Research from CACM on Vimeo.
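
To make concrete what ‘metadata, not photos’ means, here is a minimal sketch of filtering YFCC-100M-style metadata records by license. It assumes tab-separated records with one line per photo or video; the column names are my own illustrative assumption, not the dataset’s documented schema.

```python
# Hypothetical sketch: filtering YFCC-100M-style metadata records by license.
# Assumes a tab-separated file with one line per photo/video; the column
# layout below is my own assumption, not the dataset's documented schema.
import csv

FIELDS = ["photo_id", "user_id", "date_taken", "title", "description",
          "tags", "longitude", "latitude", "page_url", "download_url",
          "license_name", "license_url"]  # assumed subset of columns

def cc_by_nc_sa_records(path):
    """Yield metadata records whose Creative Commons license is BY-NC-SA."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, fieldnames=FIELDS, delimiter="\t"):
            if "by-nc-sa" in (row["license_url"] or ""):
                yield row
```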

IBM used this YFCC-100M data set as a basis, and selected 1 million of the photos in it to build a large collection of human faces. It too does not contain the actual photos, but the metadata of those photos, plus some 200 additional attributes describing the faces in them, including measurements and skin tones. Where YFCC-100M was meant to train more or less any image recognition algorithm, IBM’s derivative subset focuses on faces. IBM describes the dataset in their Terms of Service as:

a list of links (URLs) of Flickr images that are publicly available under certain Creative Commons Licenses (CCLs) and that are listed on the YFCC100M dataset (List of URLs together with coding schemes aimed to provide objective measures of human faces, such as cranio-facial features, as well as subjective annotations, such as human-labeled annotation predictions of age and gender(“Coding Schemes Annotations”). The Coding Schemes Annotations are attached to each URL entry.
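
To make that description concrete, here is a hypothetical sketch of what a single DiF entry might look like: a Flickr URL with coding-scheme annotations attached. All field names and values are my own assumptions for illustration, not IBM’s actual schema.

```python
# Hypothetical illustration of a single "Diversity in Faces" entry, based
# solely on the TOS description above: a Flickr URL plus attached
# "Coding Schemes Annotations". All field names and values are assumed.
dif_entry = {
    # Link to the publicly available, CC-licensed Flickr image (via YFCC-100M)
    "url": "https://www.flickr.com/photos/some_user/1234567890/",
    # Objective measures of the face, e.g. craniofacial ratios (assumed names)
    "craniofacial_features": {
        "intercanthal_width_ratio": 0.38,
        "nose_length_to_face_height": 0.27,
    },
    "skin_tone": 4.2,  # a value on some skin-tone typology scale (assumed)
    # Subjective, human-labelled annotation predictions
    "predicted_age": "25-35",
    "predicted_gender": "male",
}
```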

My photos are in IBM’s DiF
NBC, in their above-mentioned reporting on IBM’s DiF database, provide a little tool to determine if photos you published on Flickr are in the database. I have been an intensive user of Flickr since early 2005, and have published over 25,000 photos there. A large number of those carry a Creative Commons license, BY-NC-SA, meaning that as long as you attribute me, don’t use an image commercially, and share your result under the same license, you’re allowed to use my photos. As YFCC-100M covers the years 2004-2014 and I published images in most of those years, it was likely that my photos were in it, and by extension likely that they are in IBM’s DiF. Using NBC’s tool with my user name, it turns out 68 of my photos are in IBM’s DiF data set.
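
NBC hasn’t published how their tool works, but conceptually a user-name lookup only needs the DiF URL list: Flickr photo-page URLs contain the uploader’s user name (or NSID) in their path. A minimal sketch under that assumption:

```python
# Conceptual sketch of a user-name lookup like NBC's tool. Assumes dif_urls
# holds the Flickr photo-page URLs from DiF, each shaped like
# https://www.flickr.com/photos/<user or NSID>/<photo_id>/ — how NBC's
# tool actually works has not been published.
from urllib.parse import urlparse

def photos_by_user(dif_urls, flickr_user):
    """Return the DiF URLs whose path names the given Flickr user."""
    matches = []
    for url in dif_urls:
        parts = urlparse(url).path.strip("/").split("/")
        # Expected path segments: ["photos", "<user or NSID>", "<photo_id>"]
        if len(parts) >= 2 and parts[0] == "photos" and parts[1] == flickr_user:
            matches.append(url)
    return matches
```

For my own user name such a lookup would, going by NBC’s tool, turn up the 68 photos mentioned above.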

One set of photos that apparently is in IBM’s DiF covers the BlogTalk Reloaded conference in Vienna in 2006, where I took various photos of participants and speakers. The NBC tool I mentioned provides one photo from that set as an example:

Thomas Burg

My face is likely in IBM’s DiF
Although IBM doesn’t allow a public check of who is in their database, it is very likely that my face is in it. DiF is derived from YFCC-100M, and there is a halfway functional way to explore the latter, so it is reasonable to assume that faces found in YFCC-100M are also to be found in IBM’s DiF. The German University of Kaiserslautern created a browser for the YFCC-100M database at the time. Judging by some tests it is far from complete in the results it shows (for instance, searching for my Flickr user name returns results that don’t contain the example image above, and the total number of results is lower than the number of my photos in IBM’s DiF). Using that same browser to search for my name, and for the Flickr user names of people likely to have taken pictures of me during the mentioned BlogTalk conference and other conferences, shows that there are indeed a number of pictures of my face in YFCC-100M. Although the limited search of IBM’s DiF that NBC’s tool allows doesn’t return any telling results for those Flickr user names, it is therefore very likely that my face is in IBM’s DiF. I do find a number of pictures of friends and peers in IBM’s DiF that way, taken at the same time as pictures of myself.


Photos of me in YFCC-100M

But IBM won’t tell you
IBM is disingenuous when it comes to being transparent about what is in their DiF data. Their TOS allows anyone whose Flickr images have been incorporated to request to be excluded from now on, but only if you can provide the exact URLs of the images you want excluded. That is only possible if you can verify what is in their data, but there is no public way to do so, and only university-affiliated researchers can request access to the data by stating their research interest. Requests can be denied. Their TOS says:

3.2.4. Upon request from IBM or from any person who has rights to or is the subject of certain images, Licensee shall delete and cease use of images specified in such request.

Time to explore the questions this raises
Now that the context of this data set is clear, in the next posting we can take a closer look at the practical, legal, and ethical questions it raises.

Some of the things I found worth reading in the past few days:

  • Although this article seems to confuse regulatory separation with technological separation, it does make an attempt at formulating the geopolitical aspects of the internet and data: There May Soon Be Three Internets. America’s Won’t Necessarily Be the Best
  • Interesting, yet it basically boils down to actively exercising your ‘free will’. It assumes a blank slate for the hacking, as if I haven’t deliberately sought out information and contacts on certain topics. And then it suggests doing precisely that as a remedy. The key quote for me here is “Humans are hacked through pre-existing fears, hatreds, biases and cravings. Hackers cannot create fear or hatred out of nothing. But when they discover what people already fear and hate it is easy to push the relevant emotional buttons and provoke even greater fury. If people cannot get to know themselves by their own efforts, perhaps the same technology the hackers use can be turned around and serve to protect us. Just as your computer has an antivirus program that screens for malware, maybe we need an antivirus for the brain. Your AI sidekick will learn by experience that you have a particular weakness – whether for funny cat videos or for infuriating Trump stories – and would block them on your behalf.”: Yuval Noah Harari on the myth of freedom
  • This is an important issue, always. I recognise it from my work for the World Bank and UN agencies. Is what you’re doing actually helping, or is it shoring up authorities that don’t match your values? And are you able to recognise it and withdraw when you cross the line from the former to the latter? I’ve known entrepreneurs who kept a client blacklist of sectors, governments and companies, but often it isn’t that clear cut. I’ve avoided engagements in various countries over the years, but every client engagement can be rationalised: How McKinsey Has Helped Raise the Stature of Authoritarian Governments, and when the consequences come back to bite, Malaysia files charges against Goldman Sachs
  • This seems like a useful list to check for next books to read. I definitely enjoyed reading the work of Chimamanda Ngozi Adichie and Nnedi Okorafor last year: My year of reading African women, by Gary Younge

Some things I thought worth reading in the past few days

  • A good read on how machine learning (ML) currently merely obfuscates human bias, by moving it into the training data and the code, to arrive at peace of mind through pretend objectivity. By claiming that it’s ‘the algorithm deciding’, ML becomes a kind of digital alchemy. Introduced me to some fun terms, like fauxtomation and Potemkin AI: Plausible Disavowal – Why pretend that machines can be creative?
  • These new Google patents show how problematic the current smart home efforts are, including their precursors, the Alexa and Echo microphones in your house. They strip you of agency rather than providing it. These particular patents also nudge you to treat your children much the way surveillance capitalism treats you: as a suspect to be watched, with relationships denuded of the subtle human capability to trust. Agency only comes from being in full control of your tools. Adding someone else’s tools (here not just Google’s but your health insurer’s, your landlord’s, etc.) to your home doesn’t make it smart, but turns it into a self-censorship-promoting escape room. A fractal of the panopticon. We need to start designing more technology based on distributed use, not on a centralised controller: Google’s New Patents Aim to Make Your Home a Data Mine
  • An excellent article by the NYT about Facebook’s slide to the dark side: when the student dorm room excuse “we didn’t realise, we messed up, but we’ll fix it for the future” defence fails, you weaponise your own data-driven machine against your critics, thus proving them right. Weaponising your own platform isn’t surprising, but it is very sobering and telling. Will it be a tipping point in how the public views FB? Delay, Deny and Deflect: How Facebook’s Leaders Fought Through Crisis
  • Some of these takeaways from the article just mentioned we should keep top of mind when interacting with or talking about Facebook: FB knew very early on about being used to influence the 2016 US election and chose not to act. FB feared backlash from specific user groups and opted to unevenly enforce their terms of service/community guidelines. Cambridge Analytica is not an isolated abuse, but a concrete example of the wider issue. FB weaponised their own platform to oppose criticism: How Facebook Wrestled With Scandal: 6 Key Takeaways From The Times’s Investigation
  • There really is no plausible deniability for FB’s execs regarding their “in-house fake news shop”: Facebook’s Top Brass Say They Knew Nothing About Definers. Don’t Believe Them. So when you do need to admit it, you fall back on the ‘we messed up, we’ll do better going forward’ tactic.
  • As Aral Balkan says, that’s the real issue at hand because “Cambridge Analytica and Facebook have the same business model. If Cambridge Analytica can sway elections and referenda with a relatively small subset of Facebook’s data, imagine what Facebook can and does do with the full set.”: We were warned about Cambridge Analytica. Why didn’t we listen?
  • [update] Apparently all the commotion is causing Zuckerberg to think FB is ‘at war‘, with everyone it seems, which is problematic for a company whose mission is to open up and connect the world, and which is based on a perception of trust. A bunker mentality probably also doesn’t bode well for FB’s corporate culture and hence its future: Facebook At War.

Some links I thought worth reading the past few days

  • Peter Rukavina pointed me to this excellent posting on voting, in the context of violence as a state monopoly and how a vote contributes to that violence. It’s this type of long-form blogging that I often find so valuable, as it shows you the detailed reasoning of the author. Where on FB or Twitter would you find such argumentation, and how would it ever surface in an algorithmic timeline? Added Edward Hasbrouck to my feedreader: The Practical Nomad blog: To vote, or not to vote?
  • This quote is very interesting. Earlier in the conversation Stephen Downes mentions “networks are grown, not constructed” (true for communities too). Tanya Dorey adds how from the perspective of indigenous or other marginalised groups ‘facts’ may be different, and that arriving at truth therefore is a process: “For me, “truth growing” needs to involve systems, opportunities, communities, networks, etc. that cause critical engagement with ideas, beliefs and ways of thinking that are foreign, perhaps even contrary to our own. And not just on the content level, but embedded within the fabric of the system et al itself.”: A conversation during EL30.mooc.ca on truth, data, networks and graphs.
  • This article has a ‘but’ title, but it actually is a ‘yes, and’. Saying ethics isn’t enough because we also need “A society-wide debate on values and on how we want to live in the digital age” is saying the same thing. The real money quote though is “political parties should be able to review technology through the lens of their specific world-views and formulate political positions accordingly. A party that has no position on how their values relate to digital technology or the environment cannot be expected to develop any useful agenda for the challenges we are facing in the 21st century.”: Gartner calls Digital Ethics a strategic trend for 2019 – but ethics are not enough
  • A Dutch essay on post-truth. It says it’s not the end of truth that’s at issue, but rather that everyone claims truth for themselves. It pits Foucault’s parrhesia, speaking truth to power, against the populists: Waarheidsspreken in tijden van ‘post-truth’: Foucault, ‘parrèsia’ en populisme
  • When talking about networked agency, and specifically resilience, addressing infrastructure dependencies becomes increasingly important. When you run decentralised tools so that your instance is still useful when others are down, then all of a sudden your ISP and energy supplier are a potential risk too: disaster.radio | a disaster-resilient communications network powered by the sun
  • On the amplification of hate speech. To me it’s not about the speech, but about the amplification, the societal acceptability it signals, and the illusion of being mainstream it creates: Opinion | I Thought the Web Would Stop Hate, Not Spread It
  • One of the essential elements of the EU GDPR is that it applies to anyone holding data about EU citizens. As such it can set a de facto standard globally. As with environmental standards, market players will tend to use one standard for their products, not multiple, and so the most stringent one tops the list. It’s an element in how data is of geopolitical importance these days. This link is an example of how GDPR is being adopted in South Africa: Four essential pillars of GDPR compliance
  • A great story about how open source tools played a key role in dealing with the Sierra Leone Ebola crisis a few years ago: How Open Source Software Helped End Ebola – iDT Labs – Medium
  • This seems like a platform of groups working towards their own networked agency, solving issues for their own context and then pushing them into the network: GIG – we are what we create together
  • An article on the limits on current AI, and the elusiveness of meaning: Opinion | Artificial Intelligence Hits the Barrier of Meaning

Some links I think worth reading today.

In an open letter (PDF) a range of institutions call upon their respective European governments to create ELLIS, the European Lab for Learning and Intelligent Systems. It’s an effort to guard against brain drain, and instead attract top talent to Europe. It points to Europe’s currently weak position in AI between what is happening in the USA and in China, adding a geopolitical dimension. The letter calls not so much for an institution with a large headcount, but for a commitment to long-term funding to attract and keep the right people. These are similar reasons to those that led to the founding of CERN, now a global centre for physics (and a key driver of things like open access to research and open research data), and more recently the European Molecular Biology Laboratory.

At the core the signatories see France and Germany as most likely to act to start this intergovernmental initiative. This nicely builds upon French president Macron’s announcement in late March to invest heavily in AI, and to keep and attract the right people for it. He too clearly sees the European dimension to this, even putting European and Enlightenment values at its core, although he acted within his primary scope of agency, France itself.

(via this Guardian article)

Some links I think worth reading today.

Data, especially lots of it, is the feedstock of machine learning and algorithms, and there’s a race on for who will lead in these fields. This gives data a geopolitical dimension, and makes it a key strategic resource of nations. Between the vast data lakes in corporate silos in the US, and the national data spaces geared towards data-driven authoritarianism as in China, what is the European answer, what is the proposition Europe can make to the world? Ethics-based AI. “Enlightenment Inside”.

Last month French President Macron announced spending €1.5 billion on AI in the coming years. Wired published an interview with Macron. Below is an extended quote of what I think are the key statements.

AI will raise a lot of issues in ethics, in politics, it will question our democracy and our collective preferences… It could totally dismantle our national cohesion and the way we live together. This leads me to the conclusion that this huge technological revolution is in fact a political revolution… Europe has not exactly the same collective preferences as US or China. If we want to defend our way to deal with privacy, our collective preference for individual freedom versus technological progress, integrity of human beings and human DNA, if you want to manage your own choice of society, your choice of civilization, you have to be able to be an acting part of this AI revolution. That’s the condition of having a say in designing and defining the rules of AI. That is one of the main reasons why I want to be part of this revolution and even to be one of its leaders. I want to frame the discussion at a global scale… The key driver should not only be technological progress, but human progress. This is a huge issue. I do believe that Europe is a place where we are able to assert collective preferences and articulate them with universal values.

Macron’s actions are largely based on the report by French MP and Fields Medal-winning mathematician Cédric Villani, For a Meaningful Artificial Intelligence (PDF).