This week NBC published an article exploring the source of training data sets for facial recognition. It makes the claim that we ourselves are providing, without consent, the data that may well be used to put us under surveillance.

In January IBM made a database available for research into facial recognition algorithms. The database contains some 1 million face descriptions that can be used as a training set. It is called “Diversity in Faces” (DiF), and its stated aim is to reduce bias in current facial recognition abilities. Such bias is rampant, often because the data sets used in training are too small and too homogeneous compared to the global population. That stated goal seems ethically sound, but the means used to get there raise a few questions for me, specifically whether those means live up to the same ethical standards that IBM says it seeks to attain with the result of its work. This and the next post explore the origins of the DiF data, my presence in it, and the questions it raises for me.

What did IBM collect in “Diversity in Faces”?
Let’s look at what the data is first. Flickr is a photo sharing site, launched in 2004, that supported publishing photos under a Creative Commons license from early on. In 2014 a team led by Bart Thomee at Yahoo, which then owned Flickr, created a database of 100 million photos and videos published on Flickr in previous years under any type of Creative Commons license. This database is available for research purposes and is known as the ‘YFCC-100M’ dataset. It does not contain the actual photos or videos, but the static metadata for those photos and videos (URLs to the image, user IDs, geo locations, descriptions, tags etc.) and the Creative Commons license each was released under. See the video below, published at the time:

YFCC100M: The New Data in Multimedia Research from CACM on Vimeo.

IBM used this YFCC-100M data set as a basis, and selected 1 million of the photos in it to build a large collection of human faces. It does not contain the actual photos, but the metadata of those photos, and a large range of some 200 additional attributes describing the faces in those photos, including measurements and skin tones. Where YFCC-100M was meant to train more or less any image recognition algorithm, IBM’s derivative subset focuses on faces. IBM describes the dataset in their Terms of Service as:

a list of links (URLs) of Flickr images that are publicly available under certain Creative Commons Licenses (CCLs) and that are listed on the YFCC100M dataset (List of URLs together with coding schemes aimed to provide objective measures of human faces, such as cranio-facial features, as well as subjective annotations, such as human-labeled annotation predictions of age and gender(“Coding Schemes Annotations”). The Coding Schemes Annotations are attached to each URL entry.

My photos are in IBM’s DiF
NBC, in their above-mentioned reporting on IBM’s DiF database, provides a little tool to determine if photos you published on Flickr are in the database. I have been an intensive user of Flickr since early 2005, and have published over 25,000 photos there. A large number of those carry a Creative Commons license, BY-NC-SA, meaning that as long as you attribute me, don’t use an image commercially, and share your result under the same license, you’re allowed to use my photos. As the YFCC-100M covers the years 2004-2014 and I published images for most of those years, it was likely my photos are in it, and by extension likely my photos are in IBM’s DiF. Using NBC’s tool, based on my user name, it turns out 68 of my photos are in IBM’s DiF data set.
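How NBC’s tool works internally is not described in their reporting as far as I can tell. As a rough sketch only, a lookup by user name against a tab-separated metadata listing could look like the code below. The column positions, the sample rows, and the function name `photos_by_user` are all assumptions made for illustration; they are not the actual layout of YFCC-100M or DiF.

```python
import csv
import io

# Hypothetical column positions; the real metadata layout may differ.
USER_COL = 0
URL_COL = 1


def photos_by_user(tsv_text, username, user_col=USER_COL, url_col=URL_COL):
    """Return the photo URLs whose user field matches `username` (case-insensitive)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    wanted = username.strip().lower()
    urls = []
    for row in reader:
        if len(row) <= max(user_col, url_col):
            continue  # skip malformed or short rows
        if row[user_col].strip().lower() == wanted:
            urls.append(row[url_col])
    return urls


# Made-up sample data, standing in for a metadata file.
sample = (
    "alice\thttps://example.org/photos/1.jpg\n"
    "bob\thttps://example.org/photos/2.jpg\n"
    "Alice\thttps://example.org/photos/3.jpg\n"
)

matches = photos_by_user(sample, "alice")
print(len(matches), "photo(s) found")
for url in matches:
    print(url)
```

A list of exact URLs like this is also what an exclusion request would need, which is why being able to run such a lookup against the real data matters.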

One set of photos that apparently is in IBM’s DiF covers the BlogTalk Reloaded conference in Vienna in 2006. There I took various photos of participants and speakers. The NBC tool I mentioned provides one photo from that set as an example:

My face is likely in IBM’s DiF
Although IBM doesn’t allow a public check of who is in their database, it is very likely that my face is in it. There is a half-way functional way to explore the YFCC-100M database, and DiF is derived from the YFCC-100M. It is reasonable to assume that faces that can be found in YFCC-100M are to be found in IBM’s DiF. The University of Kaiserslautern in Germany created a browser for the YFCC-100M database at the time. Judging by some tests it is far from complete in the results it shows (for instance, if I search for my Flickr user name it shows results that don’t contain the example image above, and the total number of results is lower than the number of my photos in IBM’s DiF). Using that same browser to search for my name, and for Flickr user names that are likely to have taken pictures of me during the mentioned BlogTalk conference and other conferences, shows that there are indeed a number of pictures of my face in YFCC-100M. Although the limited search in IBM’s DiF that is possible with NBC’s tool doesn’t return any telling results for those Flickr user names, it is therefore very likely that my face is in IBM’s DiF. I do find a number of pictures of friends and peers in IBM’s DiF that way, taken at the same time as pictures of myself.


Photos of me in YFCC-100M

But IBM won’t tell you
IBM is disingenuous when it comes to being transparent about what is in their DiF data. Their TOS allows anyone whose Flickr images have been incorporated to request to be excluded from now on, but only if they can provide the exact URLs of the images they want excluded. That is only possible if you can verify what is in the data, but there is no public way to do so: only university-affiliated researchers can request access to the data, by stating their research interest, and requests can be denied. Their TOS says:

3.2.4. Upon request from IBM or from any person who has rights to or is the subject of certain images, Licensee shall delete and cease use of images specified in such request.

Time to explore the questions this raises
Now that the context of this data set is clear, in the next post we can take a closer look at the practical, legal and ethical questions this raises.

Earlier this week I participated in a general workshop for the Future Workspace research consortium that I have been contributing to in the past months. The consortium is otherwise made up of the Telematica Institute, IBM, Rabobank, Royal Haskoning, CETIM, Free University of Amsterdam and Delft University of Technology.

This week’s workshop was an open invitation workshop around the use of social media in the enterprise, organized by the Telematica Institute and hosted by IBM in Amsterdam. Questions around adoption, governance, selection of tools, and integration in existing ICT architecture were discussed in a Knowledge Café format.

Before the actual discussions and conversations, a short presentation was given by Erik Krischan on how social media are currently used within the IBM intranet (showing us the intranet in Firefox, btw). A short list of things that caught my eye:

– RSS and tags are used throughout
– There seemed to be a bit of confusion between the terms tag and bookmark, which were used in part as synonyms
– It all looked very ‘portal’ like and text based
– By choice there is no single sign-on (to prevent all kinds of global architectural/integration questions)
– They link to communities of practice and people wherever that is helpful, adding human context to information
– There are rating systems
– People are shown to you in degrees of separation, and there is a recommended ‘social path’ to people
– There are experiments with visualizing social network analysis results (with opt-in crawling of your e-mail)
– New applications are only seeded with starting money, then fend for themselves to get adoption from colleagues
– BlueTwit is IBM’s behind-the-firewall Twitter-like application (next to regular IM of course; no surprise to see Luis Suarez/@elsua in that stream 🙂)
– ‘IBM Whisper’ automatically suggests people and pieces of information to you based on your use of the intranet


Erik Krischan showing IBM’s web 2.0 enabled intranet

It is clear that IBM does a lot of ‘safe-fail’ experimenting with social media style functionality and applications in their intranet environment. It is less clear to me how consolidation is organized, as that was not part of the presentation and following discussion. It seems to me to already be a real patchwork of apps (mind you, I am no stranger to patchwork), although there are also signs of integration and consolidation. But what stood out most for me is how the ‘new stuff’ is often still presented as ‘separate’.

A good example of that was how search results were presented. It had the usual search results with % of relevance. (The search term was portal, and yielded documents from 2004 and 2006 as most relevant results.) And next to it were people relevant to the search term. But then other results were not presented in terms of content or context, but in terms of channel or application. There were boxes with ‘rss results’ and ‘bookmarks found’. That is like having separate boxes for stuff that you heard on the telephone, or received through fax, or over a coffee in the hallway. For me as a person working on my tasks the information source is important, not the channel of delivery. The channel does not help me filter, authenticate, or validate. It would be helpful if all those search results were in the same list (with a hint to the channel displayed next to each: external blog, bookmarked by colleague) and subject to the same type of rating system.
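To make that concrete, here is a minimal sketch of what I mean by one list with a channel hint. It is purely illustrative and says nothing about how IBM’s search is actually built: the `Result` type, the `merge_results` function, the channel names and the sample relevance scores are all made up for this example, and it assumes every channel scores relevance on the same 0 to 1 scale.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    relevance: float  # assumed to be on the same 0.0-1.0 scale for every channel
    channel: str      # shown only as a hint, never used for grouping


def merge_results(*channel_results):
    """Flatten per-channel result lists into one list, most relevant first."""
    merged = [r for results in channel_results for r in results]
    return sorted(merged, key=lambda r: r.relevance, reverse=True)


# Made-up results for the search term 'portal'.
documents = [Result("Portal strategy 2006", 0.71, "document")]
feeds = [Result("Why portals fail", 0.88, "external blog")]
bookmarks = [Result("Portal design patterns", 0.64, "bookmarked by colleague")]

for r in merge_results(documents, feeds, bookmarks):
    print(f"{r.relevance:.0%}  {r.title}  [{r.channel}]")
```

The channel is still visible, but it no longer decides which box a result ends up in.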

So while IBM certainly has a lot of very, very cool stuff on their intranet, making quite a number of participants drool and speak of ‘information nirvana’, I think there is one fundamental barrier in the overall approach and design: the focus on individual information items. Only with that focus would you end up with separate boxes for RSS search results, bookmark search results, or search results tagged with your search term. That information focus is a legacy notion from earlier days. People don’t need ‘information nirvana’, they need more ‘flow nirvana’ that will help them do their work to the best of their professional standards. That is more likely to be achieved when you take the tasks people are trying to do, and the context and complex characteristics of their work, as a starting point rather than the distribution of ‘information items’. In that sense the mentioned ‘Whisper’ functionality is significant, and could serve as a starting point for more. Being able to create your own starting page with widgets and applets, as is possible on IBM’s intranet, is a good start too, if those widgets and apps are more functional building blocks and less separated along the lines of channels or ‘technology used under the hood to get this to you’. Because the latter seems to signify that somehow different channels are less valuable or trustworthy, whereas the channel has, or should have, nothing to do with the value of information.


General conversation round

After Erik’s presentation it was Mireille Jansma who guided us through the Knowledge Café format (and told us a little something on how she and her colleagues see the possible role of social media in ING). Good to see Mireille and Jurgen Egges again, both of whom I recently met in the context of a Cognitive Edge course and meeting with Dave Snowden. All in all a good session. Photos on Flickr.

Samuel Driessen also blogged his impressions, and spends a bit more time reflecting on the conversations in the Knowledge Café.


Continued conversations during lunch