Early last year I wrote about how I don’t track you here, but others might. Third party sites whose content I re-use here by embedding them have the ability to track you to a certain extent. Earlier I already stopped using Slideshare and Scribd completely as a consequence, self-hosting my slide decks from now on.

For photos and videos the story is slightly different. Where it’s not essential that a video can be viewed inside my posting, I simply link to it with a screenshot, so that YouTube or Vimeo cannot track you on my page. In other cases I still embed the video.

For images I have been using Flickr since 2005. Back then uploading images to my hosting account quickly depleted the available storage space, and Flickr always was a good way to avoid that. I have been and still am a paying customer of Flickr, even through the years it was also available for free. Flickr is my online third place storage of images (now over 26k), as well as the place where I share those images for others to freely re-use (under Creative Commons licenses).

Embedding my Flickr photos here gives Flickr the opportunity to track views of the embedded images. The 2005 scarcity in storage space on my web host package is no longer a concern, whereas reducing readers’ exposure to tracking in whatever shape has become more important.

So from the start of the summer vacation I have stopped using Flickr embeds, and all images are and will be hosted on my webserver. The images do link to their counterparts on Flickr. For my own images the link points to re-usable versions of the photo and to the rest of my images. For other people’s images that I re-use, it points to the source and its author. As before I will keep using Flickr to store and share photos.
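As an illustration only, the pattern of a locally hosted image that links back to its Flickr page could be sketched as below. The function name, the local path and the Flickr URL in the usage note are hypothetical, and this is not necessarily how the markup on this blog is produced.

```python
from html import escape


def local_image_link(local_src, flickr_page_url, alt_text):
    """Build HTML for a locally hosted image that links to its Flickr page.

    The image itself is loaded from the blog's own server, so viewing the
    post sends no request to Flickr. Only following the link does.
    """
    return '<a href="{href}"><img src="{src}" alt="{alt}"></a>'.format(
        href=escape(flickr_page_url, quote=True),
        src=escape(local_src, quote=True),
        alt=escape(alt_text, quote=True),
    )
```

Calling local_image_link("/uploads/photo.jpg", "https://www.flickr.com/photos/example/123/", "Conference hall") returns a single anchor element wrapping the image, with all three values HTML-escaped.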

Over the almost two decades of blogging I’ve embedded hundreds of images from Flickr, and I haven’t replaced those yet. Over time I will. It will become part of my daily routine of checking old postings made on the same day as today.

It makes ‘I don’t track you (but others here might)’ tilt some more towards ‘I don’t track you’ period.

Last week Flickr, under its new ownership, migrated their backend from Yahoo’s datacenters onto Amazon’s AWS. While reading through some old blog posts, I noticed that in the older postings Flickr photos were no longer shown, and were replaced with a browser error message.

It turns out that the player that existed at some point for showing Flickr images as an iframe in posts is now no longer in use. It probably lived somewhere on Yahoo’s servers, and didn’t come along to AWS.

More recently embedded Flickr images are fine as they are based on the actual URI of a Flickr image, which hasn’t changed. It looks like postings from 2011-2014 are the ones hit.

A quick search in the database of this blog for posts that contain ‘iframe’ and ‘flickr’ shows that at most 19 postings (out of 1657) are affected. From the list I recognise some that have a different iframe (e.g. a video or audio fragment) and a separate Flickr reference. I’ll add it to the list of things to fix, and do it manually for the dozen or so where a correction is needed.
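The distinction between posts with a Flickr iframe and posts that merely contain both words could be sketched as follows. This is a minimal sketch, assuming the post bodies are available as HTML strings. The names IframeSrcCollector and classify_post and the three labels are hypothetical, and it is not the search that was actually run on the blog's database.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every iframe in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "iframe":
            self.sources.append(dict(attrs).get("src") or "")


def classify_post(html):
    """Return 'broken-candidate', 'false-positive' or 'unaffected'."""
    lowered = html.lower()
    # Same coarse filter as a plain text search: both words must appear.
    if "iframe" not in lowered or "flickr" not in lowered:
        return "unaffected"
    collector = IframeSrcCollector()
    collector.feed(html)
    # Only an iframe that itself points at Flickr can be the old player.
    if any("flickr" in src.lower() for src in collector.sources):
        return "broken-candidate"
    # Otherwise: e.g. a video iframe plus a separate Flickr reference.
    return "false-positive"
```

Posts labelled 'false-positive' match the text search but have an iframe pointing elsewhere, so only the 'broken-candidate' ones would need a manual look.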

This week NBC published an article exploring the source of training data sets for facial recognition. It makes the claim that we ourselves are providing, without consent, the data that may well be used to put us under surveillance.

In January IBM made a database available for research into facial recognition algorithms. The database contains some 1 million face descriptions that can be used as a training set. Called “Diversity in Faces”, its stated aim is to reduce bias in current facial recognition abilities. Such bias is rampant, often because the data sets used in training are too small and too homogeneous (compared to the global population). That stated goal is ethically sound it seems, but the means used to get there raise a few questions for me. Specifically, whether the means live up to the same ethical standards that IBM says it seeks to attain with the result of their work. This and the next post explore the origins of the DiF data, my presence in it, and the questions it raises for me.

What did IBM collect in “Diversity in Faces”?
Let’s look at what the data is first. Flickr is a photo sharing site, launched in 2004, that started supporting publishing photos with a Creative Commons license from early on. In 2014 a team led by Bart Thomee at Yahoo, which then owned Flickr, created a database of 100 million photos and videos with any type of Creative Commons license published in previous years on Flickr. This database is available for research purposes and known as the ‘YFCC-100M’ dataset. It does not contain the actual photos or videos per se, but the static metadata for those photos and videos (URLs to the image, user IDs, geo locations, descriptions, tags etc.) and the Creative Commons license each was released under. See the video below published at the time:

YFCC100M: The New Data in Multimedia Research from CACM on Vimeo.

IBM used this YFCC-100M data set as a basis, and selected 1 million of the photos in it to build a large collection of human faces. It does not contain the actual photos, but the metadata of those photos, and a large range of some 200 additional attributes describing the faces in those photos, including measurements and skin tones. Where YFCC-100M was meant to train more or less any image recognition algorithm, IBM’s derivative subset focuses on faces. IBM describes the dataset in their Terms of Service as:

a list of links (URLs) of Flickr images that are publicly available under certain Creative Commons Licenses (CCLs) and that are listed on the YFCC100M dataset (List of URLs together with coding schemes aimed to provide objective measures of human faces, such as cranio-facial features, as well as subjective annotations, such as human-labeled annotation predictions of age and gender(“Coding Schemes Annotations”). The Coding Schemes Annotations are attached to each URL entry.

My photos are in IBM’s DiF
NBC, in their above mentioned reporting on IBM’s DiF database, provide a little tool to determine if photos you published on Flickr are in the database. I have been an intensive user of Flickr since early 2005, and have published over 25.000 photos there. A large number of those carry a Creative Commons license, BY-NC-SA, meaning that as long as you attribute me, don’t use an image commercially and share your result under the same license, you’re allowed to use my photos. As the YFCC-100M covers the years 2004-2014 and I published images for most of those years, it was likely that my photos would be in it, and by extension likely that my photos would be in IBM’s DiF. Using NBC’s tool, based on my user name, it turns out 68 of my photos are in IBM’s DiF data set.

One set of photos that apparently is in IBM’s DiF covers the BlogTalk Reloaded conference in Vienna in 2006. There I took various photos of participants and speakers. The NBC tool I mentioned provides one photo from that set as an example:

My face is likely in IBM’s DiF
Although IBM doesn’t allow a public check of who is in their database, it is very likely that my face is in it. There is a half-way functional way to explore the YFCC-100M database, and DiF is derived from the YFCC-100M. It is reasonable to assume that faces that can be found in YFCC-100M are to be found in IBM’s DiF. The University of Kaiserslautern in Germany created a browser for the YFCC-100M database at the time. Judging by some tests it is far from complete in the results it shows (for instance, if I search for my Flickr user name it shows results that don’t contain the example image above, and the total number of results is lower than the number of my photos in IBM’s DiF). Using that same browser to search for my name, and for Flickr user names that are likely to have taken pictures of me during the mentioned BlogTalk conference and other conferences, shows that there are indeed a number of pictures of my face in YFCC-100M. The limited search in IBM’s DiF that is possible with NBC’s tool doesn’t return any telling results for those Flickr user names, but it is still very likely that my face is in IBM’s DiF. I do find a number of pictures of friends and peers in IBM’s DiF that way, taken at the same time as pictures of myself.

Photos of me in YFCC-100M

But IBM won’t tell you
IBM is disingenuous when it comes to being transparent about what is in their DiF data. Their TOS allows anyone whose Flickr images have been incorporated to request to be excluded from now on, but only if you can provide the exact URLs of the images you want excluded. That is only possible if you can verify what is in their data, but there is no public way to do so, and only university affiliated researchers can request access to the data by stating their research interest. Requests can be denied. Their TOS says:

3.2.4. Upon request from IBM or from any person who has rights to or is the subject of certain images, Licensee shall delete and cease use of images specified in such request.

Time to explore the questions this raises
Now that the context of this data set is clear, in a next posting we can take a closer look at the practical, legal and ethical questions this raises.

This is good news from Flickr. Flickr is amending their changes, to ensure that Creative Commons licensed photos will not be deleted from free accounts that are over their limit. (via Jeremy Cherfas)

Flickr recently announced they would be deleting the oldest photos of free accounts that have more photos than the new limit of 1000 images. This caused concern as some of those free accounts might be old, disused accounts, where there are images with open licenses that are being used elsewhere. Flickr allows search for images with open licenses, and makes it easy to embed those in your own online material. Removing old images might therefore break things, and there were many people calling for Flickr to try and prevent breaking things. And they are: Flickr is providing all public institutions and archives publishing photos to Flickr with a free Pro account, and will also not delete any Creative Commons licensed images that carried that license before 1 November 2018. (This prevents you from keeping an unlimited free account by simply relicensing the photos, and uploading new photos with CC licenses only.)

At the end of this month my Flickr Pro account will come up for renewal. It turns out that they doubled the price last summer (from 25 to 50 USD/year). Recently Flickr also announced that free accounts will be limited starting January 1st, 2019. The new limit will be at 1000 photos, and accounts with more images will see their oldest images deleted.

I have been a Flickr Pro member since early 2005, and store some 25.000 photos there, making up 75GB. I took a paid account in 2005 because back then they had a 200 photo cap for free accounts, and I easily reached that limit.

With a rate change like this it is a good time to evaluate whether the service is still good enough for me to keep at the new rate.

How do I use Flickr?

  • It’s an off-site back-up of photos
  • I use it to find Creative Commons licensed material for my presentations
  • I publish photos under such a license myself, to enable others to use them
  • I embed Flickr photos in my blogposts, so I do not have to store them on my hosting account (which has much less storage, 3GB)
  • I use it to quickly find things back in my own photos, through its album structure and search. “Don’t I have a picture of that building from when I visited that conference in Copenhagen a few years ago?”

So if I wanted to replace Flickr, e.g. by bringing it home to something more under my control, what would that need to look like?

  • For the off-site storage I could easily find cheaper alternatives, in fact I already run several of them where there’s still over 50TB of total storage available.
  • Finding CC images on Flickr is still possible if you’re not a registered member, but then I miss having photos by myself and by those I’m connected to shown first. I have a preference for using photos from my network.
  • Contributing CC images is important to me, also as I feel reciprocity is important, as I do use CC images by others a lot too. I don’t know of any other place where I could add CC licenses to my photos that casually. I have seen places where you’d pool your curated images under CC but that is additional work. Part of the utility is to automatically add CC licenses to everything I store online. Maybe some of you know an alternative?
  • Embedding photos easily at various formats (using HTML only, as I strip out the javascript stuff Flickr also provides) is something I currently have no alternative for. Probably it would mean exposing a replacement storage to the public, but I am not sure how to replace resizing on the fly. I could also try and do what Peter did, replacing all currently embedded photos on my blog, the photos I made at least, with locally hosted ones. It would solve this for the past, but not for the future.
  • Search replacement, like embedding replacement, would depend on having public storage, and would require keeping the album structure, added titles, geo-locations etc. That metadata (80% of my photos have tags and geotags) was all added manually during upload; the geotags were mostly added manually, some automatically.

The ‘cost of leaving’ is mostly sunk effort, like added titles, tags and locations. So even if it feels different, that is not a rational consideration to keep an account. Especially not as you can download all Flickr material including that metadata, so it is more about how you would make that metadata useful in a new set-up. The decision to make is whether I want to find and set up a workable alternative in the coming three weeks to save 100 USD, or whether I buy myself 2 years of time with those 100 USD.

If you have left Flickr in the past few years, what does your current workflow around photos look like?

Many tech companies are rushing to arrange compliance with GDPR, Europe’s new data protection regulation. What I have seen landing in my inbox thus far is not encouraging. Like Facebook, other platforms clearly struggle with, or hope to get away with, partially or completely ignoring the concepts of informed consent, unforced consent and proving consent. One would suspect the latter, as Facebook’s removal of 1.5 billion users from EU jurisdiction is a clear step to reduce potential exposure.

Where consent by the data subject is the basis for data collection: Informed consent means consent needs to be explicitly given for each specific use of person-related data, based on an explanation, clear to laymen, of the reason for collecting the data and how precisely it will be used.
Unforced means consent cannot be tied to core services of the controlling/processing company when that data isn’t necessary to perform a service. In other words “if you don’t like it, delete your account” is forced consent. Otherwise, the right to revoke one or several consents given becomes impossible.
Additionally, a company needs to be able to show that consent has been given, where consent is claimed as the basis for data collection.

Instead I got this email from Twitter earlier today:

“We encourage you to read both documents in full, and to contact us as described in our Privacy Policy if you have questions.”

and then

followed by

You can also choose to deactivate your Twitter account.

The first two bits mean consent is not informed and that it’s not even explicit consent, but merely assumed consent. The last bit means it is forced. On top of it Twitter will not be able to show consent was given (as it is merely assumed from using their service). That’s not how this is meant to work. Non-compliant in other words. (IANAL though)