Bookmarked Dust Rising: Machine learning and the ontology of the real (by David Weinberger)

I am looking forward to reading this. Will need to put aside some time to be able to really focus, given the author, and the amount of time taken to write it.

…an article I worked on for a couple of years. It’s only 2,200 words, but they were hard words to find because the ideas were, and are, hard for me. … The article argues, roughly, that the sorts of generalizations that machine learning models embody are very different from the sort of generalizations the West has taken as the truths that matter.

David Weinberger

Notes on reading Novacene by James Lovelock 2019

Definition of life: entities that reduce entropy, as they organise their environment

I knew his 1970s Gaia Theory, but remembered it mostly as a type of systems thinking and seeing earth as a complex system. But he adds something key:

In earth’s case the purpose of the system is to keep earth cool, to keep temperatures at 15C average. And do so as our sun slowly heats up.

A startling assumption to me is that earth really is not in the Goldilocks zone, but Mars is. We would be like Venus, hot, if not for the entropy reducing earth life. That life continuously draws down heat.
Furthermore, to an alien observer earth would not look cool but much hotter, because it is continuously dumping solar heat.

The sun is heating up, and therefore so is earth. Keeping cool is our prime directive. The climate crisis is making it worse, and burning fossil fuels (stored heat from the past) should stop.

The Anthropocene started with the steam engine, when humans could influence their environment on a global scale. The Novacene is the coming age of AI.
The optimal temperature ranges for electronics and for life are similar, and life and AI have the same hard upper temperature limit of 47C.
Above it we will have a runaway process to becoming like Venus.

AI will not deliberately kill us because it needs the world to stay cool under a heating sun. Carbon based life is needed for it. They will supplant us by evolution, slow not sudden, as evolution moves beyond us, as it always would.

Interesting notion: AI might become 1M times faster than us, but they are bound by the same physics as us. It means e.g. their travel will be at roughly the same speed.
Which will feel 1M times slower, and 1M times more boring, to AI than to us.
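To get a feel for that factor, here is a rough back-of-the-envelope sketch. The 10-hour flight is my own illustrative assumption, not an example from the book; only the factor of 1M comes from the notes above.

```python
# Rough arithmetic: how long would a trip feel to an entity that
# thinks a million times faster than we do?
SPEEDUP = 1_000_000            # "1M times faster", from the notes above
FLIGHT_HOURS = 10              # assumed example: a long-haul flight
HOURS_PER_YEAR = 24 * 365.25   # average year, including leap years

subjective_hours = FLIGHT_HOURS * SPEEDUP
subjective_years = subjective_hours / HOURS_PER_YEAR

print(f"A {FLIGHT_HOURS}-hour flight would feel like "
      f"about {subjective_years:,.0f} years")
```

Under these assumptions a single long-haul flight would be experienced as more than a millennium of waiting.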

Makes a good caveat: AI would need to start its evolution from ‘good’ beginnings. E.g. not from autonomous weapons platforms.
Yet precisely in civic tech such as aviation we put hard constraints on AI. But we do not on military AI, making it more likely it will evolve from there.

My takeaway from this is how to use AI for civic tech, and set it free as it were, with a sense of communal values. Including with a sense of the Prime Directive to keep cool.

That I think is a core flaw in Lovelock’s reasoning. Yes, the PD is to keep cool. Not only because of our self-created heating, but mostly because of the sun heating up.
But how many humans are aware of this, and of those how many care enough to act, given that the sun’s heating plays out over millions of years?
How will we make AI aware, and will they care where we do not, given that their relative timescale is even up to a million times longer?

He stresses the notion of the engineer and artisanal engineering. Where knowing how to make things work is a priori more important than knowing why it works.
This also ties into his notion that intuition is key for engineering, and that the scientific method of standing on the shoulders of others is more suited to the ‘know why’.

Some of my takeaways:

  • When increasing the abundance of life is good to keep cool, greening your urban living environment makes sense on a deeper level than just cooling the city.
    Also as cities are an efficient way to house us humans at our current numbers.
  • How to use ML for civic tech, for networked agency
  • How to explore ML, what it currently does, what it can do, areas of issues it could be used in.
  • What autonomous things would be valuable in the home, neighbourhood, city.
  • What would an “AI in the wall” be like?

This week NBC published an article exploring the source of training data sets for facial recognition. It makes the claim that we ourselves are providing, without consent, the data that may well be used to put us under surveillance.

In January IBM made a database available for research into facial recognition algorithms. The database contains some 1 million face descriptions that can be used as a training set. Called “Diversity in Faces”, its stated aim is to reduce bias in current facial recognition abilities. Such bias is rampant, often because the data sets used in training are too small and too homogeneous (compared to the global population). That stated goal seems ethically sound, but the means used to get there raise a few questions for me. Specifically, whether the means live up to the same ethical standards that IBM says it seeks to attain with the result of their work. This and the next post explore the origins of the DiF data, my presence in it, and the questions it raises for me.

What did IBM collect in “Diversity in Faces”?
Let’s look at what the data is first. Flickr is a photo sharing site, launched in 2004, that supported publishing photos under a Creative Commons license from early on. In 2014 a team led by Bart Thomee at Yahoo, which then owned Flickr, created a database of 100 million photos and videos published on Flickr in previous years under any type of Creative Commons license. This database is available for research purposes and known as the ‘YFCC-100M’ dataset. It does not contain the actual photos or videos per se, but the static metadata for those photos and videos (URLs to the image, user IDs, geo locations, descriptions, tags etc.) and the Creative Commons license each was released under. See the video below published at the time:

YFCC100M: The New Data in Multimedia Research from CACM on Vimeo.

IBM used this YFCC-100M data set as a basis, and selected 1 million of the photos in it to build a large collection of human faces. It does not contain the actual photos, but the metadata of those photos, and some 200 additional attributes describing the faces in them, including measurements and skin tones. Where YFCC-100M was meant to train more or less any image recognition algorithm, IBM’s derivative subset focuses on faces. IBM describes the dataset in their Terms of Service as:

a list of links (URLs) of Flickr images that are publicly available under certain Creative Commons Licenses (CCLs) and that are listed on the YFCC100M dataset (List of URLs together with coding schemes aimed to provide objective measures of human faces, such as cranio-facial features, as well as subjective annotations, such as human-labeled annotation predictions of age and gender(“Coding Schemes Annotations”). The Coding Schemes Annotations are attached to each URL entry.

My photos are in IBM’s DiF
NBC, in their above mentioned reporting on IBM’s DiF database, provide a little tool to determine if photos you published on Flickr are in the database. I have been an intensive user of Flickr since early 2005, and have published over 25,000 photos there. A large number of those carry a Creative Commons license, BY-NC-SA, meaning that you’re allowed to use my photos as long as you attribute me, don’t use an image commercially, and share your result under the same license. As the YFCC-100M covers the years 2004-2014 and I published images for most of those years, it was likely my photos are in it, and by extension likely my photos are in IBM’s DiF. Using NBC’s tool, based on my user name, it turns out 68 of my photos are in IBM’s DiF data set.

One set of photos that apparently is in IBM’s DiF covers the BlogTalk Reloaded conference in Vienna in 2006. There I took various photos of participants and speakers. The NBC tool I mentioned provides one photo from that set as an example:

My face is likely in IBM’s DiF
Although IBM doesn’t allow a public check of who is in their database, it is very likely that my face is in it. There is a half-way functional way to explore the YFCC-100M database, and DiF is derived from the YFCC-100M. It is reasonable to assume that faces that can be found in YFCC-100M are to be found in IBM’s DiF. The German university of Kaiserslautern at the time created a browser for the YFCC-100M database. Judging by some tests it is far from complete in the results it shows (for instance if I search for my Flickr user name it shows results that don’t contain the example image above, and the total number of results is lower than the number of my photos in IBM’s DiF). Using that same browser to search for my name, and for Flickr user names that are likely to have taken pictures of me during the mentioned BlogTalk conference and other conferences, shows that there are indeed a number of pictures of my face in YFCC-100M. The limited search in IBM’s DiF that NBC’s tool makes possible doesn’t return any telling results for those Flickr user names, but it is nevertheless very likely my face is in IBM’s DiF. I do find a number of pictures of friends and peers in IBM’s DiF that way, taken at the same time as pictures of myself.


Photos of me in YFCC-100M

But IBM won’t tell you
IBM is disingenuous when it comes to being transparent about what is in their DiF data. Their TOS allows anyone whose Flickr images have been incorporated to request to be excluded from now on, but only if you can provide the exact URLs of the images you want excluded. That is only possible if you can verify what is in their data, but there is no public way to do so, and only university affiliated researchers can request access to the data by stating their research interest. Requests can be denied. Their TOS says:

3.2.4. Upon request from IBM or from any person who has rights to or is the subject of certain images, Licensee shall delete and cease use of images specified in such request.
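To make concrete what “providing the exact URLs” would involve if you did have access to the metadata, here is a minimal sketch. It assumes a tab-separated export with a header row; the column names (`user_nickname`, `photo_url`), the sample rows and the function name are hypothetical and are not the actual layout of YFCC-100M or DiF.

```python
import csv
import io

def photo_urls_for_user(tsv_file, nickname):
    """Return the photo URLs in a tab-separated metadata file
    that belong to the given Flickr user name.

    Assumes hypothetical columns 'user_nickname' and 'photo_url'.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row["photo_url"] for row in reader
            if row["user_nickname"] == nickname]

# Made-up sample data standing in for a real metadata export.
sample = io.StringIO(
    "photo_id\tuser_nickname\tphoto_url\tlicense\n"
    "1\texample_user\thttps://example.org/photos/1\tBY-NC-SA\n"
    "2\tsomeone_else\thttps://example.org/photos/2\tBY\n"
    "3\texample_user\thttps://example.org/photos/3\tBY-NC-SA\n"
)

for url in photo_urls_for_user(sample, "example_user"):
    print(url)
```

The filtering itself is trivial. The hurdle is entirely in getting at the data: without access to the metadata nobody can assemble the list of URLs the opt-out procedure asks for.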

Time to explore the questions this raises
Now that the context of this data set is clear, in a next posting we can take a closer look at the practical, legal and ethical questions this raises.

Some links I thought worth reading the past few days