Favorited AI Policy and Human.json by Claudine Chionh
Favorited Adding human.json to WordPress by Terence Eden

Claudine Chionh and Terence Eden both mention human.json, a data file listing people and sites whose content you know is written by humans, as opposed to generated by AI. A rekindling of FOAF?

In these days of needing to assume anything you encounter is machine generated unless proven to be human made, we continuously have to apply a Reverse Turing test: do I have enough indications to assume something was created by a human?

When I first wrote a Reverse Turing page I mentioned much the same things as Terence Eden does about vouching for other people to be human authors.

Not sure if having a machine readable file makes the right point here though, ironic as it is. Blogrolls and webrings come to mind too, because Long Live the Author.

One element I think we’d need to contemplate is to not just list people, but also provide URIs to some supporting evidence, exposing the depth of a connection. Having only met at a vouching party countersigning each other’s credentials, or two decades of in-person and online encounters with proof thereof, differ in depth and quality, and may well impact how the Reverse Turing test turns out for others perusing your human.json file.
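To make that concrete: human.json has no fixed schema as far as I know, so every field name below is an assumption for illustration only. A vouch with evidence URIs might look something like this:

```python
import json

# A hypothetical human.json entry with evidence URIs; the format is not
# standardised, so all field names here are assumptions for illustration.
human_json = {
    "vouches": [
        {
            "name": "Example Blogger",          # hypothetical person
            "url": "https://example.org/",      # their site
            "human": True,
            "evidence": [                        # URIs backing the claim
                "https://example.org/we-met-at-indiewebcamp",
                "https://mysite.example/photos/with-example-blogger",
            ],
            "depth": "two decades of in-person and online encounters",
        }
    ]
}

print(json.dumps(human_json, indent=2))
```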

My website is now part of the web archive of the Dutch Royal Library. It took some experimenting to get it in there. Blogs will be blogs, and the number of links in mine choked the harvester, it seems.

Since 2007 the Royal Library has been archiving websites, and it now stores some 25,000 of them. My blog, even though it is one of the oldest still maintained in the Netherlands, was never part of that effort. Mostly because it’s not very visible as a Dutch blog: it is mostly written in English and resides on a .org domain (when I registered zylstra.org, private persons could not yet register .nl domains, only companies could). At an Internet Archive event organised by the Royal Library in September last year I asked about archiving, and they told me how to suggest my website for it.

Late last January I received a message that my website would be included in their archives from now on.

What followed were several test runs with their harvester Heritrix, which is also used by the Internet Archive. I wondered how the harvester would deal with some of my website’s peculiarities. Not every posting is listed on my site, for instance, although each does have a direct URL. The years’ worth of weekly notes, for example, are not listed on this site. Many postings also never appear on the front page, and if you page through the postings there you will never encounter them. This is true for categories of posts like books, photos, and day-to-day topics. I discussed this with the web archivist, who ran some tests. My week notes seemed to be included, but paginating through the day-to-day category stalled out at 180 pages, although there were still more.

To my surprise they also ran into volume limits. Apparently because of ‘bycatch’: things they archive from other sites because I reference or embed them. In the past few years I have stopped embedding things like photos, except for my slides, which are hosted on a separate domain I registered. While it is normal for a site’s bycatch to be larger than the site itself, mine was very different from what they were used to.
In a first test they limited bycatch to 20GB and ran out of space; set at 40GB in a second test, it still ran out of space. Raising the limits further did not help. In the end they decided to harvest just what is on my zylstra.org domain and not include any bycatch at all. Which is completely fine by me, precisely because I’ve made the effort to bring all kinds of external content ‘home’ to this domain.
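Heritrix’s actual scoping lives in its own crawl job configuration, but the decision they ended up with is simple to state. Purely as a conceptual sketch (the hostname check below is my illustration, not the archive’s actual rule):

```python
from urllib.parse import urlparse

SITE_HOST = "zylstra.org"

def in_scope(url: str) -> bool:
    """Conceptual version of the final scoping decision: keep a URL only
    if it lives on the site's own domain, so no bycatch from other hosts
    gets fetched at all."""
    host = urlparse(url).hostname or ""
    return host == SITE_HOST or host.endswith("." + SITE_HOST)

assert in_scope("https://www.zylstra.org/blog/2024/03/some-post/")
assert not in_scope("https://live.staticflickr.com/some/image.jpg")
```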

Nevertheless it did surprise me that bycatch turned out to be a problem, as they are using a tool the Internet Archive itself uses too. I asked for some examples of the bycatch. They told me it wasn’t even possible to dump a URL list of the bycatch into a spreadsheet, as it hit the maximum number of rows (around 65k, iirc). I did get some of the URLs that contributed bigger volumes of bycatch. To my surprise I recognised only one of the links.

One was obvious: 2,800 attempts to harvest pages on live.staticflickr.com, as I link a lot to my Flickr-hosted images, although I no longer embed them but keep local versions on this domain.
Others were not obvious to me at all: theguardian.tv, vp.nyt.com and various content delivery networks. I link to none of them on this site. I do link to The Guardian about 100 times, and to the NYT about 40 times, and I suppose that if the harvester also harvests the targets of those links, it will find additional material there that explains the bycatch more fully.
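Seeing where bycatch concentrates is a matter of tallying a dump of harvested URLs by hostname. A hypothetical sketch, assuming a plain text file with one URL per line (like the 65k-plus row list mentioned above; the filename is made up):

```python
from collections import Counter
from urllib.parse import urlparse

# Tally a plain-text dump of bycatch URLs (one per line) by hostname.
def top_bycatch_hosts(path, n=20):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            host = urlparse(line.strip()).hostname
            if host:
                counts[host] += 1
    return counts.most_common(n)

for host, hits in top_bycatch_hosts("bycatch-urls.txt"):
    print(f"{hits:6d}  {host}")
```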

If that is the case, that it harvests everything I’ve linked to, then it is the long history of this blog that makes the harvester hit its limits.

There are some 20,000 external links in this blog’s articles, as far as I can quickly estimate from a full content export I made this week.
It basically means that if the harvester attempts to fetch all those links and the resources they include, it adds a number of pages to the archive roughly equivalent to the current archive itself.
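My quick estimate came from the export itself. A rough sketch of such a count, assuming a standard WordPress (WXR) content export; the filename is made up, and a regex stands in for proper HTML parsing:

```python
import re
from urllib.parse import urlparse
from xml.etree import ElementTree

SITE_HOST = "zylstra.org"
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"

# Rough count of links in a WXR export pointing away from this domain.
def count_external_links(wxr_path):
    total = 0
    for encoded in ElementTree.parse(wxr_path).iter(CONTENT_TAG):
        for href in re.findall(r'href="(https?://[^"]+)"', encoded.text or ""):
            host = urlparse(href).hostname or ""
            if host != SITE_HOST and not host.endswith("." + SITE_HOST):
                total += 1
    return total

print(count_external_links("wordpress-export.xml"))
```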

A weblog embraces what the world wide web is: a bunch of links to other websites. The name weblog says it. A web-log is a curation hub for web readers, pointing out other interesting stuff, not trying to keep you here too long. Over 23 years of blogging yielded some 20,000 links to other websites. Given enough time, a blog becomes the web itself in terms of its links, as much as it becomes its author’s avatar in terms of its content.

From now on my site will be updated in the Royal Library’s archives every year on March 5th.


The facade of the Royal Library in The Hague, photo by Ferdi de Gier, license CC-BY-SA

After being informed about the Royal Library’s intention to archive my website, I wondered how some aspects of my site may affect what gets collected.
Specifically:

  • Most of my postings are kept away from the front page and end up in specific categories. These postings do show up in monthly archives and in overview pages for a tag or category.
  • Some of my postings are unlisted on the site, yet are publicly available. Mostly these are postings I originally shared only through RSS, such as my week notes. They are not in overviews and don’t show up in search results, but they have public URLs, and you can navigate to them by clicking next/previous post on their surrounding posts in the timeline.

The crawler that will be used for the archiving is Heritrix, which is also used by the Internet Archive itself.
A quick test of some posts of both types above shows they are likely not in the Internet Archive. I mailed the Royal Library to ask how Heritrix may or may not deal with my site’s quirks. Or perhaps I could generate a complete site map and make that available?
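Generating such a site map from a list of all post URLs (pulled from a content export, for instance) would be straightforward. A minimal sketch, with a made-up URL for illustration:

```python
from xml.sax.saxutils import escape

# Minimal sitemap.xml generator from a list of post URLs.
def make_sitemap(urls):
    entries = "\n".join(
        f"  <url><loc>{escape(u)}</loc></url>" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n</urlset>\n"
    )

print(make_sitemap([
    "https://www.zylstra.org/blog/2024/01/an-unlisted-week-note/",
]))
```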

I think I’ll put this up on the front page 😉

Yesterday I received word from the Dutch Royal Library that this weblog will be included in their digital archives from now on.
The Dutch Royal Library started archiving selected websites in 2007. At one point, years ago, they had a pilot project to include a range of Dutch weblogs. My blog fell outside their scope of perception then, because I write mostly in English, and because my site lives on a .org domain, not a .nl domain. When this blog started it wasn’t possible for individuals to register .nl domains; you had to be registered as a company for that.

Last September I attended a session at the Royal Library in The Hague, where Brewster Kahle also presented the European efforts of the Internet Archive, and the collaboration between the two organisations was discussed. There I learned that I could now actually submit a request to be considered for archiving. Which I did. Their decision has now been taken.

Currently the Dutch Royal Library archives some 25,000 websites, out of the 10 million or so existing websites in the Netherlands, in other words just a quarter of one percent.
My blog is probably one of a small number of personal blogs in the Netherlands that has resided on the same URL this long (23 years) and is still active. Other bloggers from way back when and before I started, like Frank Meeuwsen, have switched domain names several times over the years. Frank’s blog has been included in the archive since 2018.

Preservation in the digital archive is not a form of recognition; the Royal Library aims to preserve a representative subset for future research purposes. A recent wave of additions covered, for example, all kinds of web initiatives from the pandemic, preserving a window on that period.
It is however a way of shaping digital longevity, something I mentioned here some years ago. Then I suggested submitting collections of postings as books with their own ISBN numbers. That still is a good route I think. Being part of the digital archive is definitely a step towards digital longevity too.
I do like that my site is now included in the archives of the Dutch Royal Library.

In reply to Responses by Jeremy Keith

Important to me is that people leave longer traces online. Longer traces make it more likely others stumble across them and follow the trail towards new conversation and interaction. In that sense the webmentions of likes and reposts, while not important in themselves, do allow others to find people who cared to respond in a tiny way to something you are reading. If you enjoyed the reading, you may well be interested in finding the others who enjoyed reading the same thing. So you may follow them in turn, or be on the lookout for a conversation with them. It used to be that I always checked every commenter on a blog post I commented on, to see if they blogged themselves and if I wanted to follow their feed. Showing the webmentions, likes and whatnot is a means of discovery.

Right now I accept these likes and shares as webmentions. I display a tally of each kind of response under my posts. But I’m not sure why I’m doing it. I don’t particularly care about these numbers. I’m pretty sure no one else cares either.

Jeremy Keith

For about four years now I have been using a personalised feedreader (running on top of a self-hosted FreshRSS instance, whose API handles the RSS subscriptions).
My feedreader allows me to interact with the Web, not just read it. I can post to this blog (and a few other websites) directly from it and keep reading my feeds. The same goes for adding an annotation to Hypothes.is, and for adding a note in markdown to my filesystem in the folder where Obsidian lives.
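As a rough illustration of the reading side: FreshRSS exposes a Google Reader-compatible API, and pulling the reading list from it boils down to something like the sketch below. The host and credentials are placeholders, and my actual scripts differ.

```python
import json
import urllib.parse
import urllib.request

# Assumed FreshRSS host; user and password are placeholders too.
BASE = "https://freshrss.example.net/api/greader.php"

def login(user, password):
    # ClientLogin returns lines like "Auth=<token>"; grab the token.
    qs = urllib.parse.urlencode({"Email": user, "Passwd": password})
    with urllib.request.urlopen(f"{BASE}/accounts/ClientLogin?{qs}") as r:
        lines = r.read().decode().splitlines()
    return dict(line.split("=", 1) for line in lines)["Auth"]

def reading_list(token):
    # The Google Reader API addresses the main timeline by stream id.
    url = (f"{BASE}/reader/api/0/stream/contents/"
           "user/-/state/com.google/reading-list?output=json")
    req = urllib.request.Request(
        url, headers={"Authorization": f"GoogleLogin auth={token}"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["items"]

for item in reading_list(login("alice", "secret"))[:10]:
    print(item.get("title"))
```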

Recently I mentioned that I want to make my habit of annotating web postings in Hypothes.is easier to keep up.
As I wrote then:

… currently from within my feedreader I can post to either my blog or to Hypothes.is, but not both. I want to change that, so that the same thing can serve two purposes simultaneously.

I have now adapted my feedreader interface and related scripts to do just that.
It can post to a few websites AND to Hypothes.is AND to Obsidian, all at the same time now. It used to be just one of those: one of the sites, Hypothes.is, or Obsidian. Posting to both Hypothes.is and Obsidian simultaneously won’t happen a lot in practice, as my Hypothes.is annotations already end up in Obsidian anyway. I use saving to Obsidian mostly to capture an entire posting, whereas I use Hypothes.is in my feedreader to initially bookmark a page so I might return later to annotate more. The current version of the response form in my feedreader is shown below.
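The fan-out behind that form is conceptually simple: one response, several optional targets. A sketch of the idea, using Micropub for the blogs, the Hypothes.is API for annotations, and a plain markdown file for Obsidian; the endpoints, tokens and vault path are placeholders, and my actual scripts differ.

```python
import json
import pathlib
import urllib.parse
import urllib.request

MICROPUB_ENDPOINT = "https://www.example.org/micropub"  # placeholder
MICROPUB_TOKEN = "..."
HYPOTHESIS_TOKEN = "..."
OBSIDIAN_FOLDER = pathlib.Path("~/Obsidian/inbox").expanduser()

def post_to_blog(content):
    """Post a note via Micropub (form-encoded h=entry, per the spec)."""
    data = urllib.parse.urlencode({"h": "entry", "content": content}).encode()
    req = urllib.request.Request(
        MICROPUB_ENDPOINT, data=data,
        headers={"Authorization": f"Bearer {MICROPUB_TOKEN}"})
    urllib.request.urlopen(req)

def post_to_hypothesis(uri, text):
    """Create an annotation through the Hypothes.is API."""
    body = json.dumps({"uri": uri, "text": text}).encode()
    req = urllib.request.Request(
        "https://api.hypothes.is/api/annotations", data=body,
        headers={"Authorization": f"Bearer {HYPOTHESIS_TOKEN}",
                 "Content-Type": "application/json"})
    urllib.request.urlopen(req)

def save_to_obsidian(title, markdown):
    """Drop a markdown note into the folder Obsidian watches."""
    (OBSIDIAN_FOLDER / f"{title}.md").write_text(markdown, encoding="utf-8")

def respond(uri, title, text, blog=False, hypothesis=False, obsidian=False):
    # One response, fanned out to every target that was ticked in the form.
    if blog:
        post_to_blog(f"{text}\n\n{uri}")
    if hypothesis:
        post_to_hypothesis(uri, text)
    if obsidian:
        save_to_obsidian(title, f"{text}\n\nSource: {uri}")
```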

One element I added to the interface that I hadn’t coded yet in the back-end: posting to my personal and/or my business Mastodon accounts. [UPDATE: I did that now too.] Now that is done, I can write to all the places where I write on the web, right from where I read it, as in Tim Berners-Lee’s original vision:

The idea was that anybody who used the web would have a space where they could write and so the first browser was an editor, it was a writer as well as a reader. Every person who used the web had the ability to write something. It was very easy to make a new web page and comment on what somebody else had written, which is very much what blogging is about.

Tim Berners-Lee in a BBC interview in 2005
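As for the Mastodon part of the update above: posting a status is a single authenticated API call per account. A minimal sketch, with a placeholder instance and token:

```python
import urllib.parse
import urllib.request

# Post a status through the Mastodon API; one call per account covers
# both the personal and the business account.
def toot(instance, token, text):
    data = urllib.parse.urlencode({"status": text}).encode()
    req = urllib.request.Request(
        f"https://{instance}/api/v1/statuses", data=data,
        headers={"Authorization": f"Bearer {token}"})
    urllib.request.urlopen(req)

toot("mastodon.example", "TOKEN", "Posting right from my feedreader!")
```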