After being informed that the Royal Library intends to archive my website, I wondered how some aspects of my site might affect what gets collected.
Specifically:

  • Most of my postings are kept away from the front page and only end up in specific categories. These postings do show up in the monthly archives and in overview pages, such as those for a tag or category.
  • Some of my postings are unlisted on the site, yet publicly available. Mostly these are postings I originally shared only through RSS, such as my week notes. They are not in overviews and don’t show up in search results, but they have public URLs, and you can navigate to them by clicking next / previous post on the surrounding posts in the timeline.

The crawler that will be used for the archiving is Heritrix, which is also used by the Internet Archive itself.
A quick test of some posts of both types above suggests they are likely not in the Internet Archive. I mailed the Royal Library to ask how Heritrix deals with my site’s quirks. Or perhaps I can generate a complete sitemap and make that available?
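
Generating such a sitemap could look roughly like the sketch below. It assumes I can get a list of all public post URLs with their last-modified dates out of my site, including the unlisted ones. The function name `build_sitemap` and the example URLs are made up for illustration. The output follows the sitemaps.org format; whether Heritrix as configured by the Royal Library would actually use it is exactly what I don't know yet.

```python
from datetime import date
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(posts):
    """Return sitemap XML as a string.

    posts: iterable of (url, last_modified) pairs, where last_modified
    is a datetime.date. Hypothetical helper, not part of any library.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in posts:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Example with placeholder URLs: one categorised post, one unlisted post.
example_posts = [
    ("https://example.org/2024/01/a-category-only-post/", date(2024, 1, 15)),
    ("https://example.org/2024/01/unlisted-week-notes/", date(2024, 1, 21)),
]
sitemap_xml = build_sitemap(example_posts)
```

Writing `sitemap_xml` to a `sitemap.xml` file at the site root would then give a crawler a direct route to postings that are never linked from an overview page.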

I think I’ll put this up on the front page 😉