Earlier this week I started reading an ebook and was a bit irritated because the book did not show me a table of contents. This seems to be a regular thing in ebooks. I have complained here before that ebooks, or perhaps mostly e-readers, make so little use of the affordances of digital files.
ePUB files are really zipped archives of XML: XHTML documents for the content, plus a few XML files that describe the book. Since I left Amazon and the Kindle reader behind, all my ebooks are ePUB files. XML means that the files are machine readable and highly structured. That opens up possibilities to manipulate them.
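To make that structure concrete, here is a minimal Python sketch that opens an ePUB as a zip archive and looks up where its package document (the OPF file that lists everything in the book) lives. The function name `find_package_path` is made up for this illustration; the `META-INF/container.xml` location and its namespace come from the ePUB format itself.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in ePUB archives.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_path(epub):
    """Return the path of the OPF package document inside an ePUB.

    `epub` may be a filename or a binary file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("no rootfile entry in META-INF/container.xml")
    return rootfile.attrib["full-path"]
```

Calling `find_package_path("some-book.epub")` on a typical file gives something like `OEBPS/content.opf`, though the exact path differs per book.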
I used Claude Code to ask a few questions about ePUB files and how they are treated by e-readers. E-readers differ in how they deal with the information in an ePUB file. They may load a table of contents into a local database and use that to allow navigation, or ignore various pieces of information in the XML altogether.
For fun, I asked Claude Code to check the XML file of the ebook I was reading earlier this week, to see if it actually contained a table of contents that was just not shown to me in my reader. Turns out it did.
I also asked it whether it would take a lot to extract a table of contents from an ebook. It doesn’t, so I now have a first script that finds the table of contents if present, or builds one from the headings in the ePUB’s XML if not. The PHP script saves it to a markdown file that I can then use in my book notes, to group my thoughts and annotations.
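This is not the PHP script itself, just a minimal Python sketch of the same idea under a few assumptions: the ToC is read from an NCX document's `navMap`, the fallback walks `h1` to `h6` elements in one XHTML content document, and both inputs are well-formed XML. The function names are invented for the example, and real-world files with named entities such as `&nbsp;` would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

NCX = "{http://www.daisy.org/z3986/2005/ncx/}"
XHTML = "{http://www.w3.org/1999/xhtml}"
HEADINGS = {f"{XHTML}h{n}": n for n in range(1, 7)}


def ncx_to_markdown(ncx_xml):
    """Turn the navMap of an NCX document into a nested markdown list."""
    root = ET.fromstring(ncx_xml)
    nav_map = root.find(f"{NCX}navMap")
    lines = []

    def walk(parent, depth):
        # navPoints can be nested; each nesting level indents by two spaces.
        for point in parent.findall(f"{NCX}navPoint"):
            text = point.find(f"{NCX}navLabel/{NCX}text")
            label = (text.text or "").strip() if text is not None else ""
            lines.append(f"{'  ' * depth}- {label}")
            walk(point, depth + 1)

    if nav_map is not None:
        walk(nav_map, 0)
    return "\n".join(lines)


def headings_to_markdown(xhtml):
    """Fallback: build a list from the h1..h6 headings of one content file."""
    root = ET.fromstring(xhtml)
    lines = []
    for el in root.iter():
        level = HEADINGS.get(el.tag)
        if level:
            # Collapse whitespace and include text inside inline markup.
            title = " ".join("".join(el.itertext()).split())
            if title:
                lines.append(f"{'  ' * (level - 1)}- {title}")
    return "\n".join(lines)
```

Both functions return a plain markdown string, so the result can be written to a file or pasted into a note as is.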
In my Kobo reader and in my Calibre reader, the ToC information that the ePUB file provides outside the regular content of the book (as an NCX file or an XHTML navigation document) is accessible through the reader’s interface, but is not part of the reading experience itself. I generally like my ToC to also be presented in the book, like it is in a paper one, and I actually prefer it not at the start, as is usual, but at the end, near notes, references, and literature lists, to have all the book’s metadata together to glance at. For that, a ToC must not be separate from the book’s content, but within it. It would need to be in the ‘spine’, the ordered list of content files that reading software presents for reading.
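To show what the spine is in practice, here is a small Python sketch that reads a package document (the OPF file) and returns the content files in reading order. It assumes the OPF has already been read from the archive as a string or bytes; `spine_hrefs` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"


def spine_hrefs(opf_xml):
    """Return the hrefs of the content files in spine (reading) order."""
    root = ET.fromstring(opf_xml)
    # The manifest maps item ids to files; the spine lists ids in order.
    manifest = {
        item.get("id"): item.get("href") for item in root.iter(f"{OPF}item")
    }
    return [
        manifest[ref.get("idref")]
        for ref in root.iter(f"{OPF}itemref")
        if ref.get("idref") in manifest
    ]
```

Anything that is in the manifest but not referenced from the spine, such as an NCX file, is part of the ebook but never shows up as a page while reading.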
If I annotate or highlight in a book, an e-reader keeps those separate from the book, as pointers to specific points inside the XML (through EPUB Canonical Fragment Identifiers, CFI). You can alter an e-book, since it’s XML after all, but that would shift the position of content fragments, and existing pointers from annotations and highlights would then point to the wrong places in the book.
So if I add a ToC, grabbed from the existing metadata or constructed, inside an e-book, my preference for having it at the end is actually useful: if I add it at the end, it will not shift anything I may have annotated or highlighted already, so the pointers in the annotation file stay intact.
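A rough Python sketch of that append step, under the assumption that a ToC page already exists as an XHTML file: it adds the page to the manifest and puts it last in the spine, leaving every earlier spine entry where it was. The name `append_to_spine` and the ids used are invented for the example, and writing the new page and the changed package document back into the zip is left out.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
# Keep the usual prefixes when the package document is serialised again.
ET.register_namespace("", OPF_NS)
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")


def append_to_spine(opf_xml, item_id, href):
    """Return the package document with one XHTML page added at the end."""
    root = ET.fromstring(opf_xml)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    spine = root.find(f"{{{OPF_NS}}}spine")
    if manifest is None or spine is None:
        raise ValueError("package document lacks a manifest or spine")
    # Declare the new file, then reference it as the last spine entry.
    ET.SubElement(
        manifest,
        f"{{{OPF_NS}}}item",
        {"id": item_id, "href": href, "media-type": "application/xhtml+xml"},
    )
    ET.SubElement(spine, f"{{{OPF_NS}}}itemref", {"idref": item_id})
    return ET.tostring(root, encoding="unicode")
```

Because existing spine entries keep their positions, pointers into earlier content files should stay valid, although how a particular reader stores its annotations is something to check before trusting this on a book with many highlights.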
Besides extracting a ToC, I’m also thinking about extracting other meta-information (like indexes, references, lists of images or tables), but a first glimpse into some ebooks suggests that those are not usually listed in the manifest of an ebook, so they would have to be constructed from clues inside the book.
Still, it would help me read non-fiction non-linearly if I could extract such things, e.g. the figures and tables present. It seems to me that a number of such steps should be straightforward given the structure of an ePUB file. Others need a parser to extract the right information and shape it into a useful form, but can still be done with regular scripts (e.g. show me the first and last two paragraphs of a chapter to get a notion of what it talks about). Yet others do need a (local) LLM, e.g. to summarise each section of a book separately. I’ll see how far I can get, and learn about the ePUB format along the way, starting with deterministic code to extend my personal and local toolkit on my computer.
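The ‘first and last two paragraphs’ example is the kind of thing a regular script can do. A minimal Python sketch, assuming a chapter is a single well-formed XHTML file and that paragraphs are plain `p` elements; `chapter_edges` is a made-up name.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def chapter_edges(xhtml, count=2):
    """Return (first, last) lists of paragraph texts from one chapter file."""
    root = ET.fromstring(xhtml)
    paragraphs = []
    for p in root.iter(f"{XHTML}p"):
        # Join text across inline markup and collapse whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    # Short chapters: return everything once instead of overlapping slices.
    if len(paragraphs) <= 2 * count:
        return paragraphs, []
    return paragraphs[:count], paragraphs[-count:]
```

Empty paragraphs are skipped, so decorative spacing in the markup does not eat up the two slots at either end.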
Update 12-04-2026: I now have a script that I run in my browser, which allows me to select an ebook from my Calibre library and then explores it with respect to the table of contents, reference and literature sections, and images. It also pulls in the first and last few paragraphs of each chapter (which lets me explore what a chapter is about, Adler style). All that gets turned into a markdown file that is then put in the corresponding book note in my Obsidian vault using the right template.
