I have a little over 25 years worth of various notes and writings, and a little over 20 years of blogposts. A corpus that reflects my life, interests, attitude, thoughts, interactions and work over most of my adult life. Wouldn’t it be interesting to run that personal archive as my own chatbot, to specialise a LLM for my own use?

Generally I’ve been interested in using algorithms as personal or group tools for a number of years.

For algorithms to help, like any tool, they need to be ‘smaller’ than us, as I wrote in my networked agency manifesto. We need to be able to control its settings, tinker with it, deploy it and stop it as we see fit.
Me, April 2018, in Algorithms That Work For Me, Not Commoditise Me

Most if not all of our exposure to algorithms online however treats us as a means to manipulate our engagement. I see them as potentially very valuable tools in working with lots of information. But not in their current common incarnations.

Going back to a less algorithmic way of dealing with information isn’t an option, nor something to desire I think. But we do need algorithms that really serve us, perform to our information needs. We need less algorithms that purport to aid us in dealing with the daily river of newsy stuff, but really commodotise us at the back-end.
Me, April 2018, in Algorithms That Work For Me, Not Commoditise Me

Some of the things I’d like my ideal RSS reader to be able to do are along such lines, e.g. to signal new patterns among the people I interact with, or outliers in their writings. Basically to signal social eddies and shifts among my network’s online sharing.

LLMs are highly interesting in that regard too, as in contrast to the engagement optimising social media algorithms, they are focused on large corpora of text and generation thereof, and not on emergent social behaviour around texts. Once trained on a large enough generic corpus, one could potentially tune it with a specific corpus. Specific to a certain niche topic, or to the interests of a single person, small group of people or community of practice. Such as all of my own material. Decades worth of writings, presentations, notes, e-mails etc. The mirror image of me as expressed in all my archived files.

Doing so with a personal corpus, for me has a few prerequisites:

  • It would need to be a separate instance of whatever tech it uses. If possible self-hosted.
  • There should be no feedback to the underlying generic and publicly available model, there should be no bleed-over into other people’s interactions with that model.
  • The separate instance needs an off-switch under my control, where off means none of my inputs are available for use someplace else.

Running your own Stable Diffusion image generator set-up as E currently does complies with this for instance.

Doing so with a LLM text generator would create a way of chatting with my own PKM material, ChatPKM, a way to interact (differently than through search and links, as I do now) with my Avatar (not just my blog though, all my notes). It might adopt my personal style and phrasing in its outputs. When (not if) it hallucinates it would be my own trip so to speak. It would be clear what inputs are in play, w.r.t. the specialisation, so verification and references should be easier to follow up on. It would be a personal prompting tool, to communicate with your own pet stochastic parrot.

Current attempts at chatbots in this style seem to focus on things like customer interaction. Feed it your product manual, have it chat to customers with questions about the product. A fancy version of ‘have you tried switching it off and back on?‘ These services allow you to input one or a handful of docs or sources, and then chat about its contents.
One of those is Chatbase, another is ChatThing by Pixelhop. The last one has the option of continuously adding source material to presumably the same chatbot(s), but more or less on a per file and per URL basis and limited in number of words per month. That’s not like starting out with half a GB in markdown text of notes and writings covering several decades, let alone tens of GBs of e-mail interactions for instance.

Pixelhop is currently working with Dave Winer however to do some of what I mention above: use Dave’s entire blog archives as input. Dave has been blogging since the mid 1990s, so there’s quite a lot of material there.
Checking out ChatThing suggests that they built on OpenAI’s ChatGPT 3.5 through its API. So it wouldn’t qualify per the prerequisites I mentioned. Yet, purposely feeding it a specific online blog archive is less problematic than including my own notes as all the source material involved is public anyway.
The resulting Scripting News bot is a fascinating experiment, the work around which you can follow on GitHub. (As part of that Dave also shared a markdown version of his complete blog archives (33MB), which for fun I loaded into Obsidian to search through. Also for comparison with the generated outputs from the chatbot, such as the question Dave asked the bot when he first wrote about the iPhone on his blog.)

Looking forward to more experiments by Dave and Pixelhop. Meanwhile I’ve joined Pixelhop’s Discord to follow their developments.

21 reactions on “ChatPKM Or Your Personal Stochastic Parrot

  1. @ton At some point I want to work with some product to do something similar with my own work. After all, I have almost 40,000 OLDaily posts (and a similar number of photos, plus a good number of audio and video recordings) so an AI would have a lot to work with. Who knows: digital Stephen might be even better than the real thing!

  2. @ton When I saw the Obsidian Smart Connections plugin this was exactly what I thought: If I could trust it (=self host it) this would be amazing.I don’t think that LLMs operating on the internet corpus are that helpful, but if it was referring to my notes that could actually be of great help. E.g. “What notes already reference this idea? Are there connections I’m missing? Where’s a blind spot?”https://github.com/brianpetro/obsidian-smart-connections
    GitHub – brianpetro/obsidian-smart-connections: Chat with your notes in Obsidian! Plus, see what’s most relevant in real-time! Interact and stay organized. Powered by OpenAI ChatGPT, GPT-4 & Embeddings.

Comments are closed.