Friday 13th March, 2026
Weeks 894 and 895
Weeks beginning Monday 2nd and 9th March.
We’ve spent the past two weeks continuing to work on the GOV.UK publishing project with dxw, focussing specifically on building prototypes of tools that could help improve the health of the Topic Taxonomy.
Extracting new Topic taxonomy taxons
James has continued exploring the idea of extracting new child topics from leaf taxons (i.e. those with no child taxons below them) that contain more than the desired 300 documents. He’s built on the work started in week 893 by testing variations of the topic generation algorithm and by creating some static HTML pages that allow the output to be explored more easily. There’s lots of information in the repository if you’re interested in more detail. We’ve shared this with one of the information architects at GDS so we’ll wait to see what they make of the suggestions.
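To make the selection criterion concrete, here is a minimal sketch of picking out the leaf taxons that hold more than 300 documents. It is not taken from the repository: the `Taxon` struct, the method name and the sample data are all hypothetical, and only the "no children" and "more than 300 documents" rules come from the description above.

```ruby
# Hypothetical data shape: a taxon with a name, its child taxon names
# and the number of documents tagged to it.
Taxon = Struct.new(:name, :children, :document_count, keyword_init: true)

# The desired upper limit mentioned above.
DOCUMENT_LIMIT = 300

# A leaf taxon has no children. It is a candidate for splitting into
# new child topics when it holds more documents than the limit.
def oversized_leaf_taxons(taxons, limit: DOCUMENT_LIMIT)
  taxons.select { |taxon| taxon.children.empty? && taxon.document_count > limit }
end

# Made-up sample data, for illustration only.
taxons = [
  Taxon.new(name: "Education", children: ["Schools"], document_count: 1200),
  Taxon.new(name: "Schools", children: [], document_count: 450),
  Taxon.new(name: "Apprenticeships", children: [], document_count: 120)
]

puts oversized_leaf_taxons(taxons).map(&:name)
```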
Suggesting Topic taxonomy taxons to publishers
Chris and I have been exploring how we might suggest Topic taxons when authors are publishing on GOV.UK. We’re testing this by finding documents similar to the one being authored and suggesting the Topic taxons associated with those similar documents. We’re using RubyLLM with OpenRouter and the qwen3-embedding-4b model to generate embeddings for most of the content published on GOV.UK in 2025. We’re then using SQLite with the sqlite_vec extension to perform a similarity search that finds the five documents most similar to the source document. Finally, we return the Topic taxons associated with those similar documents so that we can evaluate whether they look like sensible suggestions.
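The overall shape of that lookup can be sketched in plain Ruby. This is an illustration rather than the prototype's code: it swaps the sqlite_vec search for an in-memory cosine similarity, uses tiny made-up two-dimensional vectors in place of real embeddings, and the method names and document hashes are hypothetical.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Find the `limit` documents closest to the source embedding and return
# the taxons attached to them, nearest document first, without duplicates.
def suggest_taxons(source_embedding, documents, limit: 5)
  nearest = documents.max_by(limit) do |document|
    cosine_similarity(source_embedding, document[:embedding])
  end
  nearest.flat_map { |document| document[:taxons] }.uniq
end

# Made-up documents with toy embeddings, for illustration only.
documents = [
  { path: "/a", embedding: [1.0, 0.0], taxons: ["Tax"] },
  { path: "/b", embedding: [0.9, 0.1], taxons: ["Tax", "Business"] },
  { path: "/c", embedding: [0.0, 1.0], taxons: ["Health"] }
]

puts suggest_taxons([1.0, 0.0], documents, limit: 2)
```

Returning the taxons in nearest-first order is one possible choice; ranking them by how many of the similar documents share each taxon would be another.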
We ran into a problem where some of the published documents were exceeding the context window of the embedding model, so we’ve started to play with the Tokenizers Ruby Gem and the appropriate qwen3 tokenizer to check the tokenized length of the content before sending it for embedding.
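A length check of this kind might look roughly like the sketch below. To keep it runnable without downloading a model, it uses a stand-in tokenizer that splits on whitespace, which is an assumption made purely for illustration; a real check would count tokens with the qwen3 tokenizer instead. The class, method names and the token limit are all hypothetical.

```ruby
# Stand-in tokenizer so the sketch runs offline. It only approximates
# token counts; a real check would use the embedding model's tokenizer.
class WhitespaceTokenizer
  def token_count(text)
    text.split.length
  end
end

# Placeholder limit for the example; the real value depends on the model.
EXAMPLE_TOKEN_LIMIT = 8

# Split documents into those that fit within the limit and those that
# would need truncating or chunking before being sent for embedding.
def partition_by_token_length(documents, tokenizer, max_tokens: EXAMPLE_TOKEN_LIMIT)
  documents.partition { |document| tokenizer.token_count(document[:body]) <= max_tokens }
end

# Made-up documents, for illustration only.
documents = [
  { path: "/short", body: "A short piece of guidance." },
  { path: "/long", body: "word " * 20 }
]

fits, too_long = partition_by_token_length(documents, WhitespaceTokenizer.new)
puts "OK to embed: #{fits.map { |d| d[:path] }.join(', ')}"
puts "Too long: #{too_long.map { |d| d[:path] }.join(', ')}"
```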
We’re currently generating the output of this as a static site, using the same pattern we introduced in the national applicability experiment: we use Rake’s file tasks to process each source document into an HTML page.
This has the advantage of allowing us to make changes downstream of the embedding step without having to also regenerate all the embeddings.
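As a rough sketch of why file tasks give that advantage, the example below defines two tasks per source document: one that stores the result of an expensive step on disk, and one that builds the HTML page from the stored result. The directory layout, the method name and the stand-in "embedding" (just the character count) are assumptions for illustration, not the project's actual Rakefile. `Rake::FileTask.define_task` is what Rake's `file` method calls under the hood.

```ruby
require "rake"
require "fileutils"
require "json"

# Hypothetical layout: sources/*.txt -> embeddings/*.json -> pages/*.html
# Returns the paths of the page tasks that were defined.
def define_pipeline(root)
  Dir.glob(File.join(root, "sources", "*.txt")).sort.map do |source|
    name = File.basename(source, ".txt")
    embedding = File.join(root, "embeddings", "#{name}.json")
    page = File.join(root, "pages", "#{name}.html")

    # Expensive step (stand-in): only runs when the JSON file is missing
    # or older than the source document.
    Rake::FileTask.define_task(embedding => source) do |task|
      FileUtils.mkdir_p(File.dirname(task.name))
      File.write(task.name, JSON.generate("length" => File.read(source).length))
    end

    # Cheap step: rebuilds the page from the stored result.
    Rake::FileTask.define_task(page => embedding) do |task|
      FileUtils.mkdir_p(File.dirname(task.name))
      data = JSON.parse(File.read(embedding))
      File.write(task.name, "<p>#{name}: #{data["length"]} characters</p>\n")
    end

    page
  end
end
```

Invoking `Rake::Task[page].invoke` for each returned path builds whatever is missing or out of date. Deleting the files under `pages` and running again would rebuild only the HTML, leaving the stored JSON files untouched.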
It’s looking promising but there’s still a bit more work to do before we can start showing this to other people to get some feedback.
In non-GDS news:
- Chris, James and I spent an enjoyable and productive day at Space4 last Friday working together in person.
- We’ve had a few more conversations about possible projects later in the year.
- Chris has added a page about CoTech’s work to the CoTech website.
- Chris and I have both agreed to take on new offices from April as both of our current spaces are unavailable from the end of this month.
Until next time.
– Chris