Blogging with Jupyter notebooks

After I read Stephen Wolfram's very nerdy blog post about his productivity tools, I was intrigued by the idea that he was using Wolfram notebooks for almost everything he produced, from documentation for his business to blog posts. The reason I find this a really cool idea is that you can include runnable code in a post, with its output formatted as data frames or a matplotlib graph. Jupyter is pretty flexible in the way it allows you to produce output, so you can go pretty wild, like rendering your own SVGs.

Migration from Hugo

I had been using Hugo for a while, and everything I wrote was in markdown; Jupyter has built-in support for markdown. The only thing I needed to do was write a small Python script that would embed the markdown files from Hugo into Jupyter notebooks (which are JSON and easy to understand). The only tricky part was that Hugo uses front matter to keep the metadata for posts, while Jupyter doesn't have such a feature. I ended up creating a raw cell that holds this metadata, and the second cell is the markdown cell that has the content.
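
Roughly, the conversion looks like this (a sketch; the function name and directory layout are made up, and it assumes the front matter is delimited by --- markers):

import json
from pathlib import Path

def markdown_to_notebook(md_path, out_dir):
    text = Path(md_path).read_text()
    # Split the front matter (between the leading "---" markers) from the body.
    _, front_matter, body = text.split("---", 2)
    notebook = {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            # First cell: a raw cell holding the Hugo front matter.
            {"cell_type": "raw", "metadata": {}, "source": front_matter.strip()},
            # Second cell: the post content as a markdown cell.
            {"cell_type": "markdown", "metadata": {}, "source": body.strip()},
        ],
    }
    out = Path(out_dir) / (Path(md_path).stem + ".ipynb")
    out.write_text(json.dumps(notebook, indent=1))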

Notebook to HTML

I had a couple of different options for generating the HTML from notebooks.

The built-in HTML generation is the easiest and most straightforward option, but the resulting HTML was huge: a notebook with no image attachments was around 575 KB. I took a look at the templating system to see if I could optimize it, but there was a lot of CSS compiled into the page and it looked pretty complicated, so I didn't want to invest too much time into exploring this.

The markdown export was also a good idea, but the main issue with this approach was processing the attachments. I tried a tool called nbdev which exports all the attachments to files, but the links in the markdown document were not converted to the proper URLs. It might work for inline images generated by code cells, but it will not work for attached images.

I also considered writing a script to extract the attachments and update the markdown links myself, but that's a lot of work, and I would also have to manage all the different cell types.
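
For a sense of what that involves: in the standard notebook format, attached images live base64-encoded under an "attachments" key on the markdown cell and are referenced as attachment:filename, so the extraction would look roughly like this (the /assets/ URL prefix is made up):

import base64
import json
from pathlib import Path

def extract_attachments(ipynb_path, asset_dir):
    notebook = json.loads(Path(ipynb_path).read_text())
    for cell in notebook["cells"]:
        for name, mime_bundle in cell.get("attachments", {}).items():
            # Write every attachment out as a real file.
            for mime, data in mime_bundle.items():
                (Path(asset_dir) / name).write_bytes(base64.b64decode(data))
            # Rewrite the markdown link to point at the published file.
            src = cell["source"]
            text = "".join(src) if isinstance(src, list) else src
            cell["source"] = text.replace("attachment:" + name, "/assets/" + name)
    return notebook

And that still ignores outputs of code cells, multiple MIME types and so on.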

Then I explored the route of in-browser rendering. Not surprisingly, there is a project called notebook.js which renders a notebook given its ipynb source. I ended up choosing this project and embedding the source file into the HTML.
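
The embedding itself is simple: serialize the notebook JSON into the page and let notebook.js parse and render it client-side. A sketch, assuming notebook.js's nb.parse(...).render() API and a made-up script path (notebook.js also wants helpers like marked for markdown cells, omitted here):

import json
from pathlib import Path

PAGE = """<!doctype html>
<html>
<head><script src="/js/notebook.min.js"></script></head>
<body>
<script>
  // The notebook source is embedded directly into the page,
  // then parsed and rendered client-side by notebook.js.
  var ipynb = {ipynb};
  document.body.appendChild(nb.parse(ipynb).render());
</script>
</body>
</html>"""

def notebook_to_page(ipynb_path, html_path):
    source = json.loads(Path(ipynb_path).read_text())
    # str.replace instead of str.format, so the JS braces survive.
    Path(html_path).write_text(PAGE.replace("{ipynb}", json.dumps(source)))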

Index page generation

There is no "index page" in the world of Jupyter notebooks: you get a file-explorer-like interface and you choose your notebook. I needed something that displays the title, description and tags of the articles in a list, so I put together a script that traverses the notebooks and extracts the metadata from the first cell to create the index page.
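
The traversal is only a few lines (a sketch, assuming the front matter in the first raw cell is YAML and PyYAML is available; the field names are illustrative):

import json
from pathlib import Path

import yaml

def collect_posts(notebook_dir):
    posts = []
    for path in sorted(Path(notebook_dir).glob("*.ipynb")):
        cells = json.loads(path.read_text())["cells"]
        # The first cell is the raw cell holding the front matter.
        src = cells[0]["source"]
        meta = yaml.safe_load("".join(src) if isinstance(src, list) else src)
        posts.append({
            "file": path.stem,
            "title": meta.get("title", path.stem),
            "description": meta.get("description", ""),
            "tags": meta.get("tags", []),
        })
    return posts

Rendering the index page is then just templating over this list.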

Tag pages and menu

One of the great features of the Hugo theme I was using before was the tag pages, but I couldn't be bothered to modify the theme to support a navigation bar with the tags. So I implemented both of these features in the script I wrote for the notebook extractor.
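
Given the metadata collected above, both fall out of a simple grouping (collect_posts is the hypothetical helper from the previous sketch):

from collections import defaultdict

def group_by_tag(posts):
    by_tag = defaultdict(list)
    for post in posts:
        for tag in post["tags"]:
            by_tag[tag].append(post)
    # One index-style page per key, plus a navigation entry per tag.
    return by_tag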

Microblog over IMAP

I recently shut down my Mastodon server because I could not justify paying $50 for a server each month. But I do like to post the occasional micro update, and using a notebook for a short update is not practical. So I created a script that monitors my email account for a certain email format and uses it to create a new notebook with the content and publish it to the micro section on my website.
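
A minimal sketch of that monitor using Python's standard imaplib (the host, credentials and the "[micro]" subject convention are all made up):

import email
import imaplib

def fetch_micro_updates(host, user, password):
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    conn.select("INBOX")
    # Only unread messages whose subject follows the convention.
    _, data = conn.search(None, '(UNSEEN SUBJECT "[micro]")')
    updates = []
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        # Assumes plain-text, non-multipart messages.
        body = msg.get_payload(decode=True).decode()
        updates.append((msg["Subject"], body))
    conn.logout()
    return updates

Each (subject, body) pair then becomes a new two-cell notebook in the micro section, just like in the Hugo migration.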

Search

Search is always complicated with static sites. When there is no server-side code to power the search, it stops making sense to create a reverse lookup index. So the method I came up with was to create a mapping from the source text of the posts to the file names and store that as JSON data to be processed by the client. When you go to the search page, you are basically downloading the entire text content of the posts in a single huge JSON, along with the lemmatizations of the tokens in the body. Something like this:

{
...
"implement": ["implemented", "implements", "implementing"], 
"pay": ["paying", "pays", "paid"]
...
}

Then lemmatizing the input query to get the extended versions of the terms and running a search for all of these terms seems to yield good results, and it is fast enough. For example, if you were to search for "paid", the process would get the lemma "pay", then augment the search to include "paying", "pays" and "paid", and return all the docs that contain these words too.
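
A sketch of that expansion, using NLTK's WordNetLemmatizer as a stand-in for whatever produced the mapping (the variable names are illustrative):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def expand_query(query, lemma_map):
    terms = set()
    for token in query.lower().split():
        lemma = lemmatizer.lemmatize(token, pos="v")  # "paid" -> "pay"
        terms.add(lemma)
        # Pull in every known inflection of the lemma.
        terms.update(lemma_map.get(lemma, []))
    return terms

def search(query, lemma_map, documents):
    terms = expand_query(query, lemma_map)
    # documents maps file names to their full text content.
    return [name for name, text in documents.items()
            if any(term in text for term in terms)]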

The ordering of the search results is something that needs more work, but there aren't that many documents, so it's not that big of a deal.

Having a powerful environment that can evaluate and execute code will be great for technical writing!

for i in range(3):
    print("More power!")

More power!
More power!
More power!


Metadata

First published on 2021-02-13
