The Darth Mall a personal website

On the Importance of Stable IDs

Tagged
Web

First off, let me apologize to all of you subscribed to the RSS feed. I was “tidying up” around my website over the past few days: removing unwanted plugins, eliminating filters, and flattening the site structure a bit by removing the /weblog and /notes directories from my URLs. As a result, anyone subscribed to the RSS feed was inundated with “new” posts from me. Feedbin only showed me four new ones, but a friend said they had 70-ish (probably every post on the site). Here’s what happened…

Atom feed copy-pasta

I’m a big booster of RSS, so having an RSS feed was MVP1 for me when I was setting up my website. I copy-pasta’ed the example feed template from the Eleventy documentation into my project and moved on. I didn’t know much about the various feed formats, and I had very little interest in learning about them. See if you can spot the issue with the example template that bit me this weekend.

Did you find it? In the sample template, the URL for the entry is used as its <id>.

So when I moved all of my posts, the URLs changed, and with them the ID of every entry. When feed readers the world over checked in with my feed, they found a ton of “new” posts. Sure, they had the same titles, text, and dates as posts that had already been viewed, but they had new IDs, which, as far as a feed reader is concerned, makes them different posts.
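To make the failure concrete, here is a minimal sketch of how a feed reader might decide what counts as unread. It assumes the reader keys purely on the entry ID, which is a simplification; the site URL, paths, and function names are all hypothetical.

```python
def new_entries(feed, seen_ids):
    """Return the entries a reader would flag as unread, then remember their IDs."""
    fresh = [entry for entry in feed if entry["id"] not in seen_ids]
    seen_ids.update(entry["id"] for entry in feed)
    return fresh


def build_feed(posts, base_url):
    """Mimic a template that uses each post's URL as its entry ID."""
    return [
        {"id": base_url + post["path"], "title": post["title"]}
        for post in posts
    ]


BASE_URL = "https://example.com"  # hypothetical site

posts_before = [
    {"path": "/weblog/stable-ids/", "title": "On the Importance of Stable IDs"},
    {"path": "/notes/hello/", "title": "Hello"},
]
# The same posts after dropping /weblog and /notes from the paths.
posts_after = [
    {"path": "/stable-ids/", "title": "On the Importance of Stable IDs"},
    {"path": "/hello/", "title": "Hello"},
]

seen = set()
first_fetch = new_entries(build_feed(posts_before, BASE_URL), seen)
second_fetch = new_entries(build_feed(posts_before, BASE_URL), seen)
after_move = new_entries(build_feed(posts_after, BASE_URL), seen)

print(len(first_fetch), len(second_fetch), len(after_move))  # 2 0 2
```

Nothing about the posts changed except their paths, yet the third fetch reports every one of them as new.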

I’ve monkeyed with my feed template a few times here and there over the years, so I was aware that the URLs were being used as the entry ID. It seemed eminently reasonable to do this. After all, you can’t have two different pages at the same URL, so the URL ought to be a perfectly serviceable way to uniquely identify each entry. Hell, even the W3C’s Intro to Atom shows the page URL being used as the ID. But, as the surprise re-release of my entire back catalog demonstrated, it isn’t enough for an ID to be unique among the entries currently in the feed; it also has to stay the same throughout the history of the feed so that feed readers can tell the difference between an old entry and a new one (and an old entry that’s been updated).

The funny thing about the W3C “Introduction to Atom” is that it uses the page URL for the <id>, but it links to a guide for creating good Atom IDs and the first piece of advice in the linked guide is don’t use the URL.

Actually, cool URIs do change

“Everyone” “knows” that cool URIs don’t change, right?2 So really, it’s my fault for moving pages on my website, right? Well, except that I did all the Right Things when I moved these pages so that the old URLs would continue to work. An awful lot of that old W3C page is devoted to explaining that you can, in fact, move things around on your website without changing the URLs. For example, did you know that you can use CGI scripts without having to keep them in a cgi or cgi-bin directory‽3

Setting aside that I do not completely subscribe to this pronouncement about the changing of URIs — a topic for another time, perhaps — I did set up redirects for all of the old URLs so that anyone with a link or a bookmark to one of those pages would not suddenly find themselves in 404-land. So, at least in the sense that apparently mattered to Sir Tim, my URIs didn’t change (and I would argue are significantly cooler without the /weblog and /notes directories in the path).

The thing that’s easy to miss in the W3C’s intro to Atom <id> example is this (emphasis mine):

Identifies the entry using a universally unique and permanent URI.

The ID needs to be a permanent URI. I’d argue that a URL (a Uniform Resource Locator) is a poor URI (a Uniform Resource Identifier) in this case, even though a URL is one type of URI.4 Thanks to HTTP and its many redirect codes, I can move things about on my website all higglety-pigglety while ensuring that every old URL continues to point to the right page. It’s easy, with a few regular expressions, to keep Sir Tim happy while also changing the canonical URL of any page on any website. So maybe we shouldn’t be using these things for IDs that are meant to be permanent…
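As a rough illustration of the “few regular expressions” idea, the sketch below maps old paths to new ones and answers with a permanent redirect. The pattern, the function names, and the choice of a 301 status are assumptions made for the example, not the actual rules running on this site.

```python
import re

# Hypothetical rule: strip a leading /weblog or /notes segment from the path.
OLD_PREFIX = re.compile(r"^/(?:weblog|notes)(/.*)$")


def redirect_target(path):
    """Return the new path for an old URL, or None if no redirect applies."""
    match = OLD_PREFIX.match(path)
    return match.group(1) if match else None


def respond(path):
    """Return (status, location) roughly as a server might for a request path."""
    target = redirect_target(path)
    if target is not None:
        return 301, target  # 301 Moved Permanently
    return 200, path


print(respond("/weblog/stable-ids/"))  # (301, '/stable-ids/')
print(respond("/stable-ids/"))         # (200, '/stable-ids/')
```

The old link still works, so nobody lands in 404-land, but the canonical URL is now different, and so is any ID derived from it.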

Fool me once…

So now I’ve included an id field in the front matter of all of my posts which can be used for the entry ID in my feed. For existing posts, I made sure that it was the same as the IDs currently in the feed, so that hopefully we won’t have a repeat of the aforementioned Feed Fiasco. For future posts, I’m using a tag URI.5 For the “specific” part of the URI I’ve decided to use an md5 hash of the original file slug. You could use the file slug directly (that is perfectly valid), but I feel like the temptation would be too great to change that part of the URI if I ever ended up renaming the file, so I hide the original slug away in an md5 hash to eliminate that temptation. Now the page’s ID is just a string of meaningless characters that I won’t be tempted to ever change.

To save myself a little trouble (too late), I created a text expansion in Espanso from :tag to the URI that lets me enter the file slug and the authority name in a form and does all the hashing and date calculation for me.
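The hashing and date calculation can be sketched in a few lines of Python. This is not the Espanso expansion itself, just an approximation of what it produces, assuming the general tag URI shape tag:authority,date:specific. The function name, the example.com authority, and the use of a full YYYY-MM-DD date are all assumptions.

```python
import hashlib
from datetime import date


def make_tag_uri(authority, slug, minted_on):
    """Build a tag URI of the form tag:<authority>,<YYYY-MM-DD>:<md5 of slug>."""
    # Hash the slug so the ID carries no readable trace of the file name.
    digest = hashlib.md5(slug.encode("utf-8")).hexdigest()
    return f"tag:{authority},{minted_on.isoformat()}:{digest}"


# Hypothetical authority, slug, and minting date.
print(make_tag_uri("example.com", "stable-ids", date(2024, 1, 15)))
```

The important part is what happens next: the result gets pasted into the post’s front matter once and is never recomputed, so renaming the file later has no effect on the ID.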

This may seem like an overreaction to the problem, but having had my feed reader filled up with “new” old posts a couple of times in the past (for probably the very reason we’ve been discussing), and now having caused this problem for all of you, I’m keen to avoid relying on dynamically generated IDs for my feeds. The only way I can really guarantee that the entry ID is stable is if I set it and it only ever changes if I change it. If I derive it dynamically from anything — the URL of the page, the published date — there is always the chance that it could change by accident because I’m changing something seemingly unrelated.

In theory, something like the published date should never change. If I have a major update to a page about which I want to notify readers and user agents alike, I add an updated field with the date of the change so that the original published date remains intact. But I can’t say with certainty that I will never want to change the published date, so even that is not a reliable source of a permanent ID for my feed entries. So a hard-coded ID it is.
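Hard-coding IDs does move the burden onto the author, so a build-time check can help catch mistakes. The sketch below assumes each post has already been parsed into a dict with a hypothetical source file name and an optional id taken from its front matter. The scheme test is only a rough heuristic for “looks like a URI”, not full validation.

```python
from urllib.parse import urlparse


def check_entry_ids(posts):
    """Return a list of problems found in the posts' hard-coded IDs."""
    problems = []
    seen = {}  # entry ID -> source file that first used it
    for post in posts:
        source = post["source"]
        entry_id = post.get("id")
        if not entry_id:
            # Never fall back to a derived ID; report the gap instead.
            problems.append(f"{source}: missing id")
            continue
        if not urlparse(entry_id).scheme:
            problems.append(f"{source}: id is not a URI: {entry_id}")
        if entry_id in seen:
            problems.append(f"{source}: id duplicates {seen[entry_id]}")
        else:
            seen[entry_id] = source
    return problems


# Hypothetical posts: one valid, one missing an ID, one bare UUID, one duplicate.
posts = [
    {"source": "a.md", "id": "tag:example.com,2024-01-15:abc123"},
    {"source": "b.md"},
    {"source": "c.md", "id": "123e4567-e89b-12d3-a456-426614174000"},
    {"source": "d.md", "id": "tag:example.com,2024-01-15:abc123"},
]
for problem in check_entry_ids(posts):
    print(problem)
```

Failing the build on any reported problem keeps the ID something that only changes when it is changed on purpose.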

Wow, are you still reading?

It’s tempting to call this an object lesson in the dangers of copying and pasting code from the web, but I don’t think that’s really the issue here. I knew the URLs were being used as IDs in my Atom feed — I’d read the template carefully enough to understand what it was doing — so I could have recognized the risks of moving a bunch of pages on my site if I’d stopped to think it through. But I didn’t.

Rather, I think the lesson for me is do not create arbitrary dependencies in your data in the name of automation. There really is no reason why the ID of a post should change just because its URL changed. Same goes for the title, or the published date, or anything else. The ID really is its own thing, and it should probably not be derived from any other properties. Absolutely everything about an entry in the feed can change — up to a point — while still remaining conceptually the same entry, so the ID should be completely decoupled from everything else.


Footnotes

  1. Minimum viable product ↩︎

  2. Is my sarcasm coming through? ↩︎

  3. Raise your hand if you remember CGI scripts. ↩︎

  4. At this point I find myself hoping that whatever incarnation of this website you happen to be reading has a very legible body font so that it’s easy to see the difference between URL and URI, otherwise I’m sure you must be very confused. ↩︎

  5. Because the ID does have to be a URI, so you can’t — as I initially planned to do — just chuck a UUID in there and call it a day. Fortunately, I decided to run my freshly minted ID for this post through an Atom feed validator right before hitting publish, and it squawked at me. ↩︎