Link rot is inevitable

@da_poling | May 1, 2024

The web was not designed for permanence. URLs are strings. They point to resources that exist at the discretion of whoever controls the server. HTTP has redirects for resources that move, but nothing obliges anyone to issue them, and there is no contract between the document that links and the document being linked to. When something breaks, nothing tells you.

This is not an oversight. It is a consequence of how the web was designed: decentralized, loosely coupled, optimistic. Those properties are why it scaled. They are also why link rot is not a bug you can patch. It is a structural condition.

Working on large editorial platforms made this concrete. Sites with years of content, actively maintained, would accumulate dead references silently. The CMS had no opinion about links. It stored them as text fields and considered its job done. When a URL changed, because a section was restructured, because a campaign ended, because an external domain expired, nothing downstream was notified. The reference just stopped working.

The standard response to this is a crawler. Build something that traverses the link graph, checks HTTP status codes, and surfaces what is broken. It works. The edge cases are interesting (soft 404s, redirect chains, rate limiting that makes live endpoints look dead), but they are solvable. Crawling is a tractable problem.
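To make those edge cases concrete, here is a minimal sketch of the classification step of such a crawler, in Python. It assumes the fetching has already happened and only decides what a response means. The `Response` shape, the hop threshold, and the soft-404 phrases are all hypothetical choices for illustration, not taken from any real crawler.

```python
from dataclasses import dataclass

# Hypothetical thresholds and phrases; tune them for the site being checked.
MAX_REDIRECT_HOPS = 5
SOFT_404_PHRASES = ("page not found", "no longer available")


@dataclass
class Response:
    status: int         # final HTTP status code after following redirects
    redirect_hops: int  # how many redirects were followed to get there
    body: str           # response body as text (may be empty)


def classify(resp: Response) -> str:
    """Return one of: 'ok', 'broken', 'soft_404', 'redirect_chain', 'retry'."""
    # 429 and 503 usually mean "slow down", not "gone": check again later
    # instead of reporting a live endpoint as dead.
    if resp.status in (429, 503):
        return "retry"
    if resp.status >= 400:
        return "broken"
    # The link still resolves, but through a suspiciously long chain.
    if resp.redirect_hops > MAX_REDIRECT_HOPS:
        return "redirect_chain"
    # A 200 whose body reads like an error page is a soft 404.
    text = resp.body.lower()
    if resp.status == 200 and any(phrase in text for phrase in SOFT_404_PHRASES):
        return "soft_404"
    return "ok"
```

Phrase matching is a crude soft-404 heuristic and will misfire on pages that merely discuss missing pages. The point is only that each edge case reduces to a rule you can write down.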

What crawling does not solve is the architecture that makes the problem recur. A crawler is a diagnostic run after the fact. It finds rot that has already happened. The more useful question is why content systems have no model of their own dependencies.

A link in most CMS implementations is indistinguishable from any other string. The system that stores it has no awareness of what it points to, no relationship between the content that contains it and the content it references. Moving a page is a unilateral action. The system does not know what breaks. You find out later, from a crawler, or from a user.

The architectural fix is to treat links as relationships rather than strings, to give the content system a dependency graph it maintains continuously rather than one you reconstruct periodically from the outside. That is a harder problem than crawling, and it requires rethinking how content models are designed at a fairly fundamental level.
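A toy version of the idea, in Python, might look like the sketch below. It assumes links are stored as edges between page IDs and that a URL is just an attribute of a page. Every name here (`LinkGraph`, `move_page`, and so on) is hypothetical; no CMS is being described.

```python
from collections import defaultdict


class LinkGraph:
    """Hypothetical in-memory model: links are edges between page IDs."""

    def __init__(self):
        self.url_of = {}                  # page_id -> current URL
        self.outbound = defaultdict(set)  # page_id -> page IDs it links to
        self.inbound = defaultdict(set)   # page_id -> page IDs linking to it

    def add_page(self, page_id, url):
        self.url_of[page_id] = url

    def link(self, source_id, target_id):
        # A link can only point at a page the system knows about.
        if source_id not in self.url_of or target_id not in self.url_of:
            raise KeyError("both pages must exist before they can be linked")
        self.outbound[source_id].add(target_id)
        self.inbound[target_id].add(source_id)

    def resolve(self, source_id):
        """Current URLs of everything source_id links to."""
        return sorted(self.url_of[t] for t in self.outbound[source_id])

    def move_page(self, page_id, new_url):
        """Change a page's URL and report which pages depend on it."""
        self.url_of[page_id] = new_url
        # Dependents hold an ID, not a URL string, so nothing breaks.
        return sorted(self.inbound[page_id])

    def delete_page(self, page_id):
        """Refuse to delete a page that other pages still reference."""
        dependents = self.inbound[page_id]
        if dependents:
            raise ValueError(f"still referenced by: {sorted(dependents)}")
        for target in self.outbound.pop(page_id, set()):
            self.inbound[target].discard(page_id)
        self.inbound.pop(page_id, None)
        del self.url_of[page_id]
```

With this shape, moving a page stops being a unilateral action: the system can answer "what depends on this?" at the moment of the change. It says nothing about external links, which the system does not own and which would still need a crawler.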

Nobody has built that well yet. The reasons are more organizational than technical.