If we’re not careful we’ll lose the web

The ephemeral nature of the contemporary web means our time will be seen as a second dark age

If you happen to walk down Woodlands Road in Glasgow you’ll come across a statue honouring Bud Neill, a cartoonist from the city. The statue, of Neill’s famous character Lobey Dosser, is truly unique: it’s the only equestrian monument in the world where the horse has two legs.

I’m a sucker for a good fact like that. I’m still waiting for the moment in a pub quiz where I can use that as a correct answer, sitting triumphant while everyone else cries at the unfairness of it all.

Here’s another equally obscure fact just begging to find use as a pub quiz question: Thomas Jefferson is the only two-term US president never to veto a bill passed by Congress. Oooooooooh.

Jefferson was a fascinating and paradoxical man. Despite owning slaves until his death he was vociferously anti-slavery. When drafting the Declaration of Independence he even went to the trouble of criticising King George III for daring to engage in the slave trade:

He has waged cruel war against human nature itself, violating its most sacred rights of life and liberty in the persons of a distant people who never offended him, captivating and carrying them into slavery in another hemisphere, or to incur miserable death in their transportation thither.

You won’t find this condemnation in the final version of the Declaration, of course: delegates from the slave-owning states of Georgia and South Carolina made sure to remove it. But we know Jefferson intended to include it because copies of his early drafts exist to this day. We can see the annotations, words, and paragraphs he and his contemporaries removed, added, and rewrote. These details illuminate the process, revealing the otherwise hidden pressures Jefferson faced, the discussion that must have gone on between the congressional delegates, and the compromises they made.

In contrast, modern authors produce perhaps one or two copies before publication, saving draft over draft in a single digital file and losing an incalculably precious insight into their thoughts and ideas. Blog software and content-management systems allow you to ‘save as draft’, overwriting everything that’s gone before until you finally thrust your carefully chosen words into public view.

Cory Doctorow, an author and journalist, wasn’t comfortable with this so he did something about it. Together with Thomas Gideon he created software called Flashbake that allows writers to automatically save individual drafts every fifteen minutes. Flashbake uses Git, a revision control system, allowing every change between drafts to be stored. No information is lost. More people should be using this, and I urge you to take a look. I did something similar but more prosaic by saving drafts of this article on GitHub.
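To make the idea concrete, here is a minimal sketch of periodic draft snapshots with Git. It is not Flashbake’s own code; the function names (`run_git`, `snapshot_draft`) and the commit message format are assumptions made for illustration, and it presumes the draft already lives inside a Git repository.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, List, Optional

# A runner takes git arguments and a repository path and returns git's exit code.
Runner = Callable[[List[str], Path], int]


def run_git(args: List[str], repo: Path) -> int:
    """Run `git <args>` inside the repository and return its exit code."""
    return subprocess.run(["git", *args], cwd=repo, check=False).returncode


def snapshot_draft(
    repo: Path,
    draft: str,
    run: Runner = run_git,
    now: Optional[datetime] = None,
) -> bool:
    """Commit the draft if it has changed since the last snapshot.

    Returns True when a new commit was requested, False when nothing changed.
    """
    # Stage the current state of the draft.
    run(["add", "--", draft], repo)

    # `git diff --cached --quiet` exits with 0 when nothing is staged
    # and with 1 when there are staged changes waiting to be committed.
    if run(["diff", "--cached", "--quiet"], repo) == 0:
        return False

    # Hypothetical message format: a timestamp is enough to order the drafts.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%S UTC")
    run(["commit", "-m", f"Draft snapshot {stamp}"], repo)
    return True
```

Calling `snapshot_draft(Path("my-article"), "article.txt")` from a scheduler every fifteen minutes would leave a trail of commits, one per changed draft, that `git log` and `git diff` can later replay.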

The publisher Pragmatic Programmers is equally forward-thinking: as an author you send them your drafts by using a revision control system called Subversion. Like Doctorow and Gideon’s system this saves every change you’ve made, to the point it knows which letters you’ve added and deleted. Once you send those changes the draft is run through their publishing system and you can see how your new draft looks. More importantly every draft you’ve sent them is saved permanently.

A final example is Wikipedia. When an article is changed the new version doesn’t overwrite the old one but is stored alongside it. You can see what the article looked like three hours ago, last Wednesday, last month, or even in January 2001. Although a primary aim of this is to combat vandalism (I once saw Ed Balls referred to as Ed ‘Two Balls’ Balls throughout an article; it was quickly corrected) it will also allow future historians to undertake research into the processes behind early online collaborative writing, and see how the world’s largest encyclopaedia came to be. Historians will also be able to see when and where ideas originated, how they formed, and how they changed.
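The principle of storing versions alongside one another rather than overwriting them can be sketched as an append-only list of revisions. This is an illustration only, not how Wikipedia’s software is actually built; the `Article` and `Revision` classes and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One saved state of an article; frozen so it can never be altered."""
    text: str
    author: str
    saved_at: datetime


@dataclass
class Article:
    title: str
    revisions: List[Revision] = field(default_factory=list)

    def save(self, text: str, author: str, saved_at: datetime) -> None:
        """Store a new revision next to the old ones instead of replacing them."""
        self.revisions.append(Revision(text, author, saved_at))

    def current(self) -> Revision:
        """Return the most recently saved revision."""
        return self.revisions[-1]

    def as_of(self, moment: datetime) -> Optional[Revision]:
        """Return the latest revision saved at or before `moment`, if any."""
        earlier = [r for r in self.revisions if r.saved_at <= moment]
        return max(earlier, key=lambda r: r.saved_at) if earlier else None
```

Undoing vandalism in this model is simply saving an older text again as a new revision, so even the vandalised version stays on record for anyone who later wants to study it.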

If Wikipedia is still around.

How many web pages — entire web sites, even — disappear every day? When was the last time you saw a ‘Page not found’ message? Not too long ago I’d imagine.

Corporate web sites are rebuilt without a care for their predecessor, blogs are closed down, domains are not renewed. An increasingly vast portion of Western and world culture is written and stored online, yet we care little for its longevity. The British Library, the Deutsche Nationalbibliothek, and Landsbókasafn Íslands all demand copies of every book published in their respective countries, while we sit and wallow in complacency as swathes of writing come and go on the web.

Let’s take a sojourn to 1537 — the 28th of December 1537 to be precise. That day saw the invention of legal deposit: the Montpellier Ordinance was signed into law, specifying that all printers and publishers in France must forward a copy of every newly-published book to Melin de Saint Gelais, librarian to the king at Fontainebleau. If they didn’t they’d be penalised — and more than just a slap on the wrist too.

It took over seventy years for the English-speaking world to catch on to the idea. In 1610 Sir Thomas Bodley of the University of Oxford made a private agreement with the Master of the Stationers’ Company, which at the time had a monopoly over publishing. They agreed that one perfect copy of every book should be sent to the university library, known as the Bodleian, although as they didn’t include any penalties for not doing so there were plenty of publishers who didn’t bother.

A further hundred years on and all that changed with the Statute of Anne. Publishers were legally obliged to send a copy of every book to the nine deposit libraries of Great Britain, one being the Bodleian. The Stationers’ Company lost its monopoly and copyright was given to authors.

It’s clear, says John Gilchrist in the QUT Law and Justice Journal (PDF), that the Statute of Anne wasn’t just there to fill public libraries at private expense but ‘an instrument to gather a full and permanent record of nation’s printed works and a record of all the branches of knowledge contained within those works’.

In other words eighteenth-century England realised something we seem to have forgotten: to save our culture and knowledge we need to save our written works.

Since then legal deposit, as this system’s known, has expanded so the deposit libraries (now numbering six in the United Kingdom) must receive a copy of all books, pamphlets, magazines, newspapers, printed music, maps, plans, charts, and sound and film recordings. Which makes it all the more surprising that works published on the web aren’t included.

Depositing the web

Just as they were at the forefront back in 1537 so the French are now: since 2006 the Bibliothèque nationale de France has been responsible for the legal deposit of the French web. They’re crawling the French-language web and storing every web site they find. The Internet Archive has done something similar for as much of the web as it can since 1996.

The difficulty they face is that storing publications today is far more complicated than the problem that vexed librarians three hundred years ago. A book is a physical object; you have one copy and you have a perfect duplicate of what the author and publisher intended.

An electronic document is far more flexible. It changes continually. An electronic document doesn’t even need to be one coherent thing: one hundred years ago an article in a newspaper was a set of ordered words, but now on the web it may be a set of bytes stored in many tables across a database, presented using HTML, CSS, and images stored on a disk. Most of these components won’t even be stored on the same computer or in the same physical location. Some will be updated while others stay unchanged. How do we quantify what depositing a newspaper article means today?

And there’s one further thing that makes new media utterly different to traditional media, the one thing that makes the web such a profoundly new idea: the hyperlink. The web is exactly that, a web of links, a web of documents — larger than we’ve ever known — intricately connected to one another. Nothing stands on its own anymore. Relationships are fundamental.

You might preserve the content of a web page when you archive it but you lose its relationships with other web documents. When Barack Obama became president of the United States the previous eight years of the president’s web site disappeared. It’s archived somewhere but all the pages on the web that linked to it lost that relationship when the site moved. We end up not with web pages but mere pages, stripped of context, alone like leaves fallen from the tree.
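One modest step an archiver could take is to record a page’s outgoing links alongside its content, so that at least half of each relationship survives. The sketch below uses Python’s standard HTML parser; the `LinkCollector` class and `outbound_links` function are hypothetical names, and the approach does nothing for the harder half of the problem, the links pointing in from other sites.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the target of every <a href="..."> in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def outbound_links(html: str, base_url: str) -> List[str]:
    """Return the absolute URLs a page links to, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Stored next to the archived page, such a list lets a future reader see what the page once pointed to, even when those destinations have moved or vanished.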

When I link to Barack Obama’s White House I need to know it will always be there. When someone dies their web site must live on. When someone takes a photo the scene they capture must stay captured, forever throwing light on that ephemeral moment. Otherwise in a century’s time this won’t be known as an age of innovators but rather as a second dark age: our ideas, ambitions, wants, achievements, the very things we stand for, lost, permanently. We will lose our own Samuel Pepys or Anne Frank before we have the chance to discover them.