Is there such a thing as the perfect data format? No, of course not, but does anything come close? Yes. Trusty old comma-separated values, or CSV.
CSV gets a lot of flak and I think it’s due a little TLC. It doesn’t excite anyone, it’s unfashionable, and it’s old technology — these are all good things for a data format, where you don’t want fast-changing fads to get in the way of data communication. Yes, it has its blemishes, but who doesn’t? It’s an excellent fit for statistical data, so do away with the trouble of finding that perfect format and demand CSV for six supremely practical reasons:
CSV isn’t proprietary. CSV has existed for decades and no-one owns the format. You needn’t worry about paying to use it or buying proprietary software to open and save it. Every spreadsheet application supports it and since CSV is open and unchanging, every spreadsheet application will continue to support it for a long time.
Excel supports CSV. Whether we like it or not, much of the data that comes from governments, statistics agencies, and companies is stored in Excel spreadsheets, and while these are theoretically machine-readable they tend towards an ambiguity and complexity that’s difficult for computer programs to understand. The older and more widely-used Excel formats are proprietary (newer versions aim to change that but haven’t been entirely successful) and contain bugs; macros and formulas abound; pie charts are embedded all over the place; and the data hierarchies created by its users (I include myself here) are often hard for a program to untangle. Saving a spreadsheet to CSV solves many of these problems, and either you or your source can convert an Excel spreadsheet to a CSV with a few clicks of a mouse.
CSV and non-technical people are friends. You’re not likely to be able to demand that data is provided in a particular format, and you’re even less likely to be able to demand that wonderful format you’ve invented. You’ll be lucky to get Excel documents. So asking for CSV is a good, low-risk bet: people can understand it, and non-technical staff can make it for you.
CSV is tabular data. If you want to keep the data permanently, or if you’re going to do any serious data manipulation, you’re almost certainly going to put it in a relational database. CSV is very well suited for this because its structure is identical to a database table. It won’t be in third normal form, but it’s easy to convert into third normal form, and it’s easy to pivot programmatically if you need to.
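To sketch that table-to-database fit, here’s how a CSV could be loaded straight into a SQLite table using only Python’s standard library. The file contents, table name, and column names here are invented for illustration:

```python
import csv
import io
import sqlite3

# A small CSV held in memory; real code would use open("data.csv", newline="").
data = io.StringIO("id,country,population\n1,Iceland,370000\n2,Malta,520000\n")

rows = list(csv.reader(data))
header, records = rows[0], rows[1:]

conn = sqlite3.connect(":memory:")
# Column names come straight from the CSV header; quote them in case
# they contain spaces or SQL reserved words.
cols = ", ".join(f'"{name}"' for name in header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE data ({cols})")
conn.executemany(f"INSERT INTO data VALUES ({placeholders})", records)

# Everything arrives as a string (see the metadata caveat below), so
# numeric comparisons need an explicit cast.
result = conn.execute(
    "SELECT country FROM data WHERE CAST(population AS INTEGER) > 400000"
).fetchone()[0]
print(result)  # prints Malta
```

One header row, one `CREATE TABLE`, one `executemany` — the mapping really is that direct.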
CSV is incredibly easy to parse. CSV is unusual in that no formal specification exists for the format, but that doesn’t mean you’ll have difficulty parsing it with a computer program. The closest thing to a spec is RFC 4180; its definition of the format runs to seven bullet points and just over 300 words. And you’ll be hard-pressed to find a programming language that doesn’t come with a CSV parser built in.
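Python’s built-in csv module, for instance, handles the RFC 4180 quoting rules — commas, doubled quotes, and even newlines inside a quoted field — with no extra work on your part (the sample row below is made up):

```python
import csv
import io

# RFC 4180-style quoting: a comma and escaped (doubled) quotes inside one field.
raw = 'name,comment\nAda,"said ""hi"", then left"\n'
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['Ada', 'said "hi", then left']
```

Most other languages’ standard or de-facto-standard CSV libraries behave the same way.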
Tim Berners-Lee likes it. ‘Save the best for last’, as the saying goes, and this one’s a belter. Tim Berners-Lee, the man who invented the Web, has a five-star system for open data, and using CSV immediately gets you three stars: by making your data ‘available as machine-readable structured data […] plus non-proprietary format (e.g. CSV instead of Excel)’. Getting the fourth and fifth stars is more difficult (it involves a lot more theoretical heavy-lifting) but getting three stars from Tim Berners-Lee can only be a good thing.
CSV isn’t perfect: the most obvious downsides are its lack of support for metadata and character-encoding. Without metadata to describe the columns, everything is a string. If you do want metadata for your CSV you’ll either need to store it elsewhere — probably on a publicly-accessible server — or squeeze it into the data file itself in an ugly fashion.
The first idea is great if done correctly. To paraphrase Tim Berners-Lee, if you generate a small, separate, metadata file for each datafile the results can be harvested and, like the data itself, distributed as linked data. Any open dataset can be registered at the Datahub, data.gov.uk, and data.gov, among others.
But what’s more likely is that the metadata will be dumped at either the beginning or the end of the CSV file as if it were a second embedded set of CSV keys and values, and it will cause you some minor trouble.
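If you do end up with such a file, one workable (if crude) way around the trouble is to skip lines until the header row you expect appears. The metadata keys and header below are invented for illustration:

```python
import csv
import io

# A file with two metadata lines dumped above the real header -- a layout
# you may well receive; the "Source"/"Licence" keys here are made up.
raw = ("Source,Eurostat\n"
       "Licence,CC-BY\n"
       "year,value\n"
       "2019,42\n")

lines = raw.splitlines()
# Simple heuristic: discard everything before the known header row.
start = next(i for i, line in enumerate(lines) if line.startswith("year,"))
rows = list(csv.reader(io.StringIO("\n".join(lines[start:]))))
print(rows)  # [['year', 'value'], ['2019', '42']]
```

It’s inelegant, but so is the file, and a two-line heuristic beats editing every download by hand.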
There’s also the perennial problem of character encoding. A CSV file has no built-in way to declare what character encoding it uses, so you’re out of luck unless it’s been downloaded from a server that sends a Content-Type header — and even that shouldn’t be trusted. Instead, resign yourself to asking for a particular character encoding and cushioning yourself with a heuristic. Once you’ve done that you then need to parse (at least some of) the values in each column to decide whether they’re strings, numbers, booleans, and so on.
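A minimal sketch of both steps, assuming you’ve settled on an ordered list of candidate encodings to try — the encodings chosen and the type-guessing rules here are illustrative, not canonical:

```python
import csv

def read_csv_bytes(data: bytes):
    """Try a few encodings in order; UTF-8 first, then a lenient fallback.
    latin-1 never fails to decode, so it acts as the catch-all."""
    for encoding in ("utf-8-sig", "utf-8", "latin-1"):
        try:
            text = data.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    return list(csv.reader(text.splitlines()))

def coerce(value: str):
    """Guess a type for one CSV cell: bool, int, float, else string."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value

raw = "city,temp,frozen\nReykjav\u00edk,-1.5,true\n".encode("utf-8")
header, record = read_csv_bytes(raw)
typed = [coerce(v) for v in record]
print(typed)  # ['Reykjavík', -1.5, True]
```

In practice you’d sniff each column over many rows, not a single cell, before committing to a type.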
But don’t let those issues put you off: as Winston Churchill might have put it, CSV really is the least worst data format. It provides a format that is both programmatically easy to read and simple for non-technical people to manage. It might not be perfect but it comes as close as is practically possible.
This article originally appeared on the DataMarket blog.