How to use DuckDB’s ST_Read function to read and convert zipped shapefiles

Today I learnt …

DuckDB is a (relatively) new RDBMS designed for in-process analytical queries, but really it’s useful for all kinds of things. As a database it’s most similar to SQLite, but DuckDB’s killer feature for me is that it comes with a spatial extension. That makes it really approachable for geospatial data processing — think PostgreSQL + PostGIS but without the hassle of running a server.

DuckDB’s spatial extension provides an ST_Read function that lets you read spatial data from various file formats as if they were database tables. All that’s need to query a GeoJSON file named source.geojson is:

INSTALL spatial; -- You only need to install it once.
LOAD spatial;    -- But you need to load it each session.
SELECT * FROM ST_Read('source.geojson');

Under the hood, ST_Read uses GDAL, which means DuckDB supports a huge number of vector file formats. It also means that DuckDB supports GDAL’s more esoteric features, such as virtual file systems. The /vsizip/ prefix, for example, lets you SELECT from a file within a ZIP file, decompressing it on-the-fly. Shapefiles are often distributed as ZIP files (because a ‘shapefile’ is usually several files), and it can be annoying to need two steps to get to the file. But using /vsizip/ lets you query geospatial data directly from a ZIP file. Here’s how you could query a shapefile named bar.shp within a ZIP file named foo.zip, removing any features that don’t cover a point in Australia:

SELECT
    *
FROM
    ST_Read('/vsizip/foo.zip/bar.shp')
WHERE
    ST_Covers(geom, ST_Point(128.28, -21.69));

DuckDB’s spatial extension includes a COPY function to export geospatial tables to different formats. This works on ST_Read virtual tables, which means you can convert a zipped shapefile to, for example, a GeoJSON file in a single query:

COPY ⟨
    SELECT * FROM ST_Read('/vsizip/foo.zip/bar.shp')
⟩
TO 'some_file.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON');

And of course the SELECT sub-query can be as complicated as you need it to be, filtering out data as you go.