Building Open Data Portals in 2024

Wed Mar 20 2024

If you look into the Open Data ecosystem, you’ll realize that the tools and approaches used there are different from the ones used by data teams at small to medium companies.

I’m sure these differences are there for many reasons, the main one being that open data is not a technological problem to begin with. That said, I wanted to explore the idea of building open data portals using some of the tools, libraries, frameworks, and ideas I’ve been using and loving in the last few years.

The Pattern

The gist of the idea I experimented with is:

  1. Rely on open source tools (Python, DuckDB, Dagster, dbt) and formats (Parquet, Arrow).
  2. Use declarative and stateless transformations tracked in git.
  3. Split the workload into two phases: build and serve ¹.
    1. Build on your laptop, in GitHub Actions, or on a big server.
    2. Publish artifacts as static files.
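
To make the build and publish steps above a bit more concrete, here is a minimal sketch of the build phase. It is not how Datadex is implemented: to stay runnable anywhere it uses only the Python standard library and CSV as stand-ins for DuckDB, dbt, and Parquet. The sample rows, the `indicators` dataset name, and the `manifest.json` file are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical raw rows; a real portal would read these from a file or an API.
RAW_ROWS = [
    {"country": "ES", "year": "2022", "value": "10.5"},
    {"country": "ES", "year": "2023", "value": "11.0"},
    {"country": "FR", "year": "2023", "value": ""},  # missing value, dropped below
]


def clean(rows):
    """Stateless transformation: the same input rows always give the same output."""
    return [
        {"country": r["country"], "year": int(r["year"]), "value": float(r["value"])}
        for r in rows
        if r["value"] != ""
    ]


def build(rows, out_dir):
    """Build phase: run the transformation and write static artifacts to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cleaned = clean(rows)

    dataset_path = out_dir / "indicators.csv"
    with dataset_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["country", "year", "value"])
        writer.writeheader()
        writer.writerows(cleaned)

    # A small manifest (hypothetical format) so consumers can discover the files.
    manifest = {
        "datasets": [
            {"name": "indicators", "file": dataset_path.name, "rows": len(cleaned)}
        ]
    }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


if __name__ == "__main__":
    # Build into a throwaway directory; in CI this would be the folder you publish.
    with tempfile.TemporaryDirectory() as tmp:
        print(build(RAW_ROWS, tmp))
```

The serve phase is then just static hosting of the output folder. Locally, something like `python -m http.server --directory <out_dir>` is enough to try it out.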

The result of following these ideas is Datadex, an open-source, serverless, and local-first Data Platform that aims to improve how small to medium communities collaborate on Open Data. It is not a new tool in itself, only a pattern showing an opinionated bridge between some of the existing ones (DuckDB, Dagster, dbt, Quarto, …).

This pattern turned out to work very well, and I’ve even been using it in a couple of production-ready Data Portals (1, 2) since then (both public and private). Let’s take a look at some of the benefits.


What’s Next?

I’m planning to continue exploring this pattern and building more portals with it in the next few months. There are a few things I’d like to figure out, like how to build incremental models or how best to surface the available datasets and transformations to users.

I’m also looking for other projects that are following a similar approach to learn from them. Here are a few that have inspired me:

If you’re working on something similar or have any feedback, I’d love to hear from you!

¹ Or as Jake Thomas puts it:

  1. Build the thing using generalized tools
  2. Serve the thing using specialized tools for a particular workload

You can tune and fiddle with basically everything this way - cost, performance, availability, etc.
