Community Level Open Data Infrastructure

Sat Aug 10 2024

Inspired by Rufus Pollock, I decided to write a post about what I’ve learned and what I think the future of open data infrastructure will look like, more specifically, open data infrastructure at the community level.

As a data engineer at a small to medium sized company, I started wondering how we could make it easier to work with open data, or at least simpler, using the tools and frameworks from my daily job.

Since then, I’ve worked on a couple of data portals and actively participated in various open data communities, projects, and discussions.

[Image: Data Portal]

Learnings

One of the big things going on in the enterprise side is that working with data is becoming cheaper, simpler, and faster than ever. You can run entire auto-updating pipelines on GitHub Actions, use DuckDB to analyze gigabytes of information, run portable code across many databases, and publish beautiful dashboards with a few lines of code.
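
As an illustration, here is a minimal sketch of the DuckDB part of that workflow in Python; the file name and its columns are hypothetical:

```python
import duckdb  # pip install duckdb

# Aggregate a multi-gigabyte CSV in place, no database server required.
# "air_quality.csv" and its columns are made-up examples.
duckdb.sql("""
    SELECT city, avg(pm25) AS avg_pm25
    FROM read_csv_auto('air_quality.csv')
    GROUP BY city
    ORDER BY avg_pm25 DESC
""").show()
```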

I didn’t see much of this being used in the open data ecosystem. That’s understandable: the data field moves quite fast and, honestly, open data problems are mostly people problems rather than technical ones. However, there are technical solutions that can help streamline the process.

There are two big levels where people work on open data: at the government level, covering thousands of datasets (CKAN, Socrata, …), and at the individual level, where folks who are passionate about a topic (like Simon Willison) publish a few datasets about it. This results in lots of disconnected datasets, and you still have to scrape, clean, and join data from all the heterogeneous sources to answer interesting questions.

What I didn’t see much of was open data at the community level, where a group of people curates and publishes data that is useful to their community. Projects like PUDL and OWID’s ETLs are great examples of this.

And this level, small to medium sized communities, is where I think the future of open data lies.

I’d like to think of these kinds of portals as barefoot data portals, reusing the term from Maggie Appleton’s Home Cooked Software and Barefoot Developers.

Barefoot Data Portals

Barefoot Data Portals are community-driven, lightweight data infrastructure projects that bridge the gap between individual efforts and large-scale government initiatives. These portals are similar to what you would have in a company, but scrappier.

They are basically a GitHub repository with scripts that collect data from various sources, clean and join it, and publish useful datasets and artifacts for that community. Ideally, they are also simple to get started with and showcase data engineering best practices for curating and transforming data.
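
As a rough sketch, the heart of such a repository could be a single build script along these lines (the source URLs and column names are made up for illustration):

```python
import pandas as pd  # pip install pandas

# Hypothetical raw sources; a real portal would document each one.
SOURCES = {
    "stations": "https://example.org/stations.csv",
    "readings": "https://example.org/readings.csv",
}

def build() -> None:
    # Collect: download each raw source.
    stations = pd.read_csv(SOURCES["stations"])
    readings = pd.read_csv(SOURCES["readings"])

    # Clean: normalize column names and drop rows missing the join key.
    readings.columns = [c.strip().lower() for c in readings.columns]
    readings = readings.dropna(subset=["station_id"])

    # Join: enrich readings with station metadata.
    dataset = readings.merge(stations, on="station_id", how="left")

    # Publish: write the curated artifact the community will consume.
    dataset.to_csv("data/readings_enriched.csv", index=False)

if __name__ == "__main__":
    build()
```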

They try to manage the glue in the simplest way possible and provide the datasets in a format the community will actually use; for example, pushing data to a Google Sheet, publishing a CSV file, or uploading datasets to HuggingFace.
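
For instance, the Hugging Face route can be a few lines with the huggingface_hub library (the repo id below is a placeholder):

```python
from huggingface_hub import HfApi  # pip install huggingface_hub

# Authenticate first with `huggingface-cli login`.
# "my-community/air-quality" is a placeholder dataset repo.
api = HfApi()
api.upload_file(
    path_or_fileobj="data/readings_enriched.csv",
    path_in_repo="readings_enriched.csv",
    repo_id="my-community/air-quality",
    repo_type="dataset",
)
```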

Finally, they also get humans excited about the curated datasets by showcasing some potential use cases!

The largest impact of Barefoot Data Portals is in making curating, cleaning, and joining datasets a smoother process, thus bringing more people into the collaboration.

Future

We’ve seen a lot of progress in tools and infrastructure for working with data over the last few years, and we will probably continue to see compute capabilities grow alongside better tooling. One thing I’m excited about is the potential of modern WebAssembly-compatible tooling (like Polars, DuckDB, …) to make “big data” accessible from anyone’s browser.

If you’re excited about data engineering and curious about open data, find a community you’re passionate about and start a data portal. Scrape and clean some data, reach out to a few folks to ask how they would like to consume it, and start publishing datasets!
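
If you need a seed dataset, even a table scraped from a web page is enough to get going. Here is a minimal sketch, assuming a hypothetical page with an HTML table:

```python
import pandas as pd  # pip install pandas lxml

# pd.read_html extracts every <table> on a page into DataFrames.
# The URL is a placeholder; point it at a real page with a table.
tables = pd.read_html("https://example.org/community-stats.html")

raw = tables[0]
raw.to_csv("data/raw_community_stats.csv", index=False)
print(raw.head())
```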

Start a community data portal today - it’s an excellent opportunity to learn and contribute to open data!
