Open Data

Make Open Data compatible with the Modern Data Ecosystem.

Motivation

Open Data is a public good. As a result, individual [[incentives]] are not aligned with collective ones.

As an organization or research group, spending time curating and maintaining datasets for other people to use doesn’t make economic sense, unless you can profit from that. When a scientist publishes a paper, they care about the paper itself. They’re incentivized to. The data is usually an afterthought.

Combining data from different sources requires the user to reconcile differences in schemas, formats, assumptions, and more. This data wrangling is time-consuming, tedious, and needs to be repeated by every user of the data.

The Open Data landscape has a few problems.

Open Data can help organizations, scientists, and governments make better decisions. Data is one of the best ways to learn about the world and [[Coordination|coordinate]] people. Imagine if, every time you wanted to use a software library, you had to track down the original developer and hope they still had a copy. It would be absurd. Yet that’s essentially what we’re asking scientists to do with data.

There are three big areas where people work on open data: at the government level, covering thousands of datasets (CKAN, Socrata, …); at the scientific level (universities and research groups); and at the individual level, where folks who are passionate about a topic publish a few datasets about it. This results in lots of datasets that are disconnected, and you still have to scrape, clean, and join them from heterogeneous sources to answer interesting questions. One of the big ways that data becomes useful is when it is tied to other data. Data is only as useful as the questions it can help answer. Joining, linking, and graphing datasets together allows one to ask more and different kinds of questions.

Open protocols create open systems. Open code creates tools. Open data creates open knowledge. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.

Iterative improvements over public datasets yield large amounts of value (check how Dune did it with blockchain data)¹. Access to data gives people the opportunity to create new business and make better decisions. Data is vital to understanding the world and improving public welfare. Metcalfe’s Law applies to data too. The more connected a dataset is to other data elements, the more valuable it is.

Open Source code has made a huge impact in the world. Let’s make Open Data do the same! Let’s make it possible for anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code.

This document is a collection of ideas and principles to make Open Data more accessible, maintainable, and useful. It also recognizes that a lot of people are already working on this, that there are some amazing datasets, tools, and organizations out there, and that Open Data is roughly 80% a people problem. This document is biased towards the technical side of things, as I think that’s where I can contribute the most.

Why Now?

We have better and cheaper infrastructure: faster storage, better compute, and larger amounts of data. We need to improve our data workflows now. What does a world where people collaborate on datasets look like? The data is there. We just need to use it.

The best thing to do with your data will be thought of by someone else.

During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, …), execution frameworks (WASM), data standards (Arrow, Parquet, …), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets). “Small data” deserves more tooling and people working on it.

These trends are already making their way into movements like DeSci and smaller projects like Py-Code Datasets. But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. We should use similar tooling and approaches to manage the open datasets that surround us: a sort of Data Operating System.

One of the biggest problems in open data today is that organizations treat data portals as graveyards where data goes to die. Keeping these datasets up to date is a core concern (data has marginal temporal value), alongside using the data for operational purposes and showcasing it to the public.

Open data is hard to work with because of the overwhelming variety of formats and the engineering cost of integrating them. Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. Better and modern data tooling can reduce these costs.

Organizations like Our World in Data or 538 provide useful analysis but have to deal with dataset management, spending most of their time building custom tools around their workflows. That works, but it limits the potential of these datasets. Sadly, there is no `data get OWID/daily-covid-cases` or `data query "select * from 538/polls"` that could act as a quick and easy entry point to explore datasets.

We could have a better data ecosystem if we collaborated on open standards! So, let’s move towards more composable, maintainable, and reproducible open data.

¹ Blockchain data might be a great place to start building on these ideas as the data there is open, immutable, and useful.

Design Principles

Modules

Packaging

Package managers have been hailed as among the most important innovations Linux brought to the computing industry. The activities of both publishers and users of datasets resemble those of authors and users of software packages.
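To make the analogy concrete, a minimal and entirely hypothetical "fetch" step for a dataset package manager could look like the sketch below; the registry URL, dataset name, and on-disk layout are made up for illustration:

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical registry layout: a dataset "package" is a file published at a
# well-known URL, addressed by name and version. None of these URLs or names
# exist; they only illustrate the package-manager analogy.
REGISTRY = "https://example.org/datasets"
CACHE = Path.home() / ".cache" / "datapkg"

def fetch(name: str, version: str) -> Path:
    """Download a dataset once and reuse the local copy, like `pip install`."""
    target = CACHE / name / version / "data.parquet"
    if target.exists():
        return target  # already "installed"
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"{REGISTRY}/{name}/{version}/data.parquet", target)
    # A real tool would verify this checksum against the registry's metadata.
    print("sha256:", hashlib.sha256(target.read_bytes()).hexdigest())
    return target

# Usage: path = fetch("owid/daily-covid-cases", "1.0.0")
```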

Storage and Serialization

Transformations

Consumption

Frequently Asked Questions

Please reach out if you want to chat about these ideas or ask more questions.

1. What would be a great use case to start with?

I’d say chain-related data. It’s open, and people are eager to get their hands on it. I’m working in that area, so I might be biased.

2. Why should people use this instead of doing their own thing?

If everybody could converge on it, e.g: “datapackage.json” as a metadata and schema description standard, then an ecosystem of utilities and libraries for processing data could take advantage of it.
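As an illustration, here is a sketch of a minimal descriptor and a generic loader built on top of it; the dataset name and file paths are placeholders, and the real Frictionless spec defines many more fields:

```python
import json
import pandas as pd

# A minimal descriptor in the spirit of "datapackage.json": a name plus a
# list of resources, each pointing at a file. Names and paths are placeholders.
descriptor = {
    "name": "example-polls",
    "resources": [
        {"name": "polls", "path": "polls.csv", "format": "csv"},
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)

def load_resources(descriptor_path: str) -> dict[str, pd.DataFrame]:
    """Load every resource the same way, regardless of who published it."""
    with open(descriptor_path) as f:
        pkg = json.load(f)
    return {r["name"]: pd.read_csv(r["path"]) for r in pkg["resources"]}
```

The point is not this particular code, but that once the descriptor is standard, every downstream tool (validation, catalogs, query engines) can rely on it.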

3. What is the incentive for people to adopt it?

I wonder if there are ways to use novel mechanisms (e.g: DAOs) to incentivize people? Also, companies like Golden and index.as are doing interesting work on monetizing data curation.

4. How can LLMs help “build bridges”?

LLMs could infer schema, types, and generate some metadata for us. [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scraping websites)]], making it easy to create datasets from random sources.

They’re definitely blurring the line between structured and unstructured data too. Imagine pointing an LLM at a GitHub repository with some CSVs and getting an auto-generated datapackage.json back.
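The deterministic half of that is already easy; here is a sketch assuming plain pandas type inference, where the empty description fields are exactly what an LLM could fill in:

```python
import json
import pandas as pd

# Map pandas dtypes to Frictionless-style field types.
DTYPE_TO_TYPE = {"int64": "integer", "float64": "number", "bool": "boolean",
                 "object": "string", "datetime64[ns]": "datetime"}

def describe_csv(path: str, name: str) -> dict:
    """Build a datapackage.json skeleton for a single CSV.

    Types come from pandas' inference; the empty `description` fields are
    where an LLM could fill in human-readable documentation.
    """
    df = pd.read_csv(path, nrows=1000)  # a sample is enough for inference
    fields = [
        {"name": col, "type": DTYPE_TO_TYPE.get(str(dtype), "any"), "description": ""}
        for col, dtype in df.dtypes.items()
    ]
    return {"name": name,
            "resources": [{"name": name, "path": path, "schema": {"fields": fields}}]}

# print(json.dumps(describe_csv("cities.csv", "cities"), indent=2))
```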

5. How can we stream/update new data reliably? E.g: some datasets like Ethereum blocks could be updated every few minutes

I don’t have a great answer. Perhaps just push the new data into partitioned datasets?
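One concrete way to do that with today’s tooling, sketched with pyarrow (the block numbers, timestamps, and paths are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Every few minutes, append the newest rows (e.g. freshly produced blocks)
# as new files inside a Hive-style partitioned Parquet dataset.
# Existing partitions are never rewritten.
new_blocks = pa.table({
    "number": [17_000_001, 17_000_002],
    "timestamp": ["2023-04-01T00:00:11", "2023-04-01T00:00:23"],
    "date": ["2023-04-01", "2023-04-01"],  # partition column
})

pq.write_to_dataset(new_blocks, root_path="blocks/", partition_cols=["date"])
# Readers can then scan only the partitions they care about, e.g. date=2023-04-01.
```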

7. Is it possible to mount large amounts of data (FUSE) from a remote source and get it dynamically as needed?

It should be possible. I wonder if we could mount all datasets locally and explore them as if they were on our laptops.
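Not FUSE, but a related pattern that works today: DuckDB’s httpfs extension fetches only the byte ranges of a remote Parquet file it needs (the footer, then the projected columns), so large datasets can be explored without downloading them. A sketch, with a placeholder URL:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Only the footer and the requested columns/row groups are fetched over HTTP.
df = con.execute("""
    SELECT country, year, value
    FROM read_parquet('https://example.org/datasets/indicators.parquet')
    WHERE year >= 2020
    LIMIT 100
""").df()
```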

8. Can new table formats play efficiently with IPFS?

Parquet could be a great fit if we figure out how to deterministically serialize it and integrate it with IPLD. That would reduce overall size, as unchanged columns across versions could resolve to the same CID.

Later on I think it could be interesting to explore running delta-rs on top of IPFS.

9. How to work with private data?

Not sure. Homomorphic encryption?

10. How could something like Ver work?

If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs|knowledge graph]]!

11. How can a [[Knowledge Graphs|knowledge graph]] help with the data catalog?

It could help users connect datasets. With good enough core datasets, it could be used as an LLM backend.

12. How would a Substack for databases look like?

An easy tool for creating, maintaining, and publishing databases, with the ability to restrict parts or all of them behind a paywall. Pair it with the ability to send email updates to your audience about changes and additions.

13. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?

Indeed! With LLMs on the rise, community curated datasets become more important as they don’t appear in the big data dumps.

14. Wait, wait… What do you mean by “Open Data”?

I use it as a generic term to refer to data and content that can be freely used, modified, and shared by anyone for any purpose. Generally aligned with the Open Definition and the Open Data Commons.

Data Package Managers

Computation

Large Open Datasets

Open Data Organizations

Indexes

Open Data Driven Projects

Novel Technologies

Open Source Web Data IDE

After playing with Rill Developer, DuckDB, Vega, WASM, Rath, and other modern data IDEs, I think we have all the pieces for an awesome web-based BI/data-exploration tool. Some of the features it could have:

Could be an awesome front-end to explore [[Open Data]].

Relevant Projects

Datafile

Inspired by ODF, Frictionless and Croissant.

name: "My Dataset"
owner: "My Org"
kind: "dataset"
version: 1
description: "Some description"
license: "MIT"
documentation:
 url: "somewhere.com"
source:
 - name: "prod"
   db: "psql:/...."
pipeline:
 - name: "Extract X"
   type: image
   image: docker/image:latest
   cmd: "do something"
materializations:
 - format: "Parquet"
   location: "s3://....."
   partition: "year"
schema:
 fields:
  - name: "name"
    type: "string"
    description: "The name of the user"
  - name: "year"
  - description: "...."
 primary_key: "country_name"
metadata: "..."
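Since the Datafile format above is only a proposal, here is a rough sketch of how tooling might load and sanity-check one before doing anything with it; which keys count as required is an assumption:

```python
import yaml  # pip install pyyaml

# Which keys are mandatory is an assumption; the format is not standardized.
REQUIRED = {"name", "kind", "version", "schema"}

def load_datafile(path: str = "Datafile") -> dict:
    """Parse a Datafile and fail early if required keys are missing."""
    with open(path) as f:
        datafile = yaml.safe_load(f)
    missing = REQUIRED - datafile.keys()
    if missing:
        raise ValueError(f"Datafile is missing required keys: {sorted(missing)}")
    return datafile

# e.g. list the declared fields before materializing anything:
# for field in load_datafile()["schema"]["fields"]:
#     print(field["name"], field.get("type", "unknown"))
```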

Simple Package Manager Design

Unified Schema Design

Architecture

(Architecture diagram maintained in Excalidraw.)