Atmospheric Data Portals
Thu Mar 26 2026
I wrote this quickly, more as a sketch for people already familiar with atproto and the open data discovery challenges than as a fully self-contained post.
This post can be seen as a continuation of my Open Data wranglings. Check them out if you are interested in some ideas to solve issues at earlier stages of the Open Data pipeline.
In the last couple of days, I’ve come across two interesting projects that are working on making “datasets discoverable” using the AT Protocol: Matadisco and atdata. This is something I’ve been thinking about for a while, and I wanted to write down some thoughts and ideas!
The AT Protocol has a real chance to improve the main issue these projects point to: finding useful data is hard. A big part of the difficulty comes from the current state of things: isolated portals, weird APIs, lack of reputation and usage signals, … Turns out, data discovery is also a people problem! And the best hammer (er, protocol) we have for this kind of social problem is the AT Protocol.
The beauty of designing data portals on top of atproto is that we get packaging and indexing at the same time, more or less for free. Until now, data package managers have had to deal with both on their own [1].
So, how would I do it? Here are some ideas I haven’t seen in these projects and think are interesting.
- Take inspiration from existing flexible standards like Data Package, Croissant, and GEO ones for the core fields. Start with the smallest shared lexicon while leaving room for specialized extensions (sidecars?).
- Split datasets from “snapshots”. Say, io.datonic.dataset holds long-term properties like description and points to io.datonic.dataset.release or io.datonic.dataset.snapshot, which point to the actual resources.
- Add an optional DASL-CID field for resources so we “pin” the bytes.
- Core lexicon should be as agnostic as possible! That means:
  - Storage agnostic
    - Datasets can have multiple snapshots with different URIs. These can be mirrors, updates, forks, partitions of the dataset… They can also be produced by anyone, so I can back up some dataset on IPFS and provide that back.
    - Snapshots point to “resources”, which are URIs and perhaps an extra CID. E.g., s3://owid/2021-04-25-covid_19.csv
  - Format agnostic
    - Don’t force specific formats. Use URIs and optional MIME types and let consumers figure things out. It is better than asking people to migrate to a specific format.
- I’d keep anything related to schemas optional, as you can enforce them on specific file formats (e.g., Parquet vs. CSV).
- Bootstrap the catalog. There are many open indexes and organizations. Crawl them!
- Integrate with external repositories. E.g., a service that creates JSON-LD files from the datasets it sees appearing on the Atmosphere so Google Dataset Search picks them up. The same cron job could push data into Hugging Face or any other tool that people are already using in their fields.
- Convince and work with high-quality organizations doing something like this! I’d definitely collaborate with source.coop, for example.
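To make the dataset / snapshot / resource split from the list above a bit more concrete, here is a minimal Python sketch. The NSIDs io.datonic.dataset and io.datonic.dataset.snapshot are the ones floated above; everything else (the field names, the example DID and mirror URL, the to_record and same_bytes helpers) is hypothetical and only there for illustration. Two assumptions of mine: snapshots reference the dataset by AT URI, so that anyone can publish one from their own repo, and the optional CID is a CIDv1 over the raw bytes with SHA-256 in base32, which is how I read the DASL CID profile.

```python
import base64
import hashlib
from dataclasses import asdict, dataclass, field
from typing import Optional


def raw_sha256_cid(data: bytes) -> str:
    """CIDv1 string for raw bytes hashed with SHA-256 (base32, lowercase)."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = 32-byte digest
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for lowercase base32


@dataclass
class Resource:
    uri: str                         # any scheme: s3://, https://, ipfs://, ...
    mediaType: Optional[str] = None  # optional hint, consumers figure out the rest
    cid: Optional[str] = None        # optional pin of the exact bytes


@dataclass
class Dataset:
    name: str
    description: str  # long-term properties live here


@dataclass
class Snapshot:
    dataset: str  # AT URI of the dataset record (assumed direction of the link)
    resources: list[Resource] = field(default_factory=list)


def to_record(nsid: str, obj) -> dict:
    """Turn a dataclass into a plain record dict tagged with its lexicon NSID."""
    return {"$type": nsid, **asdict(obj)}


def same_bytes(a: Resource, b: Resource) -> bool:
    """Two resources are known to carry the same bytes only if both CIDs match."""
    return a.cid is not None and a.cid == b.cid


# Hypothetical example data
csv_bytes = b"date,cases\n2021-04-25,100\n"
cid = raw_sha256_cid(csv_bytes)
dataset_uri = "at://did:web:example.org/io.datonic.dataset/covid-19"

dataset = Dataset(name="covid-19", description="Daily COVID-19 case counts")
original = Snapshot(
    dataset=dataset_uri,
    resources=[Resource("s3://owid/2021-04-25-covid_19.csv", "text/csv", cid)],
)
# A mirror published by someone else: same bytes, different storage.
mirror = Snapshot(
    dataset=dataset_uri,
    resources=[Resource("https://mirror.example.org/2021-04-25-covid_19.csv", "text/csv", cid)],
)

dataset_record = to_record("io.datonic.dataset", dataset)
mirror_record = to_record("io.datonic.dataset.snapshot", mirror)
print(mirror_record["resources"][0]["uri"])
print(same_bytes(original.resources[0], mirror.resources[0]))
```

The point of the optional CID is that a consumer can treat the S3 original and the mirror as interchangeable without trusting either host: if the CIDs match, the bytes match.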
Basically, whatever comes out of this should fit existing storage, files, and publishing habits and not require migration into a blessed stack (programming language, platform, …). It should allow anyone to mirror, fix, annotate, and republish datasets.
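As an example of fitting existing publishing habits, the JSON-LD bridge mentioned in the list above could be as small as the following Python sketch. It maps a dataset record and its resources to schema.org Dataset markup, which is the vocabulary Google's dataset search reads. The input dicts and their field names (name, description, uri, mediaType) are hypothetical stand-ins for whatever the lexicon ends up defining; only the schema.org types and properties on the output side are real.

```python
import json


def to_schema_org(dataset: dict, resources: list) -> dict:
    """Map a (hypothetical) dataset record plus resources to schema.org JSON-LD."""
    distributions = []
    for res in resources:
        download = {"@type": "DataDownload", "contentUrl": res["uri"]}
        # The MIME type is optional on our side, so only emit it when present.
        if res.get("mediaType"):
            download["encodingFormat"] = res["mediaType"]
        distributions.append(download)
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset["name"],
        "description": dataset["description"],
        "distribution": distributions,
    }


# Hypothetical records, as a crawler might see them on the network
example_dataset = {
    "name": "covid-19",
    "description": "Daily COVID-19 case counts",
}
example_resources = [
    {"uri": "https://mirror.example.org/2021-04-25-covid_19.csv", "mediaType": "text/csv"},
    {"uri": "https://mirror.example.org/2021-04-25-covid_19.parquet"},
]

jsonld = to_schema_org(example_dataset, example_resources)
print(json.dumps(jsonld, indent=2))
```

A real service would also need to serve this inside a web page for crawlers to find it, and would probably prefer HTTP(S) mirrors over schemes like s3:// for contentUrl.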
There are many interesting ideas to follow up too! Curating datasets into collections, reputation, or simple things like linking datasets. I’m very excited to continue these discussions and see where we go! For now, I think starting with something like this would be enough to see some interesting atmospheric data portals pop up.
Footnotes
1. For example, Hugging Face Datasets offers a git-repo-like space for you to put the data and a metadata file. Since they own that space, they can search across all their datasets. Easy, but not open or decentralized. On the other hand, you have the Data Package spec, used by organizations like OWID and maintained by a more neutral actor. The issue there is discovery. The best you can do today is search GitHub. There are many more examples here.