On Blockchain Data Pipelines

Fri Apr 07 2023

I’ve spent the last few months working on indexing and building data pipelines for the Filecoin blockchain. While it’s been a great and exciting learning experience, I’ve realized the space can learn a few things from the so-called Modern Data ecosystem.

The main thing I’d love to explore is how, as a community, we’re building the same pipelines over and over and not collaborating on them. Let’s start with a bit of context.

Existing Projects

If you do a bit of browsing, you’ll find many companies and tools building ETLs for different blockchains. I compiled this non-exhaustive list of tools to index chains and companies providing the final datasets. I’m sure I missed a few, so please let me know if you know of any other projects!

Companies

Tools

The Problem

After compiling the list, I realized that only a few of these projects are open. We can read the source code of the chains we use, but can’t read the code of their data pipelines? That’s a bit weird. Especially when the data world is moving in the opposite direction 1!

As the folks from OpenDataCommunity pointed out, the data layer is currently mostly centralized 2 and closed source. Neither property reflects the open spirit of the movement.

On the other hand, projects like Airbyte, Estuary’s Flow, Meltano, Cloudquery, and many others from the Modern Data Stack (MDS) are not only building tools to extract, transform, and load data, but also working on standards and protocols. And the community is building on top of them. This makes it possible to have an end-to-end data pipeline in a matter of minutes:

  1. Connect a Stripe source to your warehouse. A few clicks.
  2. Import the data. Some extra clicks.
  3. Add a dbt Package to your project.
  4. Done. You’ve gone from zero to a somewhat complex report of your Stripe subscriptions!

Wouldn’t it be great if we could do the same with blockchain data? Add the tap-ethereum connector, install the dbt-ethereum package and start to collaborate! 3
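
To make the idea a bit more concrete, here is a rough sketch of what such a connector might emit, assuming it followed the Singer message format (SCHEMA, RECORD, and STATE messages as JSON lines) that Meltano taps use. Everything specific here is an assumption for illustration: tap-ethereum doesn't exist as far as this post is concerned, the blocks stream and its fields are hypothetical, and the sample blocks are made-up placeholders rather than real chain data.

```python
import json
import sys
from typing import Iterable, Iterator, TextIO

# Hypothetical stream name and schema; a real connector would define its own.
STREAM = "blocks"
SCHEMA = {
    "type": "object",
    "properties": {
        "number": {"type": "integer"},
        "hash": {"type": "string"},
        "timestamp": {"type": "integer"},
    },
}


def singer_messages(blocks: Iterable[dict]) -> Iterator[dict]:
    """Yield Singer-style SCHEMA, RECORD and STATE messages for the given blocks."""
    # The schema message comes first so a target knows the shape of the stream.
    yield {
        "type": "SCHEMA",
        "stream": STREAM,
        "schema": SCHEMA,
        "key_properties": ["number"],
    }
    last_number = None
    for block in blocks:
        yield {"type": "RECORD", "stream": STREAM, "record": block}
        last_number = block["number"]
    if last_number is not None:
        # Bookmark the last block seen so a later run could resume from it.
        yield {"type": "STATE", "value": {"bookmarks": {STREAM: {"number": last_number}}}}


def write_messages(messages: Iterable[dict], out: TextIO = sys.stdout) -> None:
    """Write one JSON document per line, which is how taps talk to targets."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


# Placeholder blocks with made-up values, standing in for a real node's responses.
SAMPLE_BLOCKS = [
    {"number": 1, "hash": "0xaaa", "timestamp": 1680825600},
    {"number": 2, "hash": "0xbbb", "timestamp": 1680825612},
]

if __name__ == "__main__":
    write_messages(singer_messages(SAMPLE_BLOCKS))
```

The point of the sketch is the contract, not the extraction logic: if the community agreed on stream names and schemas like these, anyone could swap the placeholder list for a real node client and everything downstream (targets, dbt packages) would keep working.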

Personally, I think projects like Trueblocks, Kamu, and Bacalhau hold the key to opening up the data layer a bit more 4, but that won’t be possible without a community and standards we can all agree upon and share.

Footnotes

  1. And blockchain data is such a great candidate for the open data movement. Chain data is open, verifiable, and immutable! All great properties for open data.

  2. And I totally get it. Centralization makes working with data a less painful job!

  3. Dune has the spellbook and is awesome. I wish they did something similar for their pipelines though.

  4. Bacalhau indexes the chain with Trueblocks, puts the data into Parquet files on IPFS and you query it from anywhere using DuckDB!
