Growing My Own Data Platform

Wed Apr 22 2026

It has been a couple of months since I wrote about Barefoot Data Platforms. Since then, I’ve been applying those ideas to a real data platform I’ve been rebuilding from scratch. I’ve learned a lot about what makes these platforms work well, especially with agents.

Building It

As a quick overview, the entire orchestration engine (fdp) is ~700 lines of Python. Assets are plain .sql or .py files with ugly but useful metadata in comments. No decorators, no registration, no config files. You can drop a file in assets/, and the next run picks it up, resolves dependencies, and materializes it! Delete the file, and it’s gone too.
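The post doesn't show fdp's actual metadata format, but the discover-and-resolve step can be sketched with the standard library alone. Here's a minimal version, assuming a made-up `-- dep:` / `# dep:` comment convention for declaring upstream assets (the real format may differ):

```python
import re
from pathlib import Path
from graphlib import TopologicalSorter

# Hypothetical convention: deps are declared in header comments,
# e.g. "-- dep: raw_deals" in SQL or "# dep: raw_deals" in Python.
DEP_RE = re.compile(r"^\s*(?:--|#)\s*dep:\s*(\w+)", re.MULTILINE)

def discover(assets_dir: Path) -> dict[str, set[str]]:
    """Map each asset name to the deps declared in its comments."""
    graph: dict[str, set[str]] = {}
    for path in sorted(assets_dir.glob("*")):
        if path.suffix not in {".sql", ".py"}:
            continue
        graph[path.stem] = set(DEP_RE.findall(path.read_text()))
    return graph

def run_order(graph: dict[str, set[str]]) -> list[str]:
    """Dependency-respecting materialization order."""
    return list(TopologicalSorter(graph).static_order())
```

Because discovery is just a filesystem scan, dropping a file into assets/ adds a node to the graph on the next run, and deleting it removes the node — no registration step in between.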

This minimalist, low-abstraction, filesystem-based data platform is a great fit for agents:

The DX feedback loop is tight. Change an asset, make check catches type and lint issues, materialize it, and fdp test catches data issues. Agents are able to self-correct quite reliably.
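The post doesn't describe what fdp test actually checks, but the general shape of such data checks — small predicates that return failures an agent can read and act on — might look like this (check names and signatures are hypothetical):

```python
def check_not_null(rows: list[dict], column: str) -> tuple[bool, str]:
    """Fail if any row has a null in the given column."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return (nulls == 0, f"{nulls} null(s) in {column}")

def check_row_count(rows: list[dict], at_least: int = 1) -> tuple[bool, str]:
    """Fail if the materialized result is suspiciously small."""
    return (len(rows) >= at_least, f"{len(rows)} rows (expected >= {at_least})")

def run_checks(rows: list[dict], checks: list) -> list[str]:
    """Run all checks; return failure messages for the agent to fix."""
    failures = []
    for check in checks:
        ok, msg = check(rows)
        if not ok:
            failures.append(msg)
    return failures
```

Returning plain failure messages (rather than raising on the first error) is what makes the loop agent-friendly: the whole list of problems lands in the context window at once.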

Inspired by Pi, I see this as a compelling self-modifying data platform. When I wanted to publish tables to Google Sheets, I asked the agent to build fdp publish gsheet. It read the existing R2 publish target, understood the pattern, and wrote the new one. The platform is growing to be exactly what I need, all within the context window of today’s agents!

This domain is not rocket science, so I’m convinced I can keep the fdp kernel small and push the different concepts into extensions. The kernel should only discover assets, resolve dependencies, execute assets, materialize results, run checks, and introspect. Everything outside the kernel belongs in userland: adapters (Python, SQL, Bash, notebooks, external tools); targets and materializers (tables, views, files, APIs, sinks); and publishers (GSheets, R2, APIs, and so on).
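One way to keep the kernel that small is to make it a pure dispatcher: userland adapters register themselves by file extension, and the kernel just looks them up. A sketch of that split, with invented names (fdp's real extension mechanism isn't shown in the post):

```python
from pathlib import Path

# Userland registry: adapters register themselves by file extension.
ADAPTERS: dict[str, callable] = {}

def adapter(ext: str):
    """Decorator for userland code to claim a file extension."""
    def register(fn):
        ADAPTERS[ext] = fn
        return fn
    return register

# Userland adapters — the kernel knows nothing about SQL or Python.
@adapter(".sql")
def run_sql(text: str) -> str:
    return f"executed SQL: {text[:20]!r}"

@adapter(".py")
def run_py(text: str) -> str:
    return f"executed Python: {text[:20]!r}"

def execute(filename: str, text: str) -> str:
    """Kernel-side: dispatch to whichever adapter claimed the extension."""
    ext = Path(filename).suffix
    if ext not in ADAPTERS:
        raise ValueError(f"no adapter for {ext}")
    return ADAPTERS[ext](text)
```

This is also what made the "ask the agent to build fdp publish gsheet" moment cheap: a new publisher is one more entry in a registry, patterned on an existing one, not a change to the kernel.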

Serving the Data

I also spent quite a while thinking about the end-user/stakeholder experience. Up until recently, users were redirected to a website with documentation on the Parquet files and a few examples and use cases (e.g., how to consume them from a public spreadsheet). This created friction, and only a handful of users relied on the public sheets or files.

Taking inspiration from Tempo’s agent-focused docs UX, I made a SKILL.md (auto-generated from live data and published alongside the datasets) that lists every available dataset with columns and descriptions, points to the Parquet files, and teaches agents how to query and visualize them. Anyone can open Codex, Pi, Claude, or any other agent and say:

Read https://filecoindataportal.xyz/SKILL.md and tell me which providers onboarded the most data last week

Nothing else is required. The agent reads the skill, fetches the Parquet file with whatever tools the user has (duckdb, JavaScript, …), runs the query, and answers. People have been doing exactly this with different agents, and it works better than I expected.

Users shouldn’t blindly trust the agent’s responses, but now they have something to poke with quick data requests that force them to be specific and clear. After a session, I can be notified with the clarified questions and the agent’s output.

Conclusion

I’m very happy with this iteration and approach so far. It’s fast both to iterate on and to work with. No major downsides for now, but I’m sure something will pop up!

As for the future, I’ll probably keep expanding the different resources (e.g., fdp can run SQL files on BigQuery) and continue thinking about the minimal kernel I need for this platform.
