DuckDB and BigQuery Storage API
Fri Apr 05 2024BigQuery has a not so well known API, Storage API, that let’s you grab a result set or table as Arrow datasets. It is cheaper than the standard query costs and integrates with all the rest of the Arrow ecosystem, like DuckDB.
You can test it out with this code:
import duckdb
from google.cloud import bigquery
bqclient = bigquery.Client()
table = bigquery.TableReference.from_string(
"bigquery-public-data.samples.shakespeare"
)
rows = bqclient.list_rows(table)
shakespeare = rows.to_arrow(create_bqstorage_client=True)
conn = duckdb.connect(":memory:")
conn.sql("""
select
word,
sum(word_count)
from shakespeare
group by 1
order by 2 desc
limit 10
""")
Gives you a result like this:
┌─────────┬─────────────────┐
│ word │ sum(word_count) │
│ varchar │ int128 │
├─────────┼─────────────────┤
│ the │ 25568 │
│ I │ 21028 │
│ and │ 19649 │
│ to │ 17361 │
│ of │ 16438 │
│ a │ 13409 │
│ you │ 12527 │
│ my │ 11291 │
│ in │ 10589 │
│ is │ 8735 │
├─────────┴─────────────────┤
│ 10 rows 2 columns │
└───────────────────────────┘
The code is also available as a Google Colab Notebook!