Python Data Engineering

Posted on May 12, 2025

A few weeks ago, I had the opportunity to present at the April 2025 Christchurch Python meetup. My talk focussed on modern approaches to storing and retrieving tabular and multi-dimensional data, leveraging cost-effective storage systems and decoupling storage from compute. The topics I covered included:

the Parquet file format
how columnar storage works
querying Parquet files with DuckDB and ClickHouse
Arrow
Apache Iceberg
various approaches to handling multi-dimensional data
the Zarr file format
Icechunk
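
As a rough illustration of the columnar idea behind formats like Parquet, here is a toy sketch in plain Python. It is not how Parquet is actually implemented, and the sample records and the `to_columns` helper are made up for this example. It only shows why reading one field is cheaper when values are grouped by column rather than by row.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records below are invented sample data, not real measurements.
rows = [
    {"city": "Christchurch", "year": 2025, "temp_c": 12.0},
    {"city": "Wellington", "year": 2025, "temp_c": 13.0},
    {"city": "Auckland", "year": 2025, "temp_c": 14.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: computing a mean has to visit every record and pick out one field.
mean_from_rows = sum(record["temp_c"] for record in rows) / len(rows)

# Column layout: the same mean reads a single list and ignores the other fields.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

print(columns["temp_c"], mean_temp)
```

Real columnar formats add much more on top of this, such as per-column encoding, compression and statistics, but the grouping of values by column is the starting point.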

For those interested, both the slides and demo code (zip) I used are available.