Python Data Engineering
A few weeks ago, I had the opportunity to present at the April 2025 Christchurch Python meetup. My talk focussed on modern approaches to storing and retrieving tabular and multi-dimensional data, leveraging cost-effective storage systems and decoupling storage from compute. The topics covered included:
- the Parquet file format
- how columnar storage works
- querying Parquet files with DuckDB and ClickHouse
- Apache Arrow
- Apache Iceberg
- various approaches to handling multi-dimensional data
- the Zarr file format
- Icechunk
For those interested, both the slides and the demo code (zip) I used are available.
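As a small taste of the columnar storage idea from the topic list, here is a minimal sketch in plain Python. It is not taken from the slides or the demo code: the sample records, the field names, and the `to_columns` helper are all hypothetical, and real formats such as Parquet add encoding, compression, and metadata on top of this basic layout.

```python
# Hypothetical sample records, invented for illustration only.
rows = [
    {"sensor": "a", "hour": 0, "reading": 10.0},
    {"sensor": "b", "hour": 0, "reading": 20.0},
    {"sensor": "c", "hour": 0, "reading": 30.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited just to read one field.
mean_from_rows = sum(r["reading"] for r in rows) / len(rows)

# Column layout: only the single list holding that field is read.
readings = columns["reading"]
mean_from_columns = sum(readings) / len(readings)

print(mean_from_rows, mean_from_columns)
```

Both calculations give the same answer. The difference is that the column layout touches only the `reading` values, which is roughly why analytical queries over a handful of columns tend to suit columnar formats.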