ref #207
Datafusion Table Provider for Icechunk Level 2 metadata schema
We're investigating options for how we can store granule-level metadata and array data in a single Icechunk store to support Level 2 and non-regularly gridded NASA datasets. As a first step, we need to design a "schema" for serializing tabular or document-based metadata into Zarr arrays, and then use a Datafusion table provider to query this "schema" for a subset.
The initial concept proposed by @rabernat is a `meta` Zarr group which contains a collection of 1D arrays, one per metadata property. At a minimum, they would have something like
meta
├── date
├── bbox
└── collection
- `date`: dtype is `numpy.datetime64`
- `bbox`: contains WKT and the dtype is `vlen`, or WKB and the dtype is `bytes` (not sure on extension dtype support here)
- `collection`: dtype is `vlen`
The initial table provider implementation should be able to use a query like
SELECT idx
FROM meta
WHERE date < TIMESTAMP '2025-07-01 00:00:00' AND ST_Intersects(
bbox,
ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))')
);
And return a list of indices (more on how we might tie these to their associated array data in a moment). My biggest question here is: how can Datafusion generate the most efficient incremental scan plan when we need to filter on multiple arrays?
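To make the expected result concrete, here is a minimal pure-Python sketch of what the query above computes over parallel 1D arrays. It is not a Datafusion implementation. The sample rows, the `boxes_intersect` and `query_indices` helpers, and the use of `(min_x, min_y, max_x, max_y)` tuples in place of WKT/WKB are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the 1D arrays in the `meta` group.
# Bounding boxes are (min_x, min_y, max_x, max_y) tuples instead of WKT/WKB.
meta = {
    "date": [
        datetime(2025, 6, 1),
        datetime(2025, 6, 15),
        datetime(2025, 7, 10),
        datetime(2025, 5, 20),
    ],
    "bbox": [
        (5.0, 5.0, 15.0, 15.0),    # overlaps the query box
        (20.0, 20.0, 30.0, 30.0),  # disjoint from the query box
        (1.0, 1.0, 2.0, 2.0),      # overlaps, but the date is too late
        (-5.0, -5.0, 0.5, 0.5),    # overlaps
    ],
}


def boxes_intersect(a, b):
    """Return True if two (min_x, min_y, max_x, max_y) boxes share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query_indices(meta, before, query_box):
    """Return row indices where date < before and bbox intersects query_box."""
    return [
        idx
        for idx, (date, bbox) in enumerate(zip(meta["date"], meta["bbox"]))
        if date < before and boxes_intersect(bbox, query_box)
    ]


# Same predicate as the SQL: date before 2025-07-01 and bbox intersecting
# the square from (0, 0) to (10, 10).
result = query_indices(meta, datetime(2025, 7, 1), (0.0, 0.0, 10.0, 10.0))
print(result)  # [0, 3]
```

This sketch scans both arrays in full for every query, which is exactly the cost an efficient scan plan would try to avoid.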
This table provider should be our initial goal, and then we can move on to ...
Datafusion Table Provider for Icechunk materialized indexes
Ideally, we'd like to be able to generate separate indices for these 1D arrays, store them as serialized arrays within Icechunk, and have Datafusion materialize them and use them for predicate pushdown operations when available. I'd propose trying to serialize a geo-index for `bbox` and some form of b-tree for `date`, and then see how we can interact with those.
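As a rough illustration of an index that is itself stored as arrays, the sketch below uses a sorted copy of the `date` values plus a parallel array of row ids, queried with binary search. This sorted-array layout is a simple stand-in for a b-tree, not a b-tree itself. The helper names `build_sorted_index` and `rows_before` are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def build_sorted_index(values):
    """Return (sorted_values, row_ids): two parallel 1D arrays that could be stored."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in order], order


def rows_before(sorted_values, row_ids, upper):
    """Return row ids with value < upper using binary search instead of a full scan."""
    cut = bisect_left(sorted_values, upper)
    return sorted(row_ids[:cut])


dates = [
    datetime(2025, 6, 15),
    datetime(2025, 7, 10),
    datetime(2025, 5, 20),
    datetime(2025, 6, 1),
]
sorted_dates, row_ids = build_sorted_index(dates)
matches = rows_before(sorted_dates, row_ids, datetime(2025, 7, 1))
print(matches)  # [0, 2, 3]
```

Both `sorted_dates` and `row_ids` are plain 1D arrays, so in principle they could be serialized alongside the metadata arrays. How Datafusion would discover and use them for pushdown is the open question here.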
@kylebarron These would be the initial asks for you to investigate. In parallel, @maxrjones and I can start to investigate how our proposed "middleware" library could use the `idx`s returned from the query to
- Retrieve the associated array chunks and present them as a Zarr-like structure that can be accessed via `zarr-python`
- Embed the requested metadata fields into the `.zattrs` for each presented array (as flat attributes for now).
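The second step could look roughly like the sketch below, which builds one flat attribute dictionary per returned `idx`. The `flat_attrs_for` helper, the sample values, and the collection names are hypothetical. Chunk retrieval and the actual write into `.zattrs` are left out.

```python
from datetime import datetime

# Hypothetical metadata columns, one entry per granule.
meta = {
    "date": [datetime(2025, 6, 1), datetime(2025, 5, 20)],
    "bbox": [
        "POLYGON((5 5, 5 15, 15 15, 15 5, 5 5))",
        "POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))",
    ],
    "collection": ["collection-a", "collection-b"],
}


def flat_attrs_for(meta, idx, fields):
    """Build a flat, JSON-friendly attribute dict for one granule."""
    attrs = {}
    for name in fields:
        value = meta[name][idx]
        # .zattrs is JSON, so datetimes are converted to ISO 8601 strings.
        attrs[name] = value.isoformat() if isinstance(value, datetime) else value
    return attrs


# Indices as they might come back from the metadata query.
presented = {idx: flat_attrs_for(meta, idx, ["date", "collection"]) for idx in [0, 1]}
print(presented[0])  # {'date': '2025-06-01T00:00:00', 'collection': 'collection-a'}
```

Keeping the attributes flat means each requested field maps to a single top-level key, which matches the "flat attributes for now" scope above.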
This is just some broad strokes to keep you busy while I'm out 😆 so feel free to comment / edit as you see fit.