Datafusion table provider for Icechunk Level 2 metadata schema

ref https://github.com/NASA-IMPACT/veda-odd/issues/207

### Datafusion Table Provider for Icechunk Level 2 metadata schema

We're investigating how to store granule-level metadata and array data in a single Icechunk store to support Level 2 and non-regularly gridded NASA datasets. As a first step, we need to design a "schema" for serializing tabular or document-based metadata into Zarr arrays, and then use a Datafusion table provider to query this "schema" for a subset.

The initial concept proposed by @rabernat is a `meta` Zarr group which contains a collection of 1D arrays, one per metadata property. At a minimum it would have something like:
```
meta
├── date
├── bbox
└── collection
```
- `date`: dtype is `numpy.datetime64`
- `bbox`: contains either WKT with dtype `vlen`, or WKB with dtype `bytes` (not sure on extension dtype support here)
- `collection`: dtype is `vlen`
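
To make the layout concrete, here is a rough sketch of pivoting row-oriented granule records into one 1D column per property. It uses plain Python lists as stand-ins for the Zarr arrays; the `to_columns` helper and the sample records are hypothetical, and the position in each column plays the role of `idx`.

```python
from datetime import datetime

# Hypothetical granule records; the values are illustrative only.
records = [
    {"date": datetime(2025, 6, 1), "bbox": "POLYGON((0 0, 0 5, 5 5, 5 0, 0 0))", "collection": "A"},
    {"date": datetime(2025, 7, 15), "bbox": "POLYGON((1 1, 1 3, 3 3, 3 1, 1 1))", "collection": "A"},
    {"date": datetime(2025, 5, 20), "bbox": "POLYGON((20 20, 20 30, 30 30, 30 20, 20 20))", "collection": "B"},
]


def to_columns(records, fields=("date", "bbox", "collection")):
    """Pivot row-oriented records into one equal-length 1D column per field."""
    columns = {name: [] for name in fields}
    for record in records:
        for name in fields:
            columns[name].append(record[name])
    return columns


# Each list stands in for one 1D array under the `meta` group.
meta = to_columns(records)
```

Every column has the same length, so a single integer identifies one granule across all of them.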

The initial table provider implementation should be able to run a query like:
```
SELECT idx
FROM meta
WHERE date < TIMESTAMP '2025-07-01 00:00:00' AND ST_Intersects(
    bbox,
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))')
);
```
And return a list of indices (more on how we might tie these to their associated array data in a moment). My biggest question here is how Datafusion can generate the most efficient incremental scan plan when we need to filter on multiple arrays.
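
As a reference for the expected result (not for how Datafusion would plan it), here is a minimal pure-Python sketch of the query above. It assumes `bbox` is simplified to a `(minx, miny, maxx, maxy)` tuple instead of WKT/WKB, and `query_idx` and `boxes_intersect` are hypothetical names. The cheap date filter runs first so the spatial test only touches surviving rows.

```python
from datetime import datetime

# Stand-ins for the 1D `date` and `bbox` arrays; values are illustrative only.
dates = [datetime(2025, 6, 1), datetime(2025, 7, 15), datetime(2025, 5, 20), datetime(2025, 6, 30)]
bboxes = [(2, 2, 4, 4), (1, 1, 3, 3), (20, 20, 30, 30), (8, 8, 12, 12)]


def boxes_intersect(a, b):
    """True when two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query_idx(dates, bboxes, before, window):
    """Return row positions where date < before and bbox intersects window."""
    # Stage 1: filter on the date column only.
    candidates = [i for i, d in enumerate(dates) if d < before]
    # Stage 2: read bbox values only for rows that survived stage 1.
    return [i for i in candidates if boxes_intersect(bboxes[i], window)]


idx = query_idx(dates, bboxes, datetime(2025, 7, 1), (0, 0, 10, 10))
```

Whether evaluating the filters in this order is actually the cheapest plan depends on chunking and selectivity, which is the open question.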

This should be our initial goal, and then we can move on to ...

### Datafusion Table Provider for Icechunk materialized indexes
Ideally we'd like to be able to generate separate indices for these 1D arrays, store them as serialized arrays within Icechunk, and have Datafusion materialize them and use them for predicate pushdown operations when available. I'd propose trying to serialize a geo-index for `bbox` and some form of b-tree for `date`, and then see how we can interact with those.
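
To illustrate the idea for `date`, here is a small sketch that swaps in a sorted array plus binary search for the b-tree, since that form is trivially serializable as two 1D arrays. The helper names are hypothetical and nothing here reflects an existing Icechunk or Datafusion API.

```python
from bisect import bisect_left
from datetime import datetime


def build_sorted_index(values):
    """Return (sorted_values, positions), where positions[k] is the original idx."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order


def idx_before(index, cutoff):
    """Return original positions whose value is strictly less than cutoff."""
    sorted_values, positions = index
    # bisect_left returns the first slot whose value is >= cutoff.
    end = bisect_left(sorted_values, cutoff)
    return sorted(positions[:end])


# Stand-in for the 1D `date` array; values are illustrative only.
dates = [datetime(2025, 6, 1), datetime(2025, 7, 15), datetime(2025, 5, 20), datetime(2025, 6, 30)]
date_index = build_sorted_index(dates)
matches = idx_before(date_index, datetime(2025, 7, 1))
```

With an index like this, the `date < ...` predicate becomes a binary search rather than a full scan of the column.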

@kylebarron These would be the initial asks for you to investigate. In parallel, @maxrjones and I can start to investigate how our proposed "middleware" library could use the `idx` values returned from the query to:
1. Retrieve the associated array chunks and present them as a Zarr-like structure that can be accessed via `zarr-python`.
2. Embed the requested metadata fields into the `.zattrs` for each presented array (as flat attributes for now).
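
For the second point, a rough sketch of what "flat attributes" could look like, assuming the metadata columns are available as plain Python lists. `flat_attrs` is a hypothetical helper, not part of `zarr-python`, and how the dicts get attached to the presented arrays is left open.

```python
from datetime import datetime


def flat_attrs(meta, idxs, fields):
    """Build one flat, JSON-friendly attribute dict per requested idx."""
    result = {}
    for i in idxs:
        attrs = {}
        for name in fields:
            value = meta[name][i]
            # Attributes are stored as JSON, so datetimes become ISO 8601 strings.
            attrs[name] = value.isoformat() if isinstance(value, datetime) else value
        result[i] = attrs
    return result


# Stand-ins for the `meta` columns; values are illustrative only.
meta = {
    "date": [datetime(2025, 6, 1), datetime(2025, 7, 15), datetime(2025, 5, 20)],
    "collection": ["A", "A", "B"],
}
attrs = flat_attrs(meta, idxs=[0, 2], fields=("date", "collection"))
```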

These are just some broad strokes to keep you busy while I'm out 😆 so feel free to comment / edit as you see fit.



