
Datafusion table provider for Icechunk Level 2 metadata schema #212

@sharkinsspatial

ref #207

Datafusion Table Provider for Icechunk Level 2 metadata schema

We're investigating how to store granule-level metadata and array data in a single Icechunk store to support Level 2 and non-regularly gridded NASA datasets. As a first step, we need to design a "schema" for serializing tabular or document-based metadata into Zarr arrays, and then use a Datafusion table provider to query this "schema" for a subset.

The initial concept proposed by @rabernat is a Zarr group named meta which contains one 1D array per metadata property. At a minimum, it would have something like

meta
├── date
├── bbox
└── collection

date - dtype is numpy.datetime64
bbox - contains either WKT with a vlen dtype, or WKB with a bytes dtype (not sure about extension dtype support here)
collection - dtype is vlen
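
To make the layout concrete, here is a minimal sketch of splitting document-style granule records into one column per property. Plain Python lists stand in for the 1D Zarr arrays, and the sample records and the helper names `to_columns` and `row` are made up for illustration; none of this is an Icechunk or Zarr API.

```python
from datetime import datetime

# Hypothetical granule records; the field names follow the meta group above.
records = [
    {"date": "2025-06-01T00:00:00", "bbox": "POLYGON((0 0, 0 5, 5 5, 5 0, 0 0))", "collection": "A"},
    {"date": "2025-07-15T00:00:00", "bbox": "POLYGON((1 1, 1 2, 2 2, 2 1, 1 1))", "collection": "B"},
    {"date": "2025-05-20T00:00:00", "bbox": "POLYGON((8 8, 8 9, 9 9, 9 8, 8 8))", "collection": "A"},
]


def to_columns(records):
    """Split row-oriented records into one 1D column per metadata property."""
    columns = {"date": [], "bbox": [], "collection": []}
    for rec in records:
        columns["date"].append(datetime.fromisoformat(rec["date"]))
        columns["bbox"].append(rec["bbox"])  # WKT kept as a variable-length string
        columns["collection"].append(rec["collection"])
    return columns


def row(columns, idx):
    """Reassemble the record stored at position idx across all columns."""
    return {name: values[idx] for name, values in columns.items()}


columns = to_columns(records)
```

The position of a granule in every column is the same integer, which is what lets a query return plain indices.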

The initial table provider implementation should be able to use a query like

SELECT idx
FROM meta
WHERE date < TIMESTAMP '2025-07-01 00:00:00' AND ST_Intersects(
    bbox,
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))')
);

It should then return a list of indices (more on how we might tie these to their associated array data in a moment). My biggest question here is how Datafusion can generate the most efficient incremental scan plan when we need to filter on multiple arrays.
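
As a way to frame the multi-array question, here is a rough sketch of one possible evaluation order: run the cheap date comparison first, then apply the spatial test only to the surviving indices. This is not a claim about what Datafusion's planner does. The sample data is invented, bboxes are stored as (minx, miny, maxx, maxy) tuples instead of WKT/WKB, and `bbox_intersects` is a bounding-box approximation of ST_Intersects.

```python
from datetime import datetime

# Hypothetical metadata columns, aligned by position.
dates = [
    datetime(2025, 6, 1),
    datetime(2025, 7, 15),
    datetime(2025, 5, 20),
    datetime(2025, 6, 30),
]
bboxes = [(5, 5, 15, 15), (0, 0, 5, 5), (20, 20, 30, 30), (-5, -5, 1, 1)]


def bbox_intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(dates, bboxes, before, window):
    # Pass 1: scalar comparison on the date column only.
    candidates = [i for i, d in enumerate(dates) if d < before]
    # Pass 2: the spatial test reads bbox values only for the survivors.
    return [i for i in candidates if bbox_intersects(bboxes[i], window)]


idx = query(dates, bboxes, datetime(2025, 7, 1), (0, 0, 10, 10))
```

With this data, `idx` is `[0, 3]`: row 1 fails the date predicate and row 2 fails the spatial one.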

This should be our initial goal, and then we can move on to ...

Datafusion Table Provider for Icechunk materialized indexes

Ideally, we'd like to be able to generate separate indices for these 1D arrays, store them as serialized arrays within Icechunk, and have Datafusion materialize them and use them for predicate pushdown when available. I'd propose trying to serialize a geo-index for bbox and some form of B-tree for date, and then seeing how we can interact with those.
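
For the date side, a minimal sketch of what a serializable index could look like, assuming a sorted-array stand-in rather than a real B-tree: two parallel 1D arrays (sorted keys and the original row positions) that could themselves be stored alongside the data. The names `build_date_index` and `rows_before` are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical unsorted date column.
dates = [
    datetime(2025, 6, 1),
    datetime(2025, 7, 15),
    datetime(2025, 5, 20),
    datetime(2025, 6, 30),
]


def build_date_index(dates):
    """Return (sorted keys, original row positions) as two parallel lists."""
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    keys = [dates[i] for i in order]
    return keys, order


def rows_before(index, cutoff):
    """Row positions whose date is strictly before cutoff, via binary search."""
    keys, order = index
    end = bisect_left(keys, cutoff)  # first position whose key is >= cutoff
    return sorted(order[:end])


index = build_date_index(dates)
hits = rows_before(index, datetime(2025, 7, 1))
```

A range predicate such as `date < TIMESTAMP '2025-07-01 00:00:00'` then costs one binary search instead of a full scan of the date array.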

@kylebarron These would be the initial asks for you to investigate. In parallel, @maxrjones and I can start to investigate how our proposed "middleware" library could use idxs returned from the query to

  1. Retrieve the associated array chunks and present them as a Zarr-like structure that can be accessed via zarr-python
  2. Embed the requested metadata fields into the .zattrs for each presented array. (As flat attributes for now).
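
For step 2, a very rough sketch of how returned idxs might be turned into flat attributes. It does not cover chunk retrieval, and everything here is an assumption: the sample columns, the `granule_<idx>` naming scheme, and the helpers `flat_attrs` and `present` are invented for illustration and are not part of zarr-python or Icechunk.

```python
import json

# Hypothetical metadata columns; dates are ISO strings so they stay JSON-serializable.
columns = {
    "date": [
        "2025-06-01T00:00:00",
        "2025-07-15T00:00:00",
        "2025-05-20T00:00:00",
        "2025-06-30T00:00:00",
    ],
    "collection": ["A", "B", "A", "C"],
}


def flat_attrs(columns, idx, fields):
    """Collect the requested metadata fields for one row as a flat dict."""
    return {name: columns[name][idx] for name in fields}


def present(columns, idxs, fields):
    """Map a made-up array name per returned index to its flat attributes."""
    return {f"granule_{i}": flat_attrs(columns, i, fields) for i in idxs}


presented = present(columns, [0, 3], ["date", "collection"])
# Serialized form of each attribute dict, as it might be written to .zattrs.
zattrs_json = {name: json.dumps(attrs) for name, attrs in presented.items()}
```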

This is just some broad strokes to keep you busy while I'm out 😆 so feel free to comment / edit as you see fit.
