Commit 775c2c8

mpiannucci, TomNicholas, and pre-commit-ci[bot] authored
Add Icechunk Support (#256)
* move vds_with_manifest_arrays fixture up
* sketch implementation
* test that we can create an icechunk store
* fixture to create icechunk filestore in temporary directory
* get the async fixture working properly
* split into more functions
* change mode
* try creating zarr group and arrays explicitly
* create root group from store
* todos
* do away with the async pytest fixtures/functions
* successfully writes root group attrs
* check array metadata is correct
* try to write array attributes
* sketch test for checking virtual references have been set correctly
* test setting single virtual ref
* use async properly
* better separation of handling of loadable variables
* fix chunk key format
* use require_array
* check that store supports writes
* removed outdated note about awaiting
* fix incorrect chunk key in test
* absolute path
* convert to file URI before handing to icechunk
* test that without encoding we can definitely read one chunk
* Work on encoding test
* Update test to match
* Quick comment
* more comprehensive
* add attribute encoding
* Fix array dimensions
* Fix v3 codec pipeline
* Put xarray dep back
* Handle codecs, but get bad results
* Gzip and zlib are not directly working
* Get up working with numcodecs zarr 3 codecs
* Update codec pipeline
* Update to latest icechunk using sync api
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Some type stuff
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update zarr and icechunk tests, fix zarr v3 metadata
* Update import we don't need
* Update kerchunk tests
* Check for v3 metadata import in zarr test
* More tests
* type checker
* types
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* More types
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* ooops
* One left
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Finally done being dumb
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Support loadables without tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add test for multiple chunks to check order
* Add loadable variable test
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add accessor, simple docs
* Update icechunk.py
  Co-authored-by: Tom Nicholas <[email protected]>
* Update accessor.py
  Co-authored-by: Tom Nicholas <[email protected]>
* Fix attributes when loadables are available
* Protect zarr import
* Fix import errors in icechunk writer
* More protection
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* i am bad at this
* Add xarray roundtrip asserts
* Add icechunk to api.rst
* Update virtualizarr/tests/test_writers/test_icechunk.py
  Co-authored-by: Tom Nicholas <[email protected]>
* More test improvements, update releases.rst
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* more testing
* Figure out tests for real this time
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------
Co-authored-by: TomNicholas <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 4b7612e commit 775c2c8

16 files changed: +622 -59 lines changed

ci/upstream.yml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ dependencies:
   - fsspec
   - pip
   - pip:
-    - zarr==3.0.0b1 # beta release of zarr-python v3
+    - icechunk # Installs zarr v3 as dependency
     - git+https://github.com/pydata/xarray@zarr-v3 # zarr-v3 compatibility branch
     - git+https://github.com/zarr-developers/numcodecs@zarr3-codecs # zarr-v3 compatibility branch
     # - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)

conftest.py

Lines changed: 15 additions & 0 deletions
@@ -1,6 +1,8 @@
 import h5py
+import numpy as np
 import pytest
 import xarray as xr
+from xarray.core.variable import Variable


 def pytest_addoption(parser):
@@ -96,3 +98,16 @@ def hdf5_scalar(tmpdir):
         dataset = f.create_dataset("scalar", data=0.1, dtype="float32")
         dataset.attrs["scalar"] = "true"
     return filepath
+
+
+@pytest.fixture
+def simple_netcdf4(tmpdir):
+    filepath = f"{tmpdir}/simple.nc"
+
+    arr = np.arange(12, dtype=np.dtype("int32")).reshape(3, 4)
+    var = Variable(data=arr, dims=["x", "y"])
+    ds = xr.Dataset({"foo": var})
+
+    ds.to_netcdf(filepath)
+
+    return filepath

docs/api.rst

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ Serialization

     VirtualiZarrDatasetAccessor.to_kerchunk
     VirtualiZarrDatasetAccessor.to_zarr
+    VirtualiZarrDatasetAccessor.to_icechunk


 Rewriting

docs/releases.rst

Lines changed: 3 additions & 0 deletions
@@ -31,6 +31,9 @@ New Features
 - Support empty files (:pull:`260`)
   By `Justus Magin <https://github.com/keewis>`_.

+- Can write virtual datasets to Icechunk stores using `virtualize.to_icechunk` (:pull:`256`)
+  By `Matt Iannucci <https://github.com/mpiannucci>`_.
+
 Breaking changes
 ~~~~~~~~~~~~~~~~

docs/usage.md

Lines changed: 17 additions & 0 deletions
@@ -396,6 +396,23 @@ combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")

 By default references are placed in separate parquet file when the total number of references exceeds `record_size`. If there are fewer than `categorical_threshold` unique urls referenced by a particular variable, url will be stored as a categorical variable.

+### Writing to an Icechunk Store
+
+We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is an open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`ds.virtualize.to_icechunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.
+
+```python
+# create an icechunk store
+from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
+storage = StorageConfig.filesystem(str('combined'))
+store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
+    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-1'),
+))
+
+combined_vds.virtualize.to_icechunk(store)
+```
+
+See the [Icechunk documentation](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr) for more details.
+
 ### Writing as Zarr

 Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.

virtualizarr/accessor.py

Lines changed: 18 additions & 0 deletions
@@ -1,5 +1,6 @@
 from pathlib import Path
 from typing import (
+    TYPE_CHECKING,
     Callable,
     Literal,
     overload,
@@ -12,6 +13,9 @@
 from virtualizarr.writers.kerchunk import dataset_to_kerchunk_refs
 from virtualizarr.writers.zarr import dataset_to_zarr

+if TYPE_CHECKING:
+    from icechunk import IcechunkStore  # type: ignore[import-not-found]
+

 @register_dataset_accessor("virtualize")
 class VirtualiZarrDatasetAccessor:
@@ -39,6 +43,20 @@ def to_zarr(self, storepath: str) -> None:
         """
         dataset_to_zarr(self.ds, storepath)

+    def to_icechunk(self, store: "IcechunkStore") -> None:
+        """
+        Write an xarray dataset to an Icechunk store.
+
+        Any variables backed by ManifestArray objects will be written as virtual references; any other variables will be loaded into memory before their binary chunk data is written into the store.
+
+        Parameters
+        ----------
+        store: IcechunkStore
+        """
+        from virtualizarr.writers.icechunk import dataset_to_icechunk
+
+        dataset_to_icechunk(self.ds, store)
+
     @overload
     def to_kerchunk(
         self, filepath: None, format: Literal["dict"]
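
The accessor change above keeps `icechunk` optional in two ways: the type is imported only under `TYPE_CHECKING`, and the writer is imported inside the method body. A minimal, self-contained sketch of that pattern follows. The module `hypothetical_backend` and the names `Store`, `write`, and `Exporter` are invented for illustration and are not part of VirtualiZarr or Icechunk.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime, so the
    # optional package does not have to be installed to import this module.
    from hypothetical_backend import Store  # hypothetical optional package


class Exporter:
    def __init__(self, data: dict) -> None:
        self.data = data

    def to_backend(self, store: "Store") -> None:
        # Deferred import: the optional dependency is only required when
        # this method is actually called.
        from hypothetical_backend import write  # hypothetical function

        write(self.data, store)


# Constructing the object works even though the backend is not installed.
exporter = Exporter({"foo": 1})

outcome = None
try:
    exporter.to_backend(store=None)
except ImportError as err:
    # The missing dependency surfaces here, at call time, not at import time.
    outcome = f"import deferred until call: {type(err).__name__}"
```

The quoted annotation (`"Store"`) is what allows the name to be absent at runtime; without the quotes (or `from __future__ import annotations`) the function definition itself would raise `NameError`.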

virtualizarr/readers/zarr_v3.py

Lines changed: 2 additions & 0 deletions
@@ -150,5 +150,7 @@ def _configurable_to_num_codec_config(configurable: dict) -> dict:
     """
     configurable_copy = configurable.copy()
     codec_id = configurable_copy.pop("name")
+    if codec_id.startswith("numcodecs."):
+        codec_id = codec_id[len("numcodecs.") :]
     configuration = configurable_copy.pop("configuration")
     return numcodecs.get_codec({"id": codec_id, **configuration}).get_config()
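
The two added lines strip a leading `numcodecs.` from the codec name before it is used as a numcodecs `id`. The sketch below isolates that step without importing numcodecs, so it only builds the flat `{"id": ..., **configuration}` dict that the real function passes to `numcodecs.get_codec`. The helper names and the sample input `numcodecs.zlib` are assumptions made for illustration.

```python
NUMCODECS_PREFIX = "numcodecs."


def strip_numcodecs_prefix(codec_name: str) -> str:
    """Return the codec name without a leading 'numcodecs.' prefix, if present."""
    if codec_name.startswith(NUMCODECS_PREFIX):
        return codec_name[len(NUMCODECS_PREFIX):]
    return codec_name


def to_flat_codec_config(configurable: dict) -> dict:
    """Flatten a {'name': ..., 'configuration': {...}} dict into {'id': ..., ...}."""
    # Work on a shallow copy so the caller's dict keeps its keys.
    configurable_copy = configurable.copy()
    codec_id = strip_numcodecs_prefix(configurable_copy.pop("name"))
    configuration = configurable_copy.pop("configuration")
    return {"id": codec_id, **configuration}


# Hypothetical input in the name/configuration layout handled by the diff.
example = {"name": "numcodecs.zlib", "configuration": {"level": 1}}
flat = to_flat_codec_config(example)
```

Names without the prefix pass through unchanged, so the helper is safe to apply to every codec entry.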

virtualizarr/tests/test_integration.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ def test_kerchunk_roundtrip_in_memory_no_concat():
             chunks=(2, 2),
             compressor=None,
             filters=None,
-            fill_value=np.nan,
+            fill_value=None,
             order="C",
         ),
         chunkmanifest=manifest,

virtualizarr/tests/test_manifests/test_array.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ def test_create_manifestarray_from_kerchunk_refs(self):
         assert marr.chunks == (2, 3)
         assert marr.dtype == np.dtype("int64")
         assert marr.zarray.compressor is None
-        assert marr.zarray.fill_value is np.nan
+        assert marr.zarray.fill_value == 0
         assert marr.zarray.filters is None
         assert marr.zarray.order == "C"
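
The hunk above replaces an identity check against `np.nan` with an equality check against `0`. The commit does not say why the expected fill value changed, so the following is only an aside on the two comparison styles. NaN never compares equal to itself, which is why the old assertion had to use `is`; an integer fill value can be compared with `==`. The sketch uses only the standard library, relying on the fact that `np.nan` is an ordinary Python `float`.

```python
import math

nan_a = float("nan")
nan_b = float("nan")

# Equality is always False for NaN, even between two NaN values.
equal_by_value = nan_a == nan_b

# Identity only succeeds when both names refer to the very same object.
same_object = nan_a is nan_a
different_objects = nan_a is nan_b

# math.isnan is the robust way to test for NaN regardless of object identity.
detected = math.isnan(nan_b)

# An integer fill value such as 0 compares equal by value, including to 0.0.
zero_fill_matches = 0 == 0.0
```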

virtualizarr/tests/test_readers/test_kerchunk.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ def test_dataset_from_df_refs():

     assert da.data.zarray.compressor is None
     assert da.data.zarray.filters is None
-    assert da.data.zarray.fill_value is np.nan
+    assert da.data.zarray.fill_value == 0
     assert da.data.zarray.order == "C"

     assert da.data.manifest.dict() == {
