
HLS STAC Geoparquet Archive #210

@hrodmn

Description


Background

In PI 25.3, as part of a MAAP science support task, I built a pipeline that writes a STAC geoparquet copy of the published HLS STAC records to an S3 bucket, to help MAAP users who have been hampered by CMR STAC rate limits on the HLS collections. The result is a 6 GB STAC geoparquet store that MAAP users can now query with rustac:

from rustac import DuckdbClient


client = DuckdbClient(use_hive_partitioning=True)

# configure duckdb to find S3 credentials
client.execute(
    """
    CREATE OR REPLACE SECRET secret (
         TYPE S3,
         PROVIDER CREDENTIAL_CHAIN
    );
    """
)

# use rustac/duckdb to search through the partitioned parquet dataset to find matching items
results = client.search(
    href="s3://maap-ops-workspace/shared/henrydevseed/hls-stac-geoparquet-v1/year=*/month=*/*.parquet",
    datetime="2025-05-01T00:00:00Z/2025-05-31T23:59:59Z",
    bbox=(-90, 45, -85, 50),
)

The STAC geoparquet archive contains every granule in the HLS archive through May 2025, but there is no pipeline in place to keep it updated going forward.

The archive was built by a pipeline running in AWS Batch, with one task per year-month from mid-2013 through May 2025:

For each day in the year-month:

  1. Query CMR using python-cmr for all granules in the HLS collections
  2. For each granule, pluck the href from the link to the STAC JSON record that is produced as part of the HLS processing pipeline
  3. Read the JSON files and write to a daily STAC geoparquet file using rustac
  4. Use duckdb to combine the daily parquet files and write the result to the new year-month partition in S3 (a rough sketch of these steps follows this list)
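To make the daily steps concrete, here is a rough, untested sketch. It assumes that python-cmr's GranuleQuery, rustac.write, and the duckdb Python package behave as in their current releases; the _stac.json suffix used to identify the STAC record link, the function names, and the paths are illustrative assumptions rather than the exact pipeline code.

import datetime

import duckdb
import requests
import rustac
from cmr import GranuleQuery


async def build_daily_geoparquet(short_name: str, day: datetime.date, out_path: str) -> None:
    # 1. Query CMR for all granules in the collection for this day
    start = datetime.datetime.combine(day, datetime.time.min)
    end = datetime.datetime.combine(day, datetime.time.max)
    granules = GranuleQuery().short_name(short_name).temporal(start, end).get_all()

    # 2. Pluck the href of the STAC JSON record from each granule's links
    #    (assumes the STAC record can be identified by a _stac.json suffix)
    hrefs = [
        link["href"]
        for granule in granules
        for link in granule.get("links", [])
        if link["href"].endswith("_stac.json")
    ]

    # 3. Read the STAC items and write a daily STAC geoparquet file with rustac
    items = [requests.get(href, timeout=30).json() for href in hrefs]
    await rustac.write(out_path, items)


def combine_daily_files(daily_glob: str, partition_href: str) -> None:
    # 4. Combine the daily parquet files into a single year-month partition in S3
    con = duckdb.connect()
    con.execute("CREATE OR REPLACE SECRET secret (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
    con.execute(
        f"COPY (SELECT * FROM read_parquet('{daily_glob}')) TO '{partition_href}' (FORMAT PARQUET)"
    )

In the actual pipeline, each AWS Batch task would loop over the days in its year-month, run the daily step for each, and then run the combine step once to write the partition.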

Processing the entire historical archive took roughly one day of continuous processing on two modest EC2 instances; a single year-month takes 45 minutes at most.

I haven't fully implemented it yet, but the CMR team recently opened up the CMR GraphQL API, which makes it easy to write a very specific granule query that returns just the metadata links (and nothing else). This could improve the efficiency of the link-fetching step.
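For example, a targeted query might look something like the sketch below. The endpoint is the public CMR GraphQL API, but the query shape and field names (granules, params, shortName, temporal, relatedUrls) are my assumptions based on the UMM-G model and should be checked against the published schema before use.

import requests

CMR_GRAPHQL = "https://graphql.earthdata.nasa.gov/api"

# field and parameter names below are assumptions; verify them against the CMR GraphQL schema
QUERY = """
{
  granules(
    params: {shortName: "HLSL30", temporal: "2025-05-01T00:00:00Z,2025-05-02T00:00:00Z", limit: 2000}
  ) {
    items {
      relatedUrls
    }
  }
}
"""

response = requests.post(CMR_GRAPHQL, json={"query": QUERY}, timeout=30)
response.raise_for_status()
granules = response.json()["data"]["granules"]["items"]

# each granule's relatedUrls would then be filtered down to the STAC JSON href,
# just like the link-plucking step in the pipeline above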

Vision

My vision is to maintain a public archive of all of the HLS STAC records in a well-partitioned geoparquet store that can be queried with tools like rustac and duckdb.

This archive should not be used for real-time queries for the latest granules, but rather for high-volume queries of the historical archive. Performance will be slower for individual queries compared to querying CMR STAC or another STAC API, but there will be no API rate limit beyond S3's GET request limits, and no maintenance burden for the host of the archive.

This is a stop-gap solution to help address known limitations of CMR for serving the needs of large-scale users. At the very least, producing this archive could provide a benchmark for comparison to other solutions (e.g. level-2 data tree zarr archive). At best, our experience can inform future plans for providing an API-less mechanism for retrieving STAC records from CMR.

Requirements

The basic characteristics of the STAC geoparquet archive would be something like this:

  1. Partitioned by collection (HLSL30_2.0 + HLSS30_2.0), year, and month, possibly with row-group partitioning by bounding box
  2. Each year-month (e.g. June 2025) would be processed automatically on a schedule at some point during the following month (e.g. July 15, 2025). This would yield a recency of ~45 days for the archive. Maybe there is a more elegant solution involving something like Apache Iceberg that could facilitate more frequent updates and help improve recency, but for the MVP maybe that's not necessary.
  3. The archive must contain all of the records that are available via the CMR API (no gaps or missing records)
  4. Ideally, the product should be stored in a public bucket with list (ls) access enabled, to facilitate clients that want to use the hive-partitioning scheme (a possible layout is sketched below) @kylebarron
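For illustration, a hive-partitioned layout along these lines would satisfy requirements 1 and 4; the bucket name, prefix, and file names are placeholders, not a decided convention:

s3://<public-bucket>/hls-stac-geoparquet/
  collection=HLSL30_2.0/year=2025/month=05/items.parquet
  collection=HLSL30_2.0/year=2025/month=06/items.parquet
  collection=HLSS30_2.0/year=2025/month=05/items.parquet
  ...

Clients could then glob collection=*/year=*/month=*/*.parquet with hive partitioning enabled, as in the query example above, and get partition pruning on collection, year, and month.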

I am not exactly sure how this idea will fit into our plans for 25.4, but I want to put it down so we can at least consider it against other priorities.

cc @abarciauskas-bgse @sharkinsspatial @kylebarron @wildintellect @briannapagan @brianmfreitag
