Lebovits/issu 1016 ndvi #1171
base: staging
Conversation
I like the number and small size of the helper classes. With our services, I'm planning to move to a pattern where the service takes the feature layer but then calls a transform-and-merge function that relies only on gdfs, so the logic can be tested without worrying about the feature layer class.
I added some suggestions to reduce how much we need to mock, and also had a concern about whether the row length of results could ever differ from primary_featurelayer.gdf.
This should be some very interesting and helpful functionality!
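The service pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code; all names (transform_ndvi, ndvi_service, has_geometry) are hypothetical, and the FeatureLayer class is assumed to expose a gdf attribute.

```python
import geopandas as gpd


def transform_ndvi(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Pure gdf-in, gdf-out transform: unit-testable without FeatureLayer.

    The column added here is a stand-in for real NDVI logic.
    """
    out = gdf.copy()
    out["has_geometry"] = out.geometry.notna()
    return out


def ndvi_service(feature_layer) -> None:
    """Thin service wrapper: unwrap the gdf, delegate, reassign."""
    feature_layer.gdf = transform_ndvi(feature_layer.gdf)
```

Because transform_ndvi touches only GeoDataFrames, its tests need no mocks of the feature layer class.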
data/src/new_etl/data_utils/ndvi.py (Outdated)
from ..constants.services import CENSUS_BGS_URL

def get_current_summer_year() -> int:
This may produce surprising behavior around the end of May/early June if we are trying to rerun a pipeline or recreate old results. We may want to add an optional date parameter so we can control this when needed; that would also eliminate the need for freezegun in the tests.
I haven't yet looked at how this is used elsewhere, but is there a lag between when summer begins and when data is available for that summer that we need to worry about?
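The suggestion could look something like the sketch below. The June cutoff is an assumption for illustration (the quoted diff only shows the function returning today.year); the point is the optional parameter that makes reruns and tests deterministic without freezegun.

```python
from datetime import date
from typing import Optional


def get_current_summer_year(today: Optional[date] = None) -> int:
    """Return the year of the most recent summer.

    Accepts an optional reference date so callers (and tests) can pin
    the clock; defaults to the real current date.
    """
    if today is None:
        today = date.today()
    # Assumed cutoff: before June, the most recent summer was last year.
    return today.year if today.month >= 6 else today.year - 1
```

A test can then simply pass `date(2025, 5, 15)` instead of freezing time globally.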
data/src/new_etl/data_utils/ndvi.py (Outdated)
return today.year

def get_bbox_from_census_data() -> Tuple[Polygon, Tuple[float, float, float, float]]:
To make this easier to test, we may want this function to take an optional gdf rather than always reading from CENSUS_BGS_URL.
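A possible shape for that change is sketched below, assuming the function's job is to derive a bounding box from census block groups; the body is illustrative, since the original implementation isn't shown in the thread.

```python
from typing import Optional, Tuple

import geopandas as gpd
from shapely.geometry import Polygon, box

# Stand-in for the real CENSUS_BGS_URL constant from constants.services.
CENSUS_BGS_URL = "https://example.com/census_block_groups.geojson"


def get_bbox_from_census_data(
    gdf: Optional[gpd.GeoDataFrame] = None,
) -> Tuple[Polygon, Tuple[float, float, float, float]]:
    """Return the overall bounding box of the census block groups.

    Accepts an optional pre-loaded GeoDataFrame so tests can pass a
    small fixture instead of patching geopandas.read_file.
    """
    if gdf is None:
        gdf = gpd.read_file(CENSUS_BGS_URL)
    minx, miny, maxx, maxy = gdf.total_bounds
    return box(minx, miny, maxx, maxy), (minx, miny, maxx, maxy)
```

Tests then construct a two-polygon GeoDataFrame inline and assert on the returned bounds, with no network access or mocking.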
data/src/test/test_ndvi.py (Outdated)
self.assertFalse(is_cache_valid(wrong_year_path, 2025))
self.assertTrue(is_cache_valid(self.test_raster_path, 2025))

@patch("geopandas.read_file")
See if the patch/mocks can be removed after allowing an optional gdf to be passed to this function.
data/src/test/test_ndvi.py (Outdated)
@patch("new_etl.data_utils.ndvi.odc.stac.stac_load")
@patch("new_etl.data_utils.ndvi.get_current_summer_year")
@patch("new_etl.data_utils.ndvi.pystac_client.Client.open")
def test_generate_ndvi_data_failure(
Unless you are actually concerned the correct exception won't be hit, I'd just remove this test.
data/src/test/test_ndvi.py (Outdated)
@patch("tempfile.gettempdir")
@patch("new_etl.data_utils.ndvi.get_current_summer_year")
@patch("new_etl.data_utils.ndvi.get_bbox_from_census_data")
It is hard to reason about tests when so much is patched. See if we can eliminate at least two of the patches based on my suggestions.
data/src/new_etl/data_utils/ndvi.py (Outdated)
# Update the primary feature layer with the new columns from results
# (instead of creating a new one)
for column in results.columns:
Given that results is computed on a cleaned version of the gdf and we're adding its columns into a version that hasn't been cleaned, is there a possibility of misalignment or differing row lengths when merging in the columns?
It is probably worth making this logic a separate function, then trying out some filtering on a df and making sure things merge okay.
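One way to make that merge robust is to join by index rather than assigning positionally, so rows dropped during cleaning come back as NaN instead of shifting values. This is a sketch under assumed names (merge_result_columns is hypothetical); it is not the PR's implementation.

```python
import geopandas as gpd
import pandas as pd


def merge_result_columns(
    primary_gdf: gpd.GeoDataFrame, results: pd.DataFrame
) -> gpd.GeoDataFrame:
    """Copy columns from results into primary_gdf, aligned by index.

    Rows of primary_gdf that were filtered out before computing results
    receive NaN rather than a misaligned value, even when the two frames
    have different row counts.
    """
    out = primary_gdf.copy()
    for column in results.columns:
        # reindex aligns on the shared index; missing rows become NaN.
        out[column] = results[column].reindex(out.index)
    return out
```

Pulling the loop into its own function also makes the reviewer's suggested filter-then-merge test a three-line unit test.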
New Feature: Add NDVI and One-Year NDVI Change to ETL Pipeline
Description
This PR addresses issue #1016. It uses Planetary Computer to query Sentinel-2 satellite imagery for the most recent summer and the summer prior. It produces a raster with the most recent summer's NDVI at 10 m resolution, plus the one-year change in NDVI from the prior summer, each as a separate band. It saves the raster as a file in the tmp directory (currently around 65 MB), then uses exactextract to get the mean NDVI and mean NDVI change for each parcel in the primary feature layer dataset (all 500k+ properties in Philadelphia). The code checks whether the data for the most recent summer already exists in the tmp folder; if it does, it skips straight to extraction. The full run without cached data takes 10-15 minutes; with cached data, about 3 minutes. I've also added unit tests in test/test_ndvi.py.
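For reference, the NDVI quantity the pipeline computes per pixel follows the standard formula over Sentinel-2's near-infrared (B08) and red (B04) bands. The sketch below illustrates that formula only; it is not the PR's code, and the zero-denominator handling is an assumption.

```python
import numpy as np


def compute_ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), bounded in [-1, 1].

    Pixels where both bands are zero (no signal) are assumed to map
    to 0.0 rather than dividing by zero.
    """
    nir = np.asarray(nir, dtype="float64")
    red = np.asarray(red, dtype="float64")
    denom = nir + red
    safe = np.where(denom == 0, 1.0, denom)  # placeholder divisor
    return np.where(denom == 0, 0.0, (nir - red) / safe)
```

The one-year change band is then simply the current summer's NDVI array minus the prior summer's, pixel by pixel.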