Cubes & Clouds logo

Filter Operators with Pangeo#

Filter Operators#

When interacting with large data collections, it is necessary to keep in mind that it’s not possible to load everything!

Therefore, we always have to define our requirements in advance and apply them to the data using filter operators.

Let’s start again with the same sample data from the Sentinel-2 STAC Collection with an additional filter.

Properties Filter#

When working with optical data like Sentinel-2, most of the times we would like to discard cloudy acquisitions as soon as possible.

We can do it using a property filter: in this case we want to keep only the acquisitions with less than 50% cloud coverage.

# STAC Catalogue Libraries
import pystac_client
import stackstac
import numpy as np
from pyproj import Proj
spatial_extent = [11.1, 46.1, 11.5, 46.5]
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    bbox=spatial_extent,
    collections=["sentinel-2-l2a"]
).item_collection()

datacube = stackstac.stack(items,
                     bounds_latlon=spatial_extent,
)
datacube
/srv/conda/envs/notebook/lib/python3.11/site-packages/stackstac/prepare.py:408: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  times = pd.to_datetime(
<xarray.DataArray 'stackstac-c597f76de9eadb71259e77f79facceff' (time: 1259,
                                                                band: 32,
                                                                y: 4535, x: 3210)>
dask.array<fetch_raster_window, shape=(1259, 32, 4535, 3210), dtype=float64, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-05...
    id                                       (time) <U24 'S2A_32TPS_20161105_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 6.611e+05 ... 6.932e+05
  * y                                        (y) float64 5.153e+06 ... 5.107e+06
    grid:code                                <U10 'MGRS-32TPS'
    ...                                       ...
    raster:bands                             (band) object [{'nodata': 0, 'da...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0

Let’s filter out scenes with >50% cloud coverage (according to the eo:cloud_cover field set by the data provider).

lowcloud = datacube[datacube["eo:cloud_cover"] < 50]
lowcloud
<xarray.DataArray 'stackstac-c597f76de9eadb71259e77f79facceff' (time: 688,
                                                                band: 32,
                                                                y: 4535, x: 3210)>
dask.array<getitem, shape=(688, 32, 4535, 3210), dtype=float64, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-08...
    id                                       (time) <U24 'S2A_32TPS_20161108_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 6.611e+05 ... 6.932e+05
  * y                                        (y) float64 5.153e+06 ... 5.107e+06
    grid:code                                <U10 'MGRS-32TPS'
    ...                                       ...
    raster:bands                             (band) object [{'nodata': 0, 'da...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0

Temporal filter#

To filter along time the data collection with Pangeo Xarray, we can use where method from Xarray.

Note: We cannot use the sel method to select a slice to filter on the time dimension since the times are not equally spaced.

temporal_slice = lowcloud.where((lowcloud.time >= np.datetime64("2022-05-10T00:00:00Z")) &  (lowcloud.time  <= np.datetime64("2022-06-30T00:00:00Z")))
temporal_slice
/tmp/ipykernel_2309/1692916584.py:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
  temporal_slice = lowcloud.where((lowcloud.time >= np.datetime64("2022-05-10T00:00:00Z")) &  (lowcloud.time  <= np.datetime64("2022-06-30T00:00:00Z")))
<xarray.DataArray 'stackstac-c597f76de9eadb71259e77f79facceff' (time: 688,
                                                                band: 32,
                                                                y: 4535, x: 3210)>
dask.array<where, shape=(688, 32, 4535, 3210), dtype=float64, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-08...
    id                                       (time) <U24 'S2A_32TPS_20161108_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 6.611e+05 ... 6.932e+05
  * y                                        (y) float64 5.153e+06 ... 5.107e+06
    grid:code                                <U10 'MGRS-32TPS'
    ...                                       ...
    raster:bands                             (band) object [{'nodata': 0, 'da...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0

After running the previous cell, it is visible that the result has less elements (or labels) in the temporal dimension time.

Additionally, the size of the selected data reduced a lot.

Quiz hint: look carefully at the dimensions of the resulting datacube!

Spatial filter#

To slice along the spatial dimensions the data collection with openEO, we can use sel method with a slice.

Let’s get the spatial extent expressed in the Coordinate Reference System of the datacube. We need to project latitudes, longitudes to the data cube CRS.

west = 11.259613; east = 11.406212
south = 46.461019; north = 46.522237
projection = Proj(temporal_slice.attrs["crs"])
x_ws, y_ws = projection(west, south)
x_wn, y_wn = projection(west, north)
x_es, y_es = projection(east, south)
x_en, y_en = projection(east, north)
xmax = max(x_ws, x_wn, x_es, x_en)
xmin = min(x_ws, x_wn, x_es, x_en)
ymax = max(y_ws, y_wn, y_es, y_en)
ymin = min(y_ws, y_wn, y_es, y_en)
print(xmin, xmax, ymin, ymax)
673311.3443826602 684762.3945897217 5147752.717505932 5154887.167103742

The sel method with slice is used with a set of coordinates.

Note: The order of the interval in the slice needs to be expressed in the same order as the corresponding coordinate.

spatial_slice = temporal_slice.sel(x=slice(xmin,xmax), y = slice(ymax,ymin))
spatial_slice
<xarray.DataArray 'stackstac-c597f76de9eadb71259e77f79facceff' (time: 688,
                                                                band: 32,
                                                                y: 490, x: 1145)>
dask.array<getitem, shape=(688, 32, 490, 1145), dtype=float64, chunksize=(1, 1, 490, 829), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-08...
    id                                       (time) <U24 'S2A_32TPS_20161108_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 6.733e+05 ... 6.848e+05
  * y                                        (y) float64 5.153e+06 ... 5.148e+06
    grid:code                                <U10 'MGRS-32TPS'
    ...                                       ...
    raster:bands                             (band) object [{'nodata': 0, 'da...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0

Quiz hint: look carefully at the dimensions of the loaded datacube!

Bands filter#

To slice along the bands dimension, keeping only the necessary bands, we can use the sel method and isin.

bands = ["red","green","blue"]
bands_slice = spatial_slice.sel(band=spatial_slice.band.isin(bands))
bands_slice
<xarray.DataArray 'stackstac-c597f76de9eadb71259e77f79facceff' (time: 688,
                                                                band: 3,
                                                                y: 490, x: 1145)>
dask.array<getitem, shape=(688, 3, 490, 1145), dtype=float64, chunksize=(1, 1, 490, 829), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-08...
    id                                       (time) <U24 'S2A_32TPS_20161108_...
  * band                                     (band) <U12 'blue' 'green' 'red'
  * x                                        (x) float64 6.733e+05 ... 6.848e+05
  * y                                        (y) float64 5.153e+06 ... 5.148e+06
    grid:code                                <U10 'MGRS-32TPS'
    ...                                       ...
    raster:bands                             (band) object None None None
    gsd                                      (band) object 10 10 10
    common_name                              (band) object 'blue' 'green' 'red'
    center_wavelength                        (band) object 0.49 0.56 0.665
    full_width_half_max                      (band) object 0.098 0.045 0.038
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0