Data Access and Basic Processing with the Pangeo Ecosystem

This exercise will use the Pangeo ecosystem to access and process data.

Quiz hint: remember this information for the final quiz!

Lazy data loading

When accessing data through an API, most of the time the data is loaded lazily.

This means that only the metadata is loaded, so it is possible to know the data dimensions and their extents (spatial and temporal), the available bands and other additional information, without transferring the data itself.

Let’s start with a call to the STAC client Python library pystac_client to lazily load some Sentinel-2 data from a public STAC Collection.

We need to specify an Area Of Interest (AOI) to get only part of the Collection; otherwise, our code would try to load the metadata of all Sentinel-2 tiles available in the world!

# STAC Catalogue Libraries
import pystac_client
import stackstac

# Area of interest as [min_lon, min_lat, max_lon, max_lat] in WGS84
spatial_extent = [11.1, 46.1, 11.5, 46.5]

# Open the public Earth Search STAC API and search the Sentinel-2 L2A collection
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    bbox=spatial_extent,
    collections=["sentinel-2-l2a"]
).item_collection()
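
As a quick sanity check before stacking, you can count how many STAC items (scenes) the search matched; the returned ItemCollection supports len(). The exact number depends on when you run the search:

# How many Sentinel-2 scenes matched the search
print(len(items))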

Calling the stackstac.stack() method on the items, the data will be lazily loaded and an xarray.DataArray object returned.

Running the next cell will show the selected data content with the dimension names and their extent:

# Stack the STAC items into a lazy 4-dimensional (time, band, y, x) DataArray,
# cropped to the area of interest
datacube = stackstac.stack(items, bounds_latlon=spatial_extent)
datacube
<xarray.DataArray 'stackstac-2e23865ac45938d8eaf61fffa22fad58' (time: 1259,
                                                                band: 32,
                                                                y: 4535, x: 3210)>
dask.array<fetch_raster_window, shape=(1259, 32, 4535, 3210), dtype=float64, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2016-11-05...
    id                                       (time) <U24 'S2A_32TPS_20161105_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 6.611e+05 ... 6.932e+05
  * y                                        (y) float64 5.153e+06 ... 5.107e+06
    s2:not_vegetated_percentage              (time) object 0.164157 ... 10.89...
    ...                                       ...
    title                                    (band) <U31 'Aerosol optical thi...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32632
Attributes:
    spec:        RasterSpec(epsg=32632, bounds=(661130.0, 5107300.0, 693230.0...
    crs:         epsg:32632
    transform:   | 10.00, 0.00, 661130.00|\n| 0.00,-10.00, 5152650.00|\n| 0.0...
    resolution:  10.0

From the output of the previous cell you can notice something really interesting: the size of the selected data is more than 3 TB!
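
You can double-check this with a back-of-the-envelope calculation from the shape and dtype printed above: 1259 × 32 × 4535 × 3210 float64 values, at 8 bytes each, comes to roughly 4.7 × 10¹² bytes, i.e. several terabytes.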

But you should also have noticed that the cell ran in seconds, far too quickly to have actually downloaded this huge amount of data.

This is what lazy loading allows: getting all the information about the data quickly, without having to access and download all the available files.
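
Data is only transferred when you explicitly ask for the values. As a minimal sketch of this (the band label "red" and the time index below are illustrative choices, not part of the original exercise), you can select a small slice, which stays lazy, and then call .compute() to download just that piece:

# Still lazy: this only builds a Dask task graph for the red band of the first scene
red_first = datacube.sel(band="red").isel(time=0)

# Only now are the underlying files accessed and the values loaded into memory
red_values = red_first.compute()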

Quiz hint: look carefully at the dimensions of the loaded datacube!