3.1 Data Processing#
In this exercise we will build a complete EO workflow on a cloud platform, from data access to the final result. As an example we will analyse snow cover in the Alps.
We are going to follow these steps in our analysis:
Load satellite collections
Specify the spatial and temporal extents and the features we are interested in
Process the satellite data to retrieve snow cover information
Aggregate information in data cubes
Visualize and analyse the results
More information on the openEO Python Client: https://open-eo.github.io/openeo-python-client/index.html
Libraries#
%%capture
pip install openeo rioxarray geopandas leafmap h5netcdf netcdf4
# platform libraries
import openeo
# utility libraries
from datetime import date
import numpy as np
import xarray as xr
import rioxarray
import json
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import leafmap.foliumap as leafmap
Login#
Connect to the openEO Platform back-end.
conn = openeo.connect('https://openeo.cloud/')
And log in.
conn.authenticate_oidc()
Authenticated using refresh token.
<Connection to 'https://openeocloud.vito.be/openeo/1.0.0/' with OidcBearerAuth>
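authenticate_oidc() opens an interactive (device-code) login in the browser the first time and re-uses a cached refresh token afterwards, as the output above shows. For non-interactive use, e.g. in scheduled scripts, the client also offers client-credentials authentication; a minimal sketch, assuming you have a pre-registered OIDC client (the client_id and client_secret below are placeholders):
# non-interactive alternative (sketch): requires a pre-registered OIDC client
conn.authenticate_oidc_client_credentials(client_id="...", client_secret="...")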
Check if the login worked.
conn.describe_account()
{'default_plan': 'generic',
'info': {'oidc_userinfo': {'eduperson_assurance': ['https://refeds.org/assurance/IAP/low',
'https://aai.egi.eu/LoA#Substantial'],
'eduperson_entitlement': ['urn:mace:egi.eu:group:vo.notebooks.egi.eu:role=member#aai.egi.eu',
'urn:mace:egi.eu:group:vo.notebooks.egi.eu:role=vm_operator#aai.egi.eu',
'urn:mace:egi.eu:group:vo.openeo.cloud:role=early_adopter#aai.egi.eu',
'urn:mace:egi.eu:group:vo.openeo.cloud:role=platform_developer#aai.egi.eu',
'urn:mace:egi.eu:group:vo.openeo.cloud:role=employee#aai.egi.eu',
'urn:mace:egi.eu:group:vo.openeo.cloud:role=member#aai.egi.eu',
'urn:mace:egi.eu:group:vo.openeo.cloud:role=vm_operator#aai.egi.eu'],
'eduperson_scoped_affiliation': ['employee@eurac.edu', 'member@eurac.edu'],
'email': 'michele.claus@eurac.edu',
'email_verified': True,
'sub': '28860add33a3a705c0092d54ae94177752ba9fa4bf00a25b24864ea95b9e0025@egi.eu',
'voperson_verified_email': ['michele.claus@eurac.edu']}},
'name': 'michele.claus@eurac.edu',
'roles': ['EarlyAdopter', 'PlatformDeveloper'],
'user_id': '28860add33a3a705c0092d54ae94177752ba9fa4bf00a25b24864ea95b9e0025@egi.eu'}
Region of Interest#
Load the catchment area.
catchment_outline = gpd.read_file('../data/catchment_outline.geojson')
center = (catchment_outline.centroid.y.iloc[0], catchment_outline.centroid.x.iloc[0])
m = leafmap.Map(center=center, zoom=10)
m.add_vector('../data/catchment_outline.geojson', layer_name="catchment")
m
Inspect Metadata#
We need to set the following configurations to define the content of the data cube we want to access:
dataset name
band names
time range
the area of interest specified via bounding box coordinates
spatial resolution
To select the correct dataset we can first list all the available datasets.
conn.list_collections()
We want to use the Sentinel-2 L2A product. Its name is 'SENTINEL2_L2A'.
We get the metadata for this collection as follows.
conn.describe_collection("SENTINEL2_L2A")
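The collection metadata also lists the available band names. If you prefer to read them programmatically, a minimal sketch, assuming the response follows the standard openEO collection metadata layout (cube:dimensions):
# the "bands" dimension of the collection metadata lists all available band names
metadata = conn.describe_collection("SENTINEL2_L2A")
print(metadata["cube:dimensions"]["bands"]["values"])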
Define a workflow#
Now we define our workflow, chaining together all the processes we need to analyse the snow cover in the catchment.
Define the data cube#
We define the extents of our data cube. We use the catchment bounding box as spatial extent. As time range we focus on the snow melting season 2022, in particular February to June 2022.
bbox = catchment_outline.bounds.iloc[0]
bbox
minx 11.020833
miny 46.653599
maxx 11.366667
maxy 46.954167
Name: 0, dtype: float64
from openeo.processes import lte
collection = 'SENTINEL2_L2A'
spatial_extent = {'west': bbox['minx'], 'east': bbox['maxx'], 'south': bbox['miny'], 'north': bbox['maxy'], 'crs': 4326}
temporal_extent = ["2022-02-01", "2022-06-30"]
bands = ['B03', 'B11']
properties={"eo:cloud_cover": lambda x: lte(x, 90)}
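As an aside, the same cloud-cover pre-filter can also be expressed with the max_cloud_cover shorthand of load_collection, which we use later when loading the SCL band; a minimal, equivalent sketch:
# equivalent to the eo:cloud_cover properties filter above (sketch)
s2_alt = conn.load_collection(collection,
                              spatial_extent=spatial_extent,
                              bands=bands,
                              temporal_extent=temporal_extent,
                              max_cloud_cover=90)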
Load the data cube#
We have defined the extents we are interested in. Now we use these definitions to load the data cube.
s2 = conn.load_collection(collection,
spatial_extent=spatial_extent,
bands=bands,
temporal_extent=temporal_extent,
properties=properties)
NDSI - Normalized Difference Snow Index#
The Normalized Difference Snow Index (NDSI) is computed from the green and the shortwave infrared (SWIR) reflectance as:
$$NDSI = \frac{GREEN - SWIR}{GREEN + SWIR}$$
We have created a Sentinel-2 data cube with the bands B03 (green) and B11 (SWIR). We will use these two bands to calculate the NDSI. This process reduces the band dimension of the data cube to generate new information: the NDSI.
green = s2.band("B03")
swir = s2.band("B11")
ndsi = (green - swir) / (green + swir)
ndsi
Creating the Snow Map#
So far we have a timeseries of NDSI values. However, we are interested in the presence of snow, ideally as a binary classification: snow or no snow. To achieve this we apply a threshold of 0.42 to the NDSI. This gives us a binary snow map.
snowmap = ( ndsi > 0.42 ) * 1.0
snowmap
Creating a cloud mask#
We are going to use the “SCL” (scene classification) band to create a cloud mask and then apply it to the snow map. The classes we mask are: 8 = cloud medium probability, 9 = cloud high probability, 3 = cloud shadow.
More information on the scene classification is available here: https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview
| Value | Label |
|---|---|
| 0 | NO_DATA |
| 1 | SATURATED_OR_DEFECTIVE |
| 2 | CAST_SHADOWS |
| 3 | CLOUD_SHADOWS |
| 4 | VEGETATION |
| 5 | NOT_VEGETATED |
| 6 | WATER |
| 7 | UNCLASSIFIED |
| 8 | CLOUD_MEDIUM_PROBABILITY |
| 9 | CLOUD_HIGH_PROBABILITY |
| 10 | THIN_CIRRUS |
| 11 | SNOW or ICE |
scl_cube = conn.load_collection(
"SENTINEL2_L2A",
spatial_extent=spatial_extent,
bands=["SCL"],
temporal_extent=temporal_extent,
max_cloud_cover=90,
)
scl_band = scl_cube.band("SCL")
cloud_mask = ( (scl_band == 8) | (scl_band == 9) | (scl_band == 3) ) * 1.0
cloud_mask
Note that the SCL layer has a ground sample distance (GSD) of 20 m, while B03 and B11 have a GSD of 10 m.
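The back-end usually takes care of aligning the two resolutions when the mask is applied. If you want to make the resampling explicit, a minimal sketch using resample_cube_spatial (optional, not required for this workflow):
# explicitly resample the 20 m cloud mask onto the 10 m grid of the Sentinel-2 cube (sketch)
cloud_mask_10m = cloud_mask.resample_cube_spatial(target=s2, method="near")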
Applying the cloud mask to the snowmap#
We will mask out all pixels that are covered by clouds. This will result in: 0 = no_snow, 1 = snow, 2 = cloud
snowmap_cloudfree = snowmap.mask(cloud_mask,replacement=2) # replacement is null by default
snowmap_cloudfree
Mask Polygon: From Bounding Box to Shape#
Filter to the exact outline of the catchment: this should mask out the pixels outside of the catchment.
catchment_outline['geometry'][0]
snowmap_cloudfree_masked = snowmap_cloudfree.mask_polygon(catchment_outline['geometry'][0])
Visualize one time step of the timeseries#
Let’s select a short time range, download it as a netCDF file and have a look at how our first result looks.
snowmap_cloudfree_1d = snowmap_cloudfree_masked.filter_temporal('2022-02-10', '2022-02-12')
snowmap_cloudfree_1d.download('results/snowmap_cloudfree_1d.nc')
xr.open_dataarray('results/snowmap_cloudfree_1d.nc',decode_coords="all")[0].plot.imshow()
<matplotlib.image.AxesImage at 0x2a1cba4d390>

Calculate Catchment Statistics#
We are looking at a region over time. We need to make sure that the information content meets our expected quality. Therefore, we calculate the cloud percentage in the catchment for each time step and use it to filter the timeseries: all time steps with a cloud coverage of over 25% will be discarded.
Ultimately we are interested in the snow covered area (SCA) within the catchment. For each time step we count all snow covered pixels within the catchment. Multiplying this count by the pixel area gives the snow covered area; dividing it by the total number of pixels in the catchment gives the percentage of the catchment covered with snow. We will use this percentage, as the small example below illustrates.
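As a quick sanity check of this arithmetic, here is a small worked example using the nominal 10 m Sentinel-2 pixel size and the pixel counts we obtain later for 2022-02-05 (taken from the results table further down); it is purely illustrative and not part of the openEO workflow:
pixel_area_m2 = 10 * 10                        # nominal Sentinel-2 pixel size: 10 m x 10 m
n_catchment_px = 4201607                       # total pixels in the catchment on 2022-02-05
n_snow_px = 2160594                            # snow pixels in the catchment on 2022-02-05
sca_km2 = n_snow_px * pixel_area_m2 / 1e6      # snow covered area: ~216 km²
perc_snow = n_snow_px / n_catchment_px * 100   # snow cover percentage: ~51.4 %
print(sca_km2, perc_snow)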
Get number of pixels in catchment: total, clouds, snow.
# number of all pixels
n_catchment = ((snowmap_cloudfree > -1) * 1.0).add_dimension(name="bands",type="bands",label="n_catchment")
# number of cloud pixels (no function needed, mask already created before)
n_cloud = cloud_mask.add_dimension(name="bands",type="bands",label="n_cloud")
# number of snow pixels
n_snow = ((snowmap_cloudfree == 1) * 1.0).add_dimension(name="bands",type="bands",label="n_snow")
# combine the binary data cubes into one data cube
n_catchment_cloud_snow = n_catchment.merge_cubes(n_cloud).merge_cubes(n_snow)
# aggregate to catchment
n_pixels = n_catchment_cloud_snow.aggregate_spatial(geometries = catchment_outline['geometry'][0], reducer = 'sum')
n_pixels
Create batch job to start processing on the backend.
# Create a batch job
n_pixels_json = n_pixels.save_result(format="JSON")
job = n_pixels_json.create_job(title="n_pixels_json")
job.start_job()
job.status()
'finished'
if job.status() == "finished":
results = job.get_results()
results.download_files("results_openeo_platform/")
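As an aside, instead of starting the job and checking its status manually, the openEO Python client can also block until the job has finished; a minimal sketch of this alternative (it would start a second, identical job, so it is shown here only for reference):
# alternative: create the job, wait until it finishes, then download the results
job = n_pixels_json.create_job(title="n_pixels_json")
job.start_and_wait()
job.get_results().download_files("results_openeo_platform/")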
Load the result. It contains the number of pixels in the catchment, clouds and snow.
We can calculate the percentages of cloud and snow pixels in the catchment.
with open("results_openeo_platform/timeseries.json","r") as file:
n_pixels_json = json.load(file)
# check the first 3 entries
list(n_pixels_json.items())[:3]  # careful: unsorted dates due to the JSON format
[('2022-03-09T00:00:00Z', [[4201607.0, 24.0, 1932487.0]]),
('2022-03-12T00:00:00Z', [[4201607.0, 3377768.0, 588345.0]]),
('2022-04-08T00:00:00Z', [[4166981.0, 3649978.0, 121573.0]])]
# Create a Pandas DataFrame to contain the values
dates = [k for k in n_pixels_json]
n_catchment_vals = [n_pixels_json[k][0][0] for k in n_pixels_json]
n_cloud_vals = [n_pixels_json[k][0][1] for k in n_pixels_json]
n_snow_vals = [n_pixels_json[k][0][2] for k in n_pixels_json]
data = {
"time":pd.to_datetime(dates),
"n_catchment_vals":n_catchment_vals,
"n_cloud_vals":n_cloud_vals,
"n_snow_vals":n_snow_vals
}
df = pd.DataFrame(data=data).set_index("time")
# Sort the values by date
df = df.sort_values(axis=0,by="time")
df[:3]
| time | n_catchment_vals | n_cloud_vals | n_snow_vals |
|---|---|---|---|
| 2022-02-02 00:00:00+00:00 | NaN | 4201607.0 | NaN |
| 2022-02-05 00:00:00+00:00 | 4201607.0 | 857590.0 | 2160594.0 |
| 2022-02-07 00:00:00+00:00 | 4166981.0 | 4023172.0 | 152057.0 |
Divide the number of cloud pixels by the total number of pixels to get the cloud percentage.
perc_cloud = df["n_cloud_vals"].values / df["n_catchment_vals"].values * 100
df["perc_cloud"] = perc_cloud
df[:3]
| time | n_catchment_vals | n_cloud_vals | n_snow_vals | perc_cloud |
|---|---|---|---|---|
| 2022-02-02 00:00:00+00:00 | NaN | 4201607.0 | NaN | NaN |
| 2022-02-05 00:00:00+00:00 | 4201607.0 | 857590.0 | 2160594.0 | 20.411000 |
| 2022-02-07 00:00:00+00:00 | 4166981.0 | 4023172.0 | 152057.0 | 96.548844 |
Plot the cloud percentage timeseries together with the 25% cloud threshold. Time steps with a higher cloud cover will be excluded later on.
df.plot(y="perc_cloud",rot=45,kind="line",marker='o')
plt.axhline(y = 25, color = "r", linestyle = "-")
plt.show()

Divide the number of snow pixels by the total number of pixels to get the snow percentage.
perc_snow = df["n_snow_vals"].values / df["n_catchment_vals"].values * 100
df["perc_snow"] = perc_snow
df[:3]
| time | n_catchment_vals | n_cloud_vals | n_snow_vals | perc_cloud | perc_snow |
|---|---|---|---|---|---|
| 2022-02-02 00:00:00+00:00 | NaN | 4201607.0 | NaN | NaN | NaN |
| 2022-02-05 00:00:00+00:00 | 4201607.0 | 857590.0 | 2160594.0 | 20.411000 | 51.423039 |
| 2022-02-07 00:00:00+00:00 | 4166981.0 | 4023172.0 | 152057.0 | 96.548844 | 3.649093 |
Plot the unfiltered snow percentage
df.plot(y="perc_snow",rot=45,kind="line",marker='o')
plt.show()

Keep only the dates with cloud coverage less than the threshold
df_filtered = df.loc[df["perc_cloud"]<25]
Plot the cloud-filtered snow percentage
df_filtered.plot(y="perc_snow",rot=45,kind="line",marker='o')
plt.show()

Save the cloud-filtered snow percentage
df_filtered.to_csv("results_openeo_platform/filtered_snow_perc.csv")
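To double-check the exported file, we can simply read it back in; a small sketch that re-loads the CSV written above:
# re-load the exported snow percentage timeseries and inspect the first rows
df_check = pd.read_csv("results_openeo_platform/filtered_snow_perc.csv",
                       index_col="time", parse_dates=True)
df_check.head()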