02h — Data clean (HEALPix-NESTED nside=64 substrate)

Phase C — substrate-robustness branch. This notebook is a thin orchestration wrapper around the four cleaning + indexing scripts in healpix_port/, mirroring the existing CEA pipeline (02_data_clean.py) but on the HEALPix nside=64 NESTED substrate instead of Soroye’s 100 km cylindrical-equal-area grid.

At nside=64 the HEALPix cell area is ~0.84 deg^2 (~92 x 92 km), scale-matched to the CEA cell width Soroye fitted on. The two substrates are independent equal-area tessellations of the sphere; if sc_TEI_delta recovers at the same magnitude on both, the mechanism is shown to be substrate-robust (a stronger contribution than pure CEA-on-CEA reproduction).

Per DOMAIN.md, HEALPix indexing is always NESTED (hierarchical bit-shift refinement: parent = pix >> 2, children = (pix << 2) | k). We use healpix-geo, not healpy, because healpy’s cosmology-first lon convention (0-360) and absent CRS handling accumulate small biases that are unacceptable for biodiversity-precision climate-impact work over Iberia.

Scripts run in order:

01_clean_data_iberia_healpix.py — clean GBIF Iberia + Kerr 2015 species filter + HEALPix nside=64 cell assignment.
02_presence_absence_healpix.py — per-(species x cell x period_season) presence/absence on the flat Iberia HEALPix cell list.
03_sampling_continent_healpix.py — per-cell sampling-effort raster (distinct LYIDs) + continent code (constant 2 / Europe).
04_climate_tei_pei_healpix.py — bilinear-sample CRU TS 3.24.01 at HEALPix cell centres + compute TEI / PEI per species per cell per period.

import os
import subprocess
import sys
from pathlib import Path

ROOT = Path("..").resolve()
PORT = ROOT / "healpix_port"
OUT_DIR = PORT / "outputs_iberia"

env = {**os.environ}


def run(script: str) -> None:
    print(f"\n=== {script} ===", flush=True)
    subprocess.run(
        [sys.executable, script],
        cwd=PORT,
        env=env,
        check=True,
    )

Run the four upstream cleaning + indexing scripts (HEALPix substrate)¶

run("01_clean_data_iberia_healpix.py")
run("02_presence_absence_healpix.py")
run("03_sampling_continent_healpix.py")
run("04_climate_tei_pei_healpix.py")

Summary of intermediate artefacts produced¶

expected = [
    OUT_DIR / "bombus_clean_healpix.csv",
    OUT_DIR / "presence_absence_healpix.nc",
    OUT_DIR / "sampling_continent_healpix.nc",
    OUT_DIR / "climate_tei_pei_healpix.nc",
]
print("\nIntermediate artefacts (HEALPix nside=64 NESTED):")
for p in expected:
    if p.exists():
        size = p.stat().st_size
        print(f"  ok    {p.relative_to(ROOT)}  ({size:,} bytes)")
    else:
        print(f"  MISS  {p.relative_to(ROOT)}")

import pandas as pd  # noqa: E402

import xarray as xr  # noqa: E402

clean_csv = OUT_DIR / "bombus_clean_healpix.csv"
if clean_csv.exists():
    df = pd.read_csv(clean_csv)
    print(f"\nbombus_clean_healpix.csv -> {len(df):,} rows, "
          f"{df['species'].nunique()} species")
    print(f"  HEALPix cells with at least one occurrence: "
          f"{df['cell_id_hp'].nunique()}")
    print(f"  periods present: {sorted(df['period'].dropna().unique().tolist())}")
    print(f"  seasons present: {sorted(df['season'].dropna().unique().tolist())}")

pa_nc = OUT_DIR / "presence_absence_healpix.nc"
if pa_nc.exists():
    pa = xr.open_dataset(pa_nc)
    depth = pa.attrs['dggs_grid_refinement_level']
    print(
        f"\nIberia HEALPix substrate: {int(pa.sizes['cells'])} cells "
        f"(nside={2**depth}, depth={depth}, "
        f"scheme={pa.attrs['dggs_grid_indexing_scheme']!r})"
    )