Data Formats and Performance#

Learning Objectives#

  • Understand what cloud native data formats are

  • Understand how computing is done more efficiently in the cloud

Outline#

  • Cloud Native Data Formats

    • Why do we need them

    • What is their advantage over classical file formats

    • Not one format to rule them all.

    • Examples

  • Performance in the cloud

    • Tiling

    • Scaling

    • Distributed Computing

  • The Microsoft Planetary Computer setup - a state-of-the-art open source cloud native technology stack

    • ODC, STAC, xarray, dask, COGs, …

Cloud native data formats#

What are cloud native data formats#

Cloud native formats, or cloud-optimized formats, are file formats specifically designed to optimize the storage, access, and processing of geospatial data in cloud computing environments. These formats are tailored to leverage the scalability, flexibility, and parallel processing capabilities of cloud infrastructure, enabling efficient handling of large-scale datasets.

Cloud Native Data Formats

Video content in cooperation with Aimee Barciauskas (DevelopmentSeed) and Ryan Avery (DevelopmentSeed).
“Cloud-optimised means organizing so subsets of data can be read. Ideally, the data is also compressed. Both of these factors minimize the amount of data that has to be transferred across a network.”

Characteristics of cloud native data formats#

Cloud-optimized mainly means optimized “read” access, with support for partial reads and parallel reads. The main characteristics common to cloud-optimized formats are:

  • Data Chunking: Cloud native formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset (see the sketch after this list).

  • Internal Indexing: These formats incorporate internal indexing structures that facilitate fast spatial and attribute queries. This enables efficient data access and retrieval operations without the need for extensive scanning or processing of the entire dataset.

  • Metadata Optimization: Cloud native formats optimize how metadata is stored and indexed, allowing the metadata associated with the data to be accessed and retrieved efficiently, typically up front and with a small number of requests. This supports faster discovery and interpretation of data properties and characteristics.

  • Compression and Tiling: Cloud native formats often employ advanced compression techniques to reduce storage requirements while maintaining data quality. Additionally, they utilize tiling strategies to divide the data into smaller, manageable tiles that can be independently accessed and processed.
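To make the chunking idea concrete, the sketch below writes a small chunked array with the zarr library and reads back only one chunk-sized slice; the file name, array shape, and chunk size are arbitrary example values (a minimal sketch, assuming zarr and numpy are installed):

```python
import numpy as np
import zarr

# Create a chunked array on disk; shape and chunk size are example values.
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(4_000, 4_000),
    chunks=(1_000, 1_000),   # stored as 16 independent 1000x1000 chunks
    dtype="float32",
)
z[:] = np.random.random((4_000, 4_000)).astype("float32")

# Reading a slice only touches the chunks that intersect it,
# so the full array never has to be loaded into memory.
subset = z[0:1_000, 0:1_000]
print(subset.shape)  # (1000, 1000)
```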

HTTP range requests allow clients to request only a specific portion or byte range of a file instead of the complete dataset.
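For example, the header and internal index of a cloud-optimized file can often be read with a single small range request; the URL below is a placeholder (a minimal sketch using the requests library):

```python
import requests

# Placeholder URL - replace with an actual file on cloud storage.
url = "https://example.com/data/scene.tif"

# Ask the server for the first 1024 bytes only, instead of the whole file.
response = requests.get(url, headers={"Range": "bytes=0-1023"})

# A server that supports range requests answers with 206 Partial Content.
print(response.status_code)   # 206 if byte ranges are supported
print(len(response.content))  # at most 1024 bytes transferred
```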

Examples of cloud native data formats#

  • COG - Cloud-Optimized GeoTIFF (COG) is an optimized version of the GeoTIFF format. It organizes raster data into internal tiles (chunks) and applies compression, laying the file out so that clients can fetch only the parts they need via HTTP range requests (see the sketch after this list).

  • Zarr - A format specifically designed for storing and accessing multidimensional arrays. It supports chunking, compression, and parallel processing, making it suitable for large-scale geospatial datasets, for example, weather data. Metadata is stored in separate, lightweight files alongside the data chunks, so it can be read without touching the array data itself.

  • FlatGeobuf - A cloud-optimized vector data format. It is a binary encoding format for geodata and holds a collection of Simple Features.
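As an illustration of partial reads from a COG, the sketch below opens a raster and reads a single 512×512 window; the path is a placeholder, and for a remote https:// or s3:// URL rasterio/GDAL issue the necessary range requests under the hood (a minimal sketch, assuming rasterio is installed):

```python
import rasterio
from rasterio.windows import Window

# Placeholder path - for cloud data this would typically be an https:// or
# s3:// URL pointing at a Cloud-Optimized GeoTIFF.
path = "example_cog.tif"

with rasterio.open(path) as src:
    # Size, CRS, and tiling are read from the file header only.
    print(src.width, src.height, src.crs)

    # Read just a 512x512 window of band 1; for a remote COG only the
    # internal tiles overlapping this window are fetched.
    window = Window(col_off=0, row_off=0, width=512, height=512)
    data = src.read(1, window=window)

print(data.shape)  # (512, 512)
```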

Available Material#

Scaling#

  • Show concepts of scaling

  • Mention resource usage on cloud platforms. Track resources. Every processing step has its cost.

Scaling refers to the process of increasing or decreasing the capacity or size of a system to handle a larger or smaller workload or data volume. Scaling does not necessarily mean growing larger; it also means scaling down to avoid unnecessary costs in times when there is no traffic. In our context, scaling involves managing the growth of data, traffic, or processing requirements to ensure optimal performance and availability.

Horizontal vs vertical scaling#

Two classical approaches to scaling are horizontal and vertical:

  • Horizontal scaling: Also known as scaling out, horizontal scaling involves adding more machines or nodes to distribute the workload across a larger number of resources. This could mean adding more servers or instances to handle increased traffic or data processing demands. Horizontal scaling offers the advantage of improved fault tolerance and increased capacity, as the workload is spread across multiple resources.

  • Vertical scaling: Also known as scaling up, vertical scaling involves increasing the capacity of an individual machine or resource. This can be achieved by upgrading the hardware, such as adding more powerful processors, memory, or storage, to handle the growing demands of the geospatial application. Vertical scaling is often suitable for applications with single-node architectures or when the workload cannot be easily distributed across multiple machines.

Both horizontal and vertical scaling have their advantages and considerations. Horizontal scaling provides better scalability and fault tolerance, as it can handle increased traffic and processing by adding more resources. However, it may require additional effort to distribute and synchronize data across multiple nodes. Vertical scaling, on the other hand, simplifies data management as all resources are contained within a single node, but it may have limitations in terms of the maximum capacity a single machine can handle.

In common workflows, a combination of both approaches is used to ensure optimal speed and resource utilization while being able to keep the simplicity of a workflow.
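In the Python/dask ecosystem (part of the technology stack mentioned in the outline), the two directions can be sketched roughly as follows; the worker counts and sizes are arbitrary example values, and a real cloud deployment would typically use a Kubernetes or Dask Gateway cluster rather than a local one (a minimal sketch, assuming dask.distributed is installed):

```python
from dask.distributed import Client, LocalCluster

# "Vertical" flavour: one worker with more threads and memory.
big_worker = LocalCluster(n_workers=1, threads_per_worker=8, memory_limit="16GB")

# "Horizontal" flavour: many smaller workers sharing the load.
many_workers = LocalCluster(n_workers=8, threads_per_worker=1, memory_limit="2GB")

# Scale the number of workers up or down as the workload changes.
many_workers.scale(16)  # scale out for a heavy job
many_workers.scale(4)   # scale back in to avoid paying for idle resources

client = Client(many_workers)
print(client)
```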

How to scale#

There are many approaches to handling scaling properly.

todo: parallel computing section
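One widespread approach is chunk-based parallelism: an array is split into chunks and a scheduler processes the chunks in parallel, locally or on a distributed cluster. The sketch below uses dask arrays; the array size and chunk size are arbitrary example values (a minimal sketch, assuming dask is installed):

```python
import dask.array as da

# A large array, split into 1000x1000 chunks that can be processed in parallel.
x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

# Building the expression is cheap: nothing is computed yet (lazy evaluation).
mean_per_row = x.mean(axis=1)

# compute() triggers the actual work, chunk by chunk, across all available
# workers (local threads by default, or a distributed cluster if configured).
result = mean_per_row.compute()
print(result.shape)  # (20000,)
```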

Subscription vs. On-Demand usage#

Subscription: A subscription model involves a fixed, recurring payment made by the user to access and utilize the cloud platform’s services over a specific period, typically monthly or annually. Under a subscription model, users typically commit to a predetermined level of resources and pay a regular fee regardless of the actual usage during that period. This model often provides cost predictability and may offer discounts or benefits for long-term commitments. Users can usually choose from various options and combinations of resources (e.g. GPU, CPU, and disk storage).

Advantages of the Subscription Model:

  • Cost Predictability: Users have a clear understanding of the ongoing costs as they pay a fixed fee.

  • Potential Cost Savings: Subscriptions may offer discounts or cost benefits for longer-term commitments.

  • Continuous Service Access: Users have continuous access to the subscribed services without the need for frequent renewal or payment management.

On-Demand Usage: In an on-demand model, users pay for the cloud platform’s services based on actual usage and consumption. Users are charged on a pay-as-you-go basis, where they pay for the resources or services they utilize in a given period. There are no long-term commitments or fixed fees. This model offers flexibility and scalability, allowing users to scale resources up or down as per their needs.

Advantages of On-Demand Usage:

  • Flexibility: Users have the flexibility to adjust resources based on their varying demands, scaling up or down as required.

  • Cost Efficiency: Users only pay for what they use, making it suitable for workloads with unpredictable or fluctuating resource needs.

  • No Long-Term Commitments: On-demand models do not require users to commit to a specific period or predefined resource levels, allowing for agility and quick adjustments.

Choosing between subscription and on-demand models depends on various factors, including the nature of your workloads, budget considerations, and usage patterns. Based on these (and on data availability), users can choose the platform that suits their needs best. Reviewing the pricing details is an important step before selecting a working environment.
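As a toy illustration of this trade-off, the snippet below compares a fixed subscription fee with pay-as-you-go pricing for different usage levels; all prices are hypothetical placeholders, not real provider prices:

```python
# Hypothetical prices - real cloud pricing differs per provider and region.
subscription_per_month = 300.0   # fixed monthly fee for a reserved machine
on_demand_per_cpu_hour = 0.40    # pay-as-you-go price per CPU hour

# Break-even point: above this usage the subscription becomes cheaper.
break_even_hours = subscription_per_month / on_demand_per_cpu_hour
print(f"Break-even at {break_even_hours:.0f} CPU hours per month")

for cpu_hours in (100, 500, 750, 1000):
    on_demand_cost = cpu_hours * on_demand_per_cpu_hour
    cheaper = "on-demand" if on_demand_cost < subscription_per_month else "subscription"
    print(f"{cpu_hours:>4} CPU hours: {on_demand_cost:6.2f} EUR on-demand -> {cheaper}")
```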

Cost of scalability#

Direct examples of computing on a workflow (todo: based on actual workflow)

Memory consumption#


Difference between platform usage and cloud directly#

Using platforms removes complexity and adds abstraction layers compared to working with cloud infrastructure directly. This topic is covered in more detail in the platform lesson.

What to avoid and what are the limitations#

While scaling provides many options and is essential for achieving results at a larger scale, there are some limitations to keep in mind and some practices to avoid.

Costs: One of the main factors to consider is the cost of computing. Scaling resources dynamically can lead to increased costs, especially if not properly managed. It is essential to monitor resource usage and set appropriate maximum scaling policies to ensure cost optimization. Failure to do so may result in unnecessary provisioning of resources, leading to higher expenses. Purchasing large amounts of computational resources can be easy, but very costly. Code optimization is important to ensure there are no memory leaks, unnecessary data storage, or other expensive operations.

Data Access: In geospatial cloud workflows, one of the big challenges lies in data access and optimal data storage. An easy trap is loading large portions of unnecessary data without applying the correct filters up front. Such data volumes increase RAM or disk space requirements, resulting in higher processing costs or longer run times (and more computational time) just to load the data.

Accessing data as files: Geospatial data are stored in many formats, as discussed in this lesson, and some are more suitable for access in the cloud than others. Evaluating the metadata first, before loading the whole dataset, saves both time and money (see the sketch at the end of this section).

Latency and Data Transfer: In distributed and scaled-out architectures, managing data transfer and minimizing latency can be crucial. Moving data between services or instances across different locations can introduce network overhead and impact application performance. Efficient data caching or data partitioning strategies can help mitigate these challenges.

Scaling Limits in the platform: While cloud platforms offer high scalability, there are still practical limits to consider. Every service or resource has its scalability limits, such as maximum instance count, storage capacity, or network throughput. It is important to understand these limitations and design your programs and applications accordingly.

To mitigate these challenges and limitations, it’s advisable to thoroughly plan and architect your application for scalability, leverage cloud-native tools and services, monitor resource usage and costs, and regularly test and optimize your scaling strategies. Additionally, staying updated with the latest advancements in cloud technologies and best practices will help you navigate the complexities of cloud-native scaling more effectively.
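To make the “evaluate metadata first” advice concrete, the sketch below queries a STAC catalogue for scenes matching a small area and time range before any pixel data is touched. It assumes the pystac-client package and the public STAC API of the Microsoft Planetary Computer mentioned in the outline; the collection name, bounding box, and dates are arbitrary example values:

```python
from pystac_client import Client

# Public STAC API of the Microsoft Planetary Computer.
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Search metadata only: a small bounding box and a short time range.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[11.3, 46.4, 11.5, 46.6],      # lon/lat, example area
    datetime="2023-06-01/2023-06-30",
)

items = list(search.items())
print(f"{len(items)} scenes match the query")

# Only now would we decide which assets (bands) to actually read,
# instead of downloading whole scenes up front.
for item in items[:3]:
    print(item.id, item.datetime)
```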

Animated Content: Tiling and application (drag and drop)#

Quiz#

What are cloud native data formats in EO and GIS?

[(X)] Cloud native data formats in EO should be compatible with cloud services (APIs, HTTP requests, cloud storage) and enable fast viewing and access to spatial sub-regions.
[( )] Cloud native data formats in EO are exclusively designed to be compressed as much as possible, so that the least amount of storage space is necessary.
[( )] Cloud native data formats in EO have to be human readable when you open them in a text editor.

Connect the cloud native data format and its predecessor to its spatial data type (https://www.ogc.org/blog-article/towards-a-cloud-native-geospatial-standards-baseline/, https://cholmes.medium.com/an-overview-of-cloud-native-vector-c223845638e0)

  • [[raster] [n-dimensional raster] [vector]]

  • [( ) ( ) (X) ] Shapefile

  • [( ) (X) ( ) ] Zarr

  • [( ) ( ) (X) ] GeoParquet/Arrow

  • [( ) ( ) (X) ] FlatGeobuf

  • [(X) ( ) ( ) ] GeoTIFF

  • [( ) (X) ( ) ] NetCDF

  • [(X) ( ) ( ) ] Cloud Optimized GeoTIFF (COG)

Why is lazy-loading essential when working with large data collections?

Compression Quiz#

Consider a cloud provider that has to decide whether compressing the data on its storage would save money.

If compressing COGs takes 0.05 CPU hours per GB of data, the COGs on the storage amount to 1200 TB in total, and one CPU hour costs 0.05€ for the first 10000 CPU hours and then 0.03€ for the rest, how much would the compression process cost?

[( )] 3000€
[( )] 1200€
[(X)] 2000€
[( )] 2043€

Solution:

  • 1 TB = 1000 GB following the international standard, see https://en.wikipedia.org/wiki/Gigabyte.

  • The amount of data on the storage is 1200 TB * 1000 GB/TB = 1200000 GB

  • If compressing 1 GB of data takes 0.05 CPU hours, compressing 1200000 GB takes: 1200000 GB * 0.05 CPU hours/GB = 60000 CPU hours

  • The first 10000 CPU hours cost 10000 CPU hours * 0.05 €/CPU hour = 500 €

  • The remaining 50000 CPU hours cost 50000 CPU hours * 0.03 €/CPU hour = 1500 €

  • The total is 500 € + 1500 € = 2000 €

In the same scenario, we now also consider the storage maintenance cost. If each TB of storage has a maintenance cost of 0.5 € per month, how long would it take before the compression starts saving the cloud provider money? Consider that the compressed data occupies 70% of the original volume.

[( )] ~ 9 months
[( )] ~ 10 months
[(X)] ~ 11 months
[( )] ~ 12 months

Solution:

  • Without compression, every month the storage would cost: 1200 TB * 0.5 €/TB = 600 €

  • With a compression rate of 70%, the data would use 1200 TB * 0.7 = 840 TB of disk space

  • With compression, every month the storage would cost: 840 TB * 0.5 €/TB = 420 €

  • With compression, 180 € can therefore be saved each month.

  • Since the compression process costs 2000 €, and each month costs 180 € less with it, it would take 2000 € / 180 €/month ≈ 11 months to start saving money.

Calculate the volume of data needed for a given workflow according to the collections, time steps, bands, resolutions, spatial extent, value types, format, compression estimate, etc. selected in the previous chapter.
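A back-of-the-envelope estimate could be set up as follows; every number here is a hypothetical placeholder to be replaced with the values of your own workflow:

```python
# Hypothetical workflow parameters - replace with your own selection.
width_px = 10_980          # pixels in x (e.g. one Sentinel-2 tile at 10 m)
height_px = 10_980         # pixels in y
n_bands = 4                # number of bands used
n_time_steps = 36          # number of acquisitions in the time range
bytes_per_value = 2        # e.g. uint16 values

raw_bytes = width_px * height_px * n_bands * n_time_steps * bytes_per_value
raw_gb = raw_bytes / 1e9   # using 1 GB = 10^9 bytes

compression_ratio = 0.5    # assume data compresses to ~50 % of the raw size
compressed_gb = raw_gb * compression_ratio

print(f"Raw volume:        {raw_gb:8.1f} GB")
print(f"Compressed volume: {compressed_gb:8.1f} GB")
```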