COG and Zarr for Geospatial Data
Note: Let’s use this document to explain the difference between COG and Zarr: what each format is, when (and when not) to use it, and which tools support it.
COG aka Cloud Optimized GeoTIFF
**describe COG format with simple wording**
Bonus: Example of available dataset (Landsat, Sentinel 2, Sentinel 1 …)
Zarr
The Zarr format is a container for dense N-dimensional array data. Its scope is similar to that of HDF5 and TileDB, and much broader than just geospatial data. The main motivation for creating Zarr was to have a simple, transparent, open, and community-driven format that supports high-throughput distributed I/O on different storage systems. Zarr was originally created for use in genomic analysis but has since found applications in a wide range of scientific domains, including geospatial rasters.
The core object in Zarr is an array: an N-dimensional numerical array with metadata. The array is split into user-specified regular chunks. Each chunk is optionally compressed and stored as an individual file / object whose name describes the chunk’s position within the array (e.g. 2.1). In the Zarr v2 layout, system metadata (data type, array shape, chunks, compression parameters, etc.) are stored in one JSON file (.zarray), while arbitrary user metadata are stored in another (.zattrs). A key attribute of this format is its simplicity, which makes it relatively straightforward to implement native readers / writers in any programming language.
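To make the chunk naming concrete, here is a minimal sketch using only the Python standard library, not the zarr package itself. The array shape, chunk shape, and compressor settings are made-up values for illustration, and the helper names `chunk_key` and `chunk_grid` are hypothetical. The metadata dictionary follows the field names of the Zarr v2 layout.

```python
import json

# Hypothetical array: 3 time steps of a 1000 x 1000 raster, chunked 1 x 256 x 256.
shape = (3, 1000, 1000)
chunks = (1, 256, 256)

# System metadata as it could appear in ".zarray" (Zarr v2 field names).
zarray = {
    "zarr_format": 2,
    "shape": list(shape),
    "chunks": list(chunks),
    "dtype": "<f4",
    "compressor": {"id": "zlib", "level": 1},
    "fill_value": "NaN",
    "filters": None,
    "order": "C",
}
zarray_json = json.dumps(zarray, indent=2)


def chunk_key(index, chunks, separator="."):
    """Return the key of the chunk that holds the element at `index`."""
    return separator.join(str(i // c) for i, c in zip(index, chunks))


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension, rounded up to cover the edges."""
    return tuple((s + c - 1) // c for s, c in zip(shape, chunks))


print(chunk_key((2, 300, 10), chunks))  # 2.1.0
print(chunk_grid(shape, chunks))        # (3, 4, 4)
```

A reader that wants element (2, 300, 10) therefore only needs the metadata file and the single object named 2.1.0.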
Zarr Arrays can be stored together in a tree hierarchy as a Zarr Group. In addition to Array metadata, there can be group-level user metadata.
Zarr data can be stored in any storage system that can be represented as a key-value store. The most common storage system is a POSIX filesystem, where the keys are just filenames. Zarr also maps easily to cloud object storage (e.g. AWS S3, where keys are object names). To assist with portability, a store can be packed into a standard zip file, which (unlike gzip) supports reading individual keys without decompressing the entire file, either from a local file or, via HTTP Range requests, from object storage. Zarr also supports more exotic storage backends such as relational (SQL) databases, document databases (MongoDB) and key-value databases (Redis). Implementing a new storage layer is generally as simple as writing a key-value adaptor for it. This sort of hackability is a feature, not a bug, of Zarr, but it does come into conflict with the goal of interoperability; not every language’s implementation can support every possible store.
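The key-value idea and the zip behaviour can be sketched with the Python standard library alone. This is an illustration under assumptions, not how the zarr package implements its stores: the key names follow the Zarr v2 layout, the array name "temperature" is hypothetical, and the payloads are placeholder bytes rather than real metadata or chunks.

```python
import io
import zipfile

# A Zarr store is conceptually a mapping from string keys to bytes.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "temperature/.zarray": b"{}",        # placeholder, not valid array metadata
    "temperature/0.0": b"chunk-0.0",     # placeholder chunk payloads
    "temperature/0.1": b"chunk-0.1",
}

# Pack the store into a zip archive (in memory here; a file works the same way).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as zf:
    for key, value in store.items():
        zf.writestr(key, value)

# Read back a single key. The zip central directory records where each member
# starts, so the other members are not read or decompressed.
with zipfile.ZipFile(buffer, mode="r") as zf:
    keys = sorted(zf.namelist())
    chunk = zf.read("temperature/0.1")

print(keys)
print(chunk)  # b'chunk-0.1'
```

The same per-member offsets are what make byte-range reads against a zipped store on object storage possible.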
When storing geospatial data in Zarr, it is very common to adhere to the NetCDF / CF metadata conventions.
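As a rough illustration of what CF-style user metadata can look like, the sketch below builds two attribute dictionaries and serializes them as JSON. The variable names ("temperature", "crs") and dimension names are hypothetical. The attribute names `standard_name`, `units`, `grid_mapping`, and `grid_mapping_name` come from the CF conventions, and `_ARRAY_DIMENSIONS` is the convention xarray uses to name dimensions in Zarr v2 stores.

```python
import json

# User attributes for a hypothetical temperature array.
temperature_attrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],  # xarray convention for Zarr v2
    "standard_name": "air_temperature",           # CF standard name
    "units": "K",
    "grid_mapping": "crs",                        # name of the variable describing the CRS
}

# Attributes of the hypothetical "crs" variable referenced above (WGS 84 ellipsoid).
crs_attrs = {
    "grid_mapping_name": "latitude_longitude",
    "semi_major_axis": 6378137.0,
    "inverse_flattening": 298.257223563,
}

temperature_json = json.dumps(temperature_attrs, indent=2)
crs_json = json.dumps(crs_attrs, indent=2)
print(temperature_json)
print(crs_json)
```

Nothing in Zarr itself enforces these attributes; a reader has to know the convention to find the CRS.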
Compared to COG, the key differences are:
- Zarr supports an arbitrary number of dimensions, while COG supports only two spatial dimensions (plus bands). This means that an entire [hyper]cube of data can be stored in a single Zarr Array, while the same data stored as COG would require many distinct, standalone files. To mitigate this, COG couples closely with STAC to catalog larger collections of data. Many aspects of STAC (e.g. “items”) are unnecessary when working with Zarr.
- Zarr supports arbitrary chunking, while COG uses one fixed internal tile grid within each file.
  - Optimal chunking depends on the use case:
    - regular updates (appending new time steps) → time chunk size = 1
    - reading large areas → large spatial chunks
    - reading a time series at a point → large time chunks and small spatial chunks
- Zarr, being designed for a broader scope than geospatial, has no specific requirement for the SRS/CRS. Often the SRS is not listed at all (default = EPSG:4326?) or is written in some non-standard place in the metadata.
- Zarr has no native support for multi-scale data (i.e. pyramids, or multi-resolution bands such as Sentinel-2). All layers are at the same resolution. There are proposals to address this, but nothing has been agreed yet.
- Zarr does not require equidistant coordinates. Irregular spacing is common for the time dimension, but it can add complexity when used for spatial coordinates.
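The chunking trade-offs listed above can be made concrete with a little arithmetic. The sketch below counts how many chunks a read touches for two chunk layouts. The cube size, chunk shapes, and the helper name `chunks_touched` are hypothetical values chosen for illustration, and the count ignores compression and request latency.

```python
def chunks_touched(start, stop, chunks):
    """Count chunks intersected by the half-open hyper-rectangle [start, stop)."""
    count = 1
    for lo, hi, c in zip(start, stop, chunks):
        first = lo // c                 # index of the first chunk touched
        last_excl = (hi + c - 1) // c   # one past the last chunk touched
        count *= last_excl - first
    return count


# Hypothetical cube with dimensions (time, y, x) = (365, 4096, 4096).
layouts = {
    "time=1, large spatial": (1, 1024, 1024),
    "long time, small spatial": (365, 32, 32),
}

# Read 1: one full spatial slice at a single time step.
area_read = ((0, 0, 0), (1, 4096, 4096))
# Read 2: the full time series at a single pixel.
point_read = ((0, 2000, 2000), (365, 2001, 2001))

for name, chunks in layouts.items():
    print(name,
          "| area read:", chunks_touched(*area_read, chunks),
          "| point read:", chunks_touched(*point_read, chunks))
# time=1, large spatial | area read: 16 | point read: 365
# long time, small spatial | area read: 16384 | point read: 1
```

Neither layout is good at both access patterns, which is why the chunk shape should follow the dominant use case.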
Bonus: Example of available dataset
https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr (see https://www.openmicroscopy.org/2020/11/04/zarr-data.html for viewing). Formatted according to https://ngff.openmicroscopy.org/ to make a “multiscale image” out of raw Zarr.
Papers about Zarr and performance / scalability: