
Pangeo ML WG Blog Post

Newer outline

Title: ML Patterns for Geosciences

Author:

Introduction

Why patterns vs libraries?

flow-chart of typical ML process?

Data preparation (one article)

Depends on where the data is produced

Chunking depends on sampling strategy

Chunking good for ML training is not necessarily good for evaluation

column-based

image-like
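
To make the column-based vs image-like distinction concrete, here is a minimal sketch using NumPy on a toy array. The dimension order (time, level, lat, lon), the sizes, and the variable names are assumptions chosen for illustration, not something fixed by this outline.

```python
import numpy as np

# Hypothetical toy field with dimensions (time, level, lat, lon).
n_time, n_level, n_lat, n_lon = 4, 10, 8, 16
field = np.arange(
    n_time * n_level * n_lat * n_lon, dtype=np.float32
).reshape(n_time, n_level, n_lat, n_lon)

# Column-based samples: one vertical profile per (time, lat, lon) point.
# Move the level axis last, then flatten the other axes into a sample axis.
columns = np.moveaxis(field, 1, -1).reshape(-1, n_level)

# Image-like samples: one (level, lat, lon) block per time step,
# with vertical levels playing the role of image channels.
images = field  # already laid out as (sample, channel, height, width)

print(columns.shape)  # (512, 10)
print(images.shape)   # (4, 10, 8, 16)
```

In this layout a column sample needs the whole level axis at a single horizontal point, while an image sample needs the whole lat/lon plane at a single time, which is one reason a chunking that suits one sampling strategy may be a poor fit for the other.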

Evaluation is more complicated (another)

multivariate (temperature, humidity, etc)

spatial structure (even if training is column-based)

ML metrics are not necessarily easy to compute on the training data

training data is good for ML training!

Sometimes need to couple ML to

Machine workflows (another article)

Outline

Data Loading

Potential bottlenecks (need data locality for each step to work well)

HD to main memory

main memory to CPU caches

main memory to GPU memory

Static model outputs

open_mfdataset is still slow

Streaming processing

Storage formats:

row-based vs column-based

How much/how to preserve metadata (e.g. variable names, coordinate info, units) as data flows to ML algorithms
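
One possible pattern, sketched here under assumptions: keep variable names and units in plain Python lists next to the stacked feature array, so model outputs can be mapped back to named variables. The variable names, values, and the helper `unstack_features` are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical input variables: name -> (values, units).
variables = {
    "temperature": (np.array([280.0, 285.0, 290.0]), "K"),
    "humidity": (np.array([0.004, 0.006, 0.008]), "kg/kg"),
}

# Record the metadata once, in a fixed order, before stacking.
names = list(variables)
units = [variables[name][1] for name in names]

# The ML algorithm only sees this bare (sample, feature) array.
features = np.stack([variables[name][0] for name in names], axis=-1)


def unstack_features(array, names, units):
    """Reattach names and units to the last axis of a feature array."""
    return {
        name: (array[..., i], unit)
        for i, (name, unit) in enumerate(zip(names, units))
    }


restored = unstack_features(features, names, units)
```

This keeps only names and units; coordinate information would need a similar side channel or a labeled-array library.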

Shuffling
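
When data is streamed rather than fully loaded, a full shuffle is not possible, and one common approximation is a fixed-size shuffle buffer. The following is a minimal sketch using only the standard library; the function name `shuffle_buffer` and its parameters are hypothetical, not taken from any particular data-loading library.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in approximately shuffled order.

    Only `buffer_size` items are held in memory at once, so the result
    is a local shuffle, not a uniform permutation of the whole stream.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing to emit yet.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever remains, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10))
```

A larger buffer gives a better shuffle at the cost of memory. Each item appears exactly once, but with a small buffer, samples that are close in the stream tend to stay close in the output.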

Coupling to climate models

copy-free semantics needed

ease of use/flexibility is key

can Python be used efficiently?
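
As a small illustration of copy-free semantics from Python, the sketch below wraps an existing memory buffer as a NumPy array without copying it. The `bytearray` is only a stand-in for state memory owned by a climate model; how a real model exposes its memory to Python is an assumption left open here.

```python
import numpy as np

# Stand-in for a state buffer owned by the climate model (hypothetical):
# 3 columns x 4 levels of float64 values, 8 bytes each.
n_columns, n_levels = 3, 4
model_buffer = bytearray(n_columns * n_levels * 8)

# Wrap the existing memory as an array; no data is copied.
state = np.frombuffer(model_buffer, dtype=np.float64).reshape(n_columns, n_levels)

# A write through the array lands directly in the underlying buffer,
# as an ML component updating the model state in place would require.
state[1, :] = 273.15

# A second view of the same buffer sees the update immediately.
mirror = np.frombuffer(model_buffer, dtype=np.float64)
print(np.shares_memory(state, mirror))  # True
```

Whether this stays efficient in a coupled run depends on how often control passes between the model and Python, which this sketch does not measure.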