Pangeo ML WG Blog Post

Newer outline

Title: ML Patterns for Geosciences
Author:
  • Introduction
  • Why patterns vs libraries?
  • Flowchart of a typical ML process?
  • Data preparation (one article)
  • Depends on where the data is produced
  • Chunking depends on sampling strategy
  • Chunking that suits ML training does not necessarily suit evaluation
  • column-based
  • image-like
  • Evaluation is more complicated (another)
  • multivariate (temperature, humidity, etc)
  • spatial structure (even if training is column-based)
  • ML metrics are not necessarily easy to compute on the training data
  • training data is good for ML training!
  • Sometimes need to couple ML to 
  • Machine workflows (another article)
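
A rough sketch of the chunking point above: how many chunks a single sample touches, and how much extra data gets read, for column-based vs image-like sampling. The dimension order (time, level, lat, lon), the sample shapes, and the chunk shapes are all made-up assumptions for illustration, and samples are assumed to start on a chunk boundary.

```python
from math import prod

def chunks_touched(sample_shape, chunk_shape):
    """Number of chunks overlapped by one sample.

    Assumes the sample starts on a chunk boundary, so the count per
    dimension is ceil(sample / chunk).
    """
    return prod(-(-s // c) for s, c in zip(sample_shape, chunk_shape))

def read_amplification(sample_shape, chunk_shape):
    """Elements read from storage divided by elements actually needed."""
    read = chunks_touched(sample_shape, chunk_shape) * prod(chunk_shape)
    return read / prod(sample_shape)

# Hypothetical dims: (time, level, lat, lon)
samples = {
    "column": (1, 50, 1, 1),        # one vertical column at one time step
    "image": (1, 1, 180, 360),      # one horizontal field at one time/level
}
chunkings = {
    "column-friendly": (1, 50, 10, 10),
    "image-friendly": (1, 1, 180, 360),
}

results = {}
for s_name, s_shape in samples.items():
    for c_name, c_shape in chunkings.items():
        results[(s_name, c_name)] = (
            chunks_touched(s_shape, c_shape),
            read_amplification(s_shape, c_shape),
        )
        print(s_name, c_name, results[(s_name, c_name)])
```

Under these assumed shapes, a layout that is cheap for one sampling strategy is expensive for the other, which is the reason training and evaluation may want different chunking.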

Outline

  • Data Loading
  • Potential bottlenecks (need data locality for each step to work well)
  • HD to main memory
  • main memory to CPU caches
  • main memory to GPU memory
  • Static model outputs
  • open_mfdataset is still slow
  • Streaming processing
  • Storage formats:
  • row-based vs column-based
  • How much/how to preserve metadata (e.g. variable names, coordinate info, units) as data flows to ML algorithms
  • Shuffling
  • Coupling to climate models
  • copy-free semantics needed
  • ease of use/flexibility is key
  • can Python be used efficiently?
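
One possible way to reconcile the shuffling and data-locality items above is a two-level shuffle: randomize the order in which chunks are visited, then randomize the samples within each chunk. The sketch below uses only the standard library. The function name `chunked_shuffle` and its parameters are hypothetical, and the result is only approximately a uniform shuffle, since samples from the same chunk stay adjacent in the stream.

```python
import random

def chunked_shuffle(n_samples, chunk_size, seed=0):
    """Yield sample indices in a chunk-local shuffled order.

    Each chunk is read once and fully consumed before moving on, so
    reads stay contiguous while the sample order is still randomized.
    """
    rng = random.Random(seed)
    starts = list(range(0, n_samples, chunk_size))
    rng.shuffle(starts)                      # shuffle the chunk order
    for start in starts:
        stop = min(start + chunk_size, n_samples)
        idx = list(range(start, stop))
        rng.shuffle(idx)                     # shuffle within the chunk
        yield from idx

order = list(chunked_shuffle(n_samples=10, chunk_size=4))
print(order)
```

A common refinement, not shown here, is to mix samples from several chunks in a small in-memory buffer to get closer to a global shuffle.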