🕘 Meeting notes: ML Working Group

Jul 1, 2019


Attendees

@Noah B
Jim Bednar (Anaconda) @jbednar
Stephan Rasp
Tom Augspurger
David

Agenda

  • What have we been up to in the last month?
    • Noah:
      • New job at Vulcan
      • Reworked the pre-processing pipeline to use zarr; each chunk is one sample
      • Wrote a PyTorch loader
    • Jim:
      • pyviz-examples infrastructure
      • Writing new ML examples next
    • Stephan:
      • Back to the Lorenz model; online learning
      • Tried Binder for the first time (mybinder.org)
    • Tom:
    • David:
      • Some difficulties
      • First: hurricane intensity dataset (~4 TB raw). Hurricane-centered fields (winds, temperature, dew point) at multiple levels; 50,000-75,000 timesteps.
      • Suggestion: use one big zarr store
      • Wants to train a DL model using the full dataset
      • Distributed training is slow (I/O bottleneck). Seems to affect both NetCDF and Zarr; metadata reading is slow (milliseconds per sample?)
      • Using Horovod for distributed training. Works well, but the CNN error isn't decreasing during training.
      • A smaller 64 x 64 grid isn't an I/O problem; it depends on size
      • Second: data-parallel partial dependence plots (using dask)
      • Ran into a bottleneck with TF: hard to distribute TF models across processes
      • Suggestions:
        • Check whether dask workers can be started without a nanny
        • Try --nthreads=1 for each dask worker
        • threads_per_worker, n_workers, and nanny options for LocalCluster
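The suggestions above can be sketched together: a `LocalCluster` configured with one thread per worker (the `--nthreads=1` flag on the `dask-worker` CLI corresponds to `threads_per_worker=1` here), driving a simple data-parallel partial dependence computation. The `partial_dependence` helper and `LinearModel` stand-in are hypothetical illustrations, not the code discussed in the meeting.

```python
import numpy as np
from dask.distributed import Client, LocalCluster

def partial_dependence(model, X, feature, grid, client):
    """Average model prediction as one feature is swept over `grid`."""
    def pd_at(value):
        Xv = X.copy()
        Xv[:, feature] = value
        return float(model.predict(Xv).mean())
    # one task per grid point, scheduled across the workers
    futures = client.map(pd_at, list(grid))
    return np.array(client.gather(futures))

class LinearModel:
    """Stand-in for a fitted model; predicts 2 * x0."""
    def predict(self, X):
        return X @ np.array([2.0, 0.0])

# one thread per worker, as suggested; processes=False keeps this example
# lightweight, while a real run would use separate processes (with a nanny)
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

X = np.random.default_rng(0).normal(size=(100, 2))
pd = partial_dependence(LinearModel(), X, feature=0,
                        grid=[0.0, 1.0, 2.0], client=client)

client.close()
cluster.close()
```

For the TF-distribution problem, one process per worker (rather than threads) is usually what matters, since TensorFlow sessions do not share well across threads.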
  • Cloud environments?
    • Gotchas for Docker + deep learning:
      • Use the NVIDIA Docker images
      • k8s: install the NVIDIA CUDA driver (see the GCP documentation)
    • Issues with the firewall at NOAA
  • Blog post
  • Next meeting time?
    • August 5, 9 AM

Action items

  • Still: write blog post @Noah B @Tom