🕘 Meeting notes: ML Working Group

Jul 1, 2019


Attendees

@Noah B
Jim Bednar (Anaconda) @jbednar
Stephan Rasp
Tom Augspurger
David

Agenda

  • What have we been up to in the last month?
    • Noah:
      • New job at Vulcan
      • Reworked the pre-processing pipeline to use zarr; each chunk is one sample
      • Wrote a PyTorch loader
    • Jim:
      • pyviz-examples infrastructure
      • Writing new ML examples next
    • Stephan:
      • Back to the Lorenz model; online learning
      • Tried Binder for the first time (mybinder.org)
    • Tom:
    • David:
      • Some difficulties
      • First: hurricane intensity dataset (~4 TB raw). Hurricane-centered fields (winds, temperature, dew point) at multiple levels; 50,000-75,000 timesteps.
      • Suggestion: use one big zarr store
      • Wants to train a DL model using the full dataset
      • Distributed training is slow (I/O bottleneck). Seems to affect both NetCDF and Zarr; metadata reading is slow (milliseconds per sample?)
      • Using Horovod for distributed training. Works well, but the CNN error isn't decreasing during training.
      • A smaller 64 x 64 grid isn't an I/O problem; it depends on size
      • Second: data-parallel partial dependence plots (using dask)
      • Ran into a bottleneck with TF: hard to distribute TF models across processes
      • Suggestions:
        • Check whether dask workers can be started without a nanny
        • Try --nthreads=1 for each dask worker
        • threads_per_worker, n_workers, and nanny options for LocalCluster
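The suggestions above can be sketched together: a `LocalCluster` configured with one thread per worker (the `--nthreads=1` flag on the `dask-worker` CLI corresponds to `threads_per_worker=1` here), driving a simple data-parallel partial dependence computation. The `partial_dependence` helper and `LinearModel` stand-in are hypothetical illustrations, not the code discussed in the meeting.

```python
import numpy as np
from dask.distributed import Client, LocalCluster

def partial_dependence(model, X, feature, grid, client):
    """Average model prediction as one feature is swept over `grid`."""
    def pd_at(value):
        Xv = X.copy()
        Xv[:, feature] = value
        return float(model.predict(Xv).mean())
    # one task per grid point, scheduled across the workers
    futures = client.map(pd_at, list(grid))
    return np.array(client.gather(futures))

class LinearModel:
    """Stand-in for a fitted model; predicts 2 * x0."""
    def predict(self, X):
        return X @ np.array([2.0, 0.0])

# one thread per worker, as suggested; processes=False keeps this example
# lightweight, while a real run would use separate processes (with a nanny)
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

X = np.random.default_rng(0).normal(size=(100, 2))
pd = partial_dependence(LinearModel(), X, feature=0,
                        grid=[0.0, 1.0, 2.0], client=client)

client.close()
cluster.close()
```

For the TF-distribution problem, one process per worker (rather than threads) is usually what matters, since TensorFlow sessions do not share well across threads.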
  • Cloud environments?
    • Gotchas for Docker + deep learning:
      • Use the NVIDIA Docker images
      • k8s: install the NVIDIA CUDA driver (see the GCP documentation)
    • Issues with the firewall at NOAA
  • Blog post
  • Next meeting time?
    • August 5, 9 AM

Action items

  • Still: write blog post @Noah B @Tom