Matplotlib 4.0

THIS IS STILL A WORK IN PROGRESS

Roadmap

  1. embrace structured data (without depending on an implementation)
  1. enable / encourage / nurture / harmonize third-party domain-specific extensions
  1. carry on with incremental improvements / bug fixes / performance improvements

Overview

A

Matplotlib is the de facto standard plotting library in the scientific Python ecosystem.  Over the last 16 years the library has grown organically to cover a wide range of use cases.  Over the past decade there have also been significant advances in data structures and data access that Matplotlib does not currently exploit.  We propose building on the established code base in three ways: codifying conventions for Matplotlib extensions, so that domain-specific plotting tools compose well with core Matplotlib and with one another; unifying the internal data abstractions, both to better support modern structured data and to ease development of responsive, interactive, and streaming plots; and unifying the way we internally encode plot properties, to enable exporting to other plotting tools such as JavaScript libraries or OpenGL.  With this work Matplotlib will continue to be an integral component of the SciPy stack for the next 15 years.

B


Matplotlib is a mature, widely used, and highly impactful library, used across a wide range of science [LIGO, EVHT, NSLS-II, CellProfiler, cartopy, XX% of arXiv] and industry [used at Google, Bloomberg, MSFT, …].  It has developed organically over the last 16 years; however, to remain impactful for the next 16 years we need to adapt.

One of the biggest changes since Matplotlib started is the rise of structured and streaming data.  Although Matplotlib has some support [data kwarg], the fundamental primitives of the API are fixed-size array-like data structures that primarily hold numeric data.  The need for structured-data-aware primitives is shown by data structures that ship their own built-in plotting (xarray and pandas) and by third-party libraries (e.g., seaborn) that take structured data as their primitive inputs.

The end goal is to make it easy to build highly tuned, domain-specific plotting tools.

Architecture 


The architecture of Matplotlib [AAOSP link] can be thought of as having three layers:
  1. The user-facing API (either pyplot or the OO API)
  2. The Artist representation
  3. The backends
Roughly speaking, users and library authors use (1) to tell Matplotlib what data they have and how they want it plotted, which generates the Python objects of (2).  To render the figure, either to a GUI or to a file, the Artists from (2) are passed to layer (3), which uses them to produce the final output.
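
The three layers can be seen in a few lines of current Matplotlib.  This is only an illustrative sketch: the data values and the choice of the Agg backend are arbitrary, and the comments map each step onto the layer numbering above.

```python
import io

import matplotlib

matplotlib.use("Agg")  # pick a non-interactive backend for layer 3
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

# Layer 1: the user-facing API describes the data and how to plot it.
fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4], label="squares")

# Layer 2: the call above created Artist objects that now hold the data.
is_artist = isinstance(line, Artist)
stored_y = [float(v) for v in line.get_ydata()]

# Layer 3: the backend walks the Artists and renders them, here to PNG bytes.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

Note that the y values end up stored on the Line2D Artist itself, which is the coupling between data and Artists that the Data Model section discusses.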

We propose:
  • adding a new Data layer to abstract over data storage and access
  • extending the Artist layer to have richer semantic artists
  • extending the backend layer to provide an export entrypoint
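
As a rough illustration of what an export entrypoint could look like, the sketch below describes a line in a backend-neutral way and serialises it to JSON for another tool to consume.  Everything here is hypothetical: LineSpec and export_spec are names invented for this example, not existing or planned Matplotlib API, and JSON is just one possible target format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LineSpec:
    """Hypothetical backend-neutral description of a line (not a Matplotlib class)."""

    x: list
    y: list
    color: str = "C0"
    linewidth: float = 1.5


def export_spec(artists):
    """Hypothetical export entry point: serialise artist descriptions to JSON."""
    return json.dumps(
        [{"type": type(a).__name__, **asdict(a)} for a in artists],
        sort_keys=True,
    )


# The resulting string could be handed to a JavaScript or OpenGL front end.
spec = export_spec([LineSpec(x=[0, 1], y=[0, 1], color="tab:red")])
```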

Data Model (data structures)

Currently, when users pass in data it is (sometimes) transformed and then stored across one or more Artists and NumPy array-likes.  While this is easy to implement and has been very successful, it has several drawbacks:
  • each Artist stores the data in a slightly different place
  • common processing is done in many places (e.g., unit handling, masked data, scalar / vector processing)
  • if multiple Artists are involved, their data can become decoupled
  • the raw data is not kept, so it cannot be re-processed (e.g., hist, contour)
To address these drawbacks, we propose developing a Matplotlib data layer that handles these details.
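
A minimal sketch of the idea, assuming a container that keeps the raw values and does the common cleaning in one place.  SharedData and HistogramView are hypothetical names used only for illustration, and HistogramView stands in for an Artist such as the one produced by hist.

```python
from dataclasses import dataclass


@dataclass
class SharedData:
    """Hypothetical container: keeps raw values and cleans them in one place."""

    raw: list

    def valid(self):
        """Return the raw values with missing entries (None) dropped."""
        return [v for v in self.raw if v is not None]


@dataclass
class HistogramView:
    """Hypothetical stand-in for an Artist that derives bin counts from shared data."""

    data: SharedData
    bins: int

    def counts(self):
        values = self.data.valid()
        lo, hi = min(values), max(values)
        # Fall back to a unit width when all values are identical.
        width = (hi - lo) / self.bins or 1.0
        counts = [0] * self.bins
        for v in values:
            # Clamp so the maximum value falls in the last bin.
            index = min(int((v - lo) / width), self.bins - 1)
            counts[index] += 1
        return counts


data = SharedData(raw=[1.0, 2.0, None, 2.5, 4.0])
coarse = HistogramView(data, bins=2)
fine = HistogramView(data, bins=3)
coarse_counts = coarse.counts()
fine_counts = fine.counts()

# Because the raw data is kept, re-binning does not need the user to pass it again.
coarse.bins = 4
rebinned_counts = coarse.counts()
```

Both views read from the same object, so they cannot drift apart, and the handling of missing values lives in exactly one method.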

Requirements

  • be able to be shared across Artists
  • handle units
  • handle smart down/upsampling of data
  • handle updates to data
  • handle streaming data
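
The update and streaming requirements could be met with a simple notification scheme, sketched below.  StreamingSeries and its methods are hypothetical, chosen for this example only; a real design would also need to address units and resampling.

```python
from collections import deque


class StreamingSeries:
    """Hypothetical data-layer object that notifies listeners when data arrives."""

    def __init__(self, max_points=None):
        # A bounded deque discards the oldest points once max_points is reached.
        self._values = deque(maxlen=max_points)
        self._listeners = []

    def subscribe(self, callback):
        """Register a callable that receives the current values after each update."""
        self._listeners.append(callback)

    def append(self, value):
        self._values.append(value)
        for callback in self._listeners:
            # Each listener gets its own copy so it cannot mutate shared state.
            callback(list(self._values))


series = StreamingSeries(max_points=3)
latest = {}
# Two stand-ins for Artists that share the same series.
series.subscribe(lambda values: latest.update(line=values))
series.subscribe(lambda values: latest.update(scatter=values))
for sample in [1, 2, 3, 4]:
    series.append(sample)
```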

Use cases

  • data-shader style binning for large data sets
  • ‘smart’ resampling based on data limits of lines
  • native support for non-numeric datatypes (like pandas extension types)
  • robust support of units
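
To make the resampling use case concrete, here is one possible approach, min/max decimation restricted to the visible x range.  This is a sketch under stated assumptions, not a proposed implementation: resample_for_view is a hypothetical helper, and it assumes xs is sorted in ascending order.

```python
def resample_for_view(xs, ys, xmin, xmax, n_buckets):
    """Keep only the min and max y of each bucket inside [xmin, xmax].

    Hypothetical helper; assumes xs is sorted in ascending order.
    """
    visible = [(x, y) for x, y in zip(xs, ys) if xmin <= x <= xmax]
    if len(visible) <= 2 * n_buckets:
        # Few enough points that nothing needs to be dropped.
        return visible
    size = -(-len(visible) // n_buckets)  # ceiling division
    out = []
    for start in range(0, len(visible), size):
        chunk = visible[start:start + size]
        low = min(chunk, key=lambda p: p[1])
        high = max(chunk, key=lambda p: p[1])
        # Emit the extremes in x order so the line is still drawn left to right.
        out.extend(sorted({low, high}))
    return out


xs = list(range(100))
ys = [x % 10 for x in xs]  # a sawtooth signal
zoomed_out = resample_for_view(xs, ys, 20, 59, n_buckets=4)
zoomed_in = resample_for_view(xs, ys, 20, 24, n_buckets=4)
```

When the view limits are wide, only the extremes of each bucket are kept; once the user zooms in far enough, every visible point is returned unchanged.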