Partitions

Design Decisions

  • How are partitions represented to solids?
      • via the config / environment dict, as configured by the user
      • via a context resource
  • How do we group runs for the same partition?
      • tags
  • How do we group runs across partitions for the same pipeline?
      • tags
  • Where do we need to specify the set of possible partitions?
      • pipeline definition vs. schedule definition vs. standalone
      • We should shoot for standalone: get the schedule definition right first, then work on hacking around the dagit UI.
  • How do we specify execution of a partition?
      • tags?
      • presets?
  • Where do we do partition selection?
      • explicit selector on schedule definitions
      • presets
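A minimal sketch of the tag-based grouping idea above. The tag keys (`dagster/partition`, `dagster/partition_set`) and the run shape are assumptions for illustration, not a settled API: one tag groups runs for the same partition, the other groups runs across partitions of the same pipeline.

```python
from collections import defaultdict

# Hypothetical tag keys; the real names are an open design question.
PARTITION_TAG = "dagster/partition"
PARTITION_SET_TAG = "dagster/partition_set"

def group_runs_by_partition(runs):
    """Group run ids by the partition tag attached to each run."""
    grouped = defaultdict(list)
    for run in runs:
        partition = run["tags"].get(PARTITION_TAG)
        if partition is not None:
            grouped[partition].append(run["run_id"])
    return dict(grouped)

# Stub runs standing in for stored pipeline runs.
runs = [
    {"run_id": "a", "tags": {PARTITION_TAG: "2019-01-01", PARTITION_SET_TAG: "daily"}},
    {"run_id": "b", "tags": {PARTITION_TAG: "2019-01-01", PARTITION_SET_TAG: "daily"}},
    {"run_id": "c", "tags": {PARTITION_TAG: "2019-01-02", PARTITION_SET_TAG: "daily"}},
]
```

Grouping across partitions for the same pipeline would work the same way, keyed on the partition-set tag instead.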

TODO

  • Figure out how to resolve the config / environment dict based on the partition
      • Look at presets?
  • Figure out the execution API for a partition
      • For now, using tags
  • Figure out whether to allow non-partitioned jobs for partitioned pipelines, and where partition selection should happen in that case

Requirements

Partition requirements:
  • Support time-based partitions (variable-sized)
  • Support fixed partitions (ML-style)
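To make the two partition requirements concrete, here is a hedged sketch: a time-based scheme where partitions are calendar months (and so vary in size), next to a fixed ML-style set that is just enumerated. The function and constant names are illustrative, not proposed API.

```python
from datetime import date

def monthly_partitions(start: date, end: date):
    """Time-based partitions: one per calendar month, so sizes vary (28-31 days).

    Returns a list of (inclusive_start, exclusive_end) date pairs.
    """
    partitions = []
    current = start.replace(day=1)
    while current < end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        partitions.append((current, min(nxt, end)))
        current = nxt
    return partitions

# Fixed partitions (ML-style): a static, enumerated set.
DATASET_PARTITIONS = ["train", "validation", "test"]
```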

Execution requirements:
  • Support execution of a partition through dagit
  • Support execution of a partition through a scheduler
  • Support execution of a batched set of partitions (backfill)
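The backfill requirement reduces to "launch one run per partition in a selected subset." A minimal sketch, assuming a `launch_run` callable that stands in for whichever execution API (dagit, scheduler, or standalone) we settle on:

```python
def backfill(partitions, selected, launch_run):
    """Launch one run per selected partition; return the launched run ids.

    `launch_run` is a stand-in for the real execution API, which is still
    an open design question in this doc.
    """
    launched = []
    for partition in partitions:
        if partition in selected:
            launched.append(launch_run(partition))
    return launched

# Usage with a stub launcher that just records the partition it was given.
launched = backfill(
    partitions=["2019-01-01", "2019-01-02", "2019-01-03"],
    selected={"2019-01-01", "2019-01-02"},
    launch_run=lambda p: f"run-for-{p}",
)
```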

Run requirements:
  • Accept config that designates which partition the run targets

UI requirements:
  • View job/run status by partition

Approaches

Pipeline partitions

Add partition definition function on a PipelineDefinition
  • Pros:
      • Natural, since the pipeline author already knows how to partition the data
  • Cons:
      • Risk of overfitting by making Partition too prominent in the core API
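A sketch of what this approach might look like. The `PipelineDefinition` class here is a toy stand-in and the `partition_fn` parameter is an assumption; the point is only to show where the hook would live under this approach.

```python
# Hypothetical: the pipeline itself carries the partition definition function.
class PipelineDefinition:
    def __init__(self, name, partition_fn=None):
        self.name = name
        self._partition_fn = partition_fn  # assumed parameter, not settled API

    def get_partitions(self):
        """Return the pipeline's partitions, or [] if it is unpartitioned."""
        if self._partition_fn is None:
            return []
        return self._partition_fn()

daily_logs = PipelineDefinition(
    name="daily_logs",
    partition_fn=lambda: ["2019-01-01", "2019-01-02"],
)
```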

ScheduleType

Put the partition definition function on a ScheduleDefinition
  • Pros: