Partitions

Design Decisions

  • How are partitions represented to solids?
      • via the config / environment dict, as configured by the user
      • via a context resource
  • How do we group runs for the same partition?
      • tags
  • How do we group runs across partitions for the same pipeline?
      • tags
  • Where do we need to specify the set of possible partitions?
      • pipeline definition vs. schedule definition vs. standalone
      • We should shoot for standalone: get the schedule definition right first, then work on hacking around the dagit UI.
  • How do we specify execution of a partition?
      • tags?
      • presets?
  • Where do we do partition selection?
      • explicit selector on schedule definitions
      • presets
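A minimal sketch of the tag-based grouping idea above. The tag keys (`dagster/partition`, `dagster/partition_set`) and the run shape are assumptions for illustration, not a settled API: one tag groups runs for the same partition, the other groups runs across partitions of the same pipeline.

```python
from collections import defaultdict

# Hypothetical tag keys; the real names are an open design question.
PARTITION_TAG = "dagster/partition"
PARTITION_SET_TAG = "dagster/partition_set"

def group_runs_by_partition(runs):
    """Group run ids by the partition tag attached to each run."""
    grouped = defaultdict(list)
    for run in runs:
        partition = run["tags"].get(PARTITION_TAG)
        if partition is not None:
            grouped[partition].append(run["run_id"])
    return dict(grouped)

# Stub runs standing in for stored pipeline runs.
runs = [
    {"run_id": "a", "tags": {PARTITION_TAG: "2019-01-01", PARTITION_SET_TAG: "daily"}},
    {"run_id": "b", "tags": {PARTITION_TAG: "2019-01-01", PARTITION_SET_TAG: "daily"}},
    {"run_id": "c", "tags": {PARTITION_TAG: "2019-01-02", PARTITION_SET_TAG: "daily"}},
]
```

Grouping across partitions for the same pipeline would work the same way, keyed on the partition-set tag instead.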

TODO

  • Figure out how to resolve the config / environment dict based on the partition
      • Look at presets?
  • Figure out the execution API for a partition
      • For now, using tags
  • Figure out whether to allow non-partitioned jobs for partitioned pipelines, and where partition selection should happen in that case

Requirements

Partition requirements:
  • Support time-based partitions (variable-sized)
  • Support fixed partitions (ML-style)
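To make the two partition requirements concrete, here is a hedged sketch: a time-based scheme where partitions are calendar months (and so vary in size), next to a fixed ML-style set that is just enumerated. The function and constant names are illustrative, not proposed API.

```python
from datetime import date

def monthly_partitions(start: date, end: date):
    """Time-based partitions: one per calendar month, so sizes vary (28-31 days).

    Returns a list of (inclusive_start, exclusive_end) date pairs.
    """
    partitions = []
    current = start.replace(day=1)
    while current < end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        partitions.append((current, min(nxt, end)))
        current = nxt
    return partitions

# Fixed partitions (ML-style): a static, enumerated set.
DATASET_PARTITIONS = ["train", "validation", "test"]
```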

Execution requirements:
  • Support execution of a partition through dagit
  • Support execution of a partition through a scheduler
  • Support execution of a batched set of partitions (backfill)
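The backfill requirement reduces to "launch one run per partition in a selected subset." A minimal sketch, assuming a `launch_run` callable that stands in for whichever execution API (dagit, scheduler, or standalone) we settle on:

```python
def backfill(partitions, selected, launch_run):
    """Launch one run per selected partition; return the launched run ids.

    `launch_run` is a stand-in for the real execution API, which is still
    an open design question in this doc.
    """
    launched = []
    for partition in partitions:
        if partition in selected:
            launched.append(launch_run(partition))
    return launched

# Usage with a stub launcher that just records the partition it was given.
launched = backfill(
    partitions=["2019-01-01", "2019-01-02", "2019-01-03"],
    selected={"2019-01-01", "2019-01-02"},
    launch_run=lambda p: f"run-for-{p}",
)
```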

Run requirements:
  • Accept config that designates which partition the run targets

UI requirements:
  • View job/run status by partition

Approaches

Pipeline partitions

Add partition definition function on a PipelineDefinition
  • Pros:
      • Natural, since the pipeline author already knows how to partition the data
  • Cons:
      • Risk of overfitting by making Partition too prominent in the core API
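A sketch of what this approach might look like. The `PipelineDefinition` class here is a toy stand-in and the `partition_fn` parameter is an assumption; the point is only to show where the hook would live under this approach.

```python
# Hypothetical: the pipeline itself carries the partition definition function.
class PipelineDefinition:
    def __init__(self, name, partition_fn=None):
        self.name = name
        self._partition_fn = partition_fn  # assumed parameter, not settled API

    def get_partitions(self):
        """Return the pipeline's partitions, or [] if it is unpartitioned."""
        if self._partition_fn is None:
            return []
        return self._partition_fn()

daily_logs = PipelineDefinition(
    name="daily_logs",
    partition_fn=lambda: ["2019-01-01", "2019-01-02"],
)
```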

ScheduleType

Put the partition definition function on a ScheduleDefinition
  • Pros: