Storm v3 design: Second iteration
After experimenting a bit, I realized that some of Storm v2’s features may not be a good fit for the next version, and that I should rethink whether to keep them or not.
Codecs are one of them. Let’s experiment with a design without them.

Writes

To write something in a bucket, we need a Schema. It describes all the fields and their respective types for a specific bucket. Schemas are not tied to a specific data structure: they can be generated from a struct and used with a map, and the other way around.

Using a Schema, we can reason in terms of tables and store each field separately.

Since the Schema is typed, we can encode each field using the right method:
  • string → []byte(s)
  • int64 → binary.PutVarint
  • etc.

type Schema map[string]*Field

type Field struct {
    Name string
    Type FieldType // e.g. an enum of all the supported types?
}
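
To make this more concrete, here is a minimal sketch under a few assumptions: a FieldType that only knows strings and int64s, a schemaOf helper that derives a Schema from a struct, and an encodeValue helper that applies the per-type encoding listed above. All of these names are placeholders, not part of the design.

import (
    "encoding/binary"
    "fmt"
    "reflect"
)

// FieldType enumerates the supported types (assumed set).
type FieldType int

const (
    String FieldType = iota
    Int64
)

// schemaOf derives a Schema from a struct using reflection.
// Reflection happens only here, at schema-build time.
func schemaOf(v interface{}) (Schema, error) {
    t := reflect.TypeOf(v)
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }
    if t.Kind() != reflect.Struct {
        return nil, fmt.Errorf("expected a struct, got %s", t.Kind())
    }
    s := make(Schema)
    for i := 0; i < t.NumField(); i++ {
        f := t.Field(i)
        var ft FieldType
        switch f.Type.Kind() {
        case reflect.String:
            ft = String
        case reflect.Int64:
            ft = Int64
        default:
            return nil, fmt.Errorf("unsupported field type %s", f.Type)
        }
        s[f.Name] = &Field{Name: f.Name, Type: ft}
    }
    return s, nil
}

// encodeValue encodes a single value using the method matching its type.
func encodeValue(t FieldType, v interface{}) ([]byte, error) {
    switch t {
    case String:
        return []byte(v.(string)), nil
    case Int64:
        buf := make([]byte, binary.MaxVarintLen64)
        n := binary.PutVarint(buf, v.(int64))
        return buf[:n], nil
    default:
        return nil, fmt.Errorf("unsupported field type %d", t)
    }
}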

The problem is that a bucket has one dimension, while a table has two.
Using one bucket per “column” would add too much overhead, though; what’s missing is the concept of a row.

Using Bolt’s NextSequence feature (see the sketch below), we can generate row IDs to:
  • virtually add a new dimension within a bucket
  • uniquely identify each row
  • group fields that belong to the same row
  • ensure all the fields are contiguous within a bucket

key: rowID + '-' + fieldName
value: byte representation of the field value

// example (pseudo data)
1-Name: "john"
1-Age: 2
2-Name: "jack"
2-Age: 29
...
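
Here is a rough sketch of that write path on top of the Bolt API (writeRow and its signature are assumptions; go.etcd.io/bbolt stands in for the Bolt import, and the values would come from something like encodeValue above):

import (
    "fmt"

    bolt "go.etcd.io/bbolt"
)

// writeRow stores one row as a set of keys sharing the same rowID
// prefix, all inside a single bucket.
func writeRow(tx *bolt.Tx, bucket string, fields map[string][]byte) (uint64, error) {
    b, err := tx.CreateBucketIfNotExists([]byte(bucket))
    if err != nil {
        return 0, err
    }
    // NextSequence is atomic within the write transaction.
    rowID, err := b.NextSequence()
    if err != nil {
        return 0, err
    }
    for name, value := range fields {
        key := fmt.Sprintf("%d-%s", rowID, name)
        if err := b.Put([]byte(key), value); err != nil {
            return 0, err
        }
    }
    return rowID, nil
}

Note that with decimal row IDs, the fields of a row stay contiguous (they share the same prefix), even though rows themselves are sorted lexicographically rather than numerically.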

Reads

That’s where all the benefit of reasoning with tables lies.
Decoupling the data from the data structure (struct or map) allows us to avoid reflection in the low-level parts of the code.
But the most interesting part comes next: we can now create a read pipeline that transforms “tables” into other in-memory tables.

// Table is a stream of rows that share the same Schema.
type Table interface {
    // Next returns the next row, or an error once the table is exhausted.
    Next() (Row, error)
    // Schema describes the fields of the rows returned by Next.
    Schema() (*Schema, error)
}
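
As an illustration, here is a sketch of one such pipeline stage, a filtering transform (Row, filterTable, and Filter are assumed names; the design above only defines Table):

// Row is an assumed representation of a single decoded row.
type Row map[string][]byte

// filterTable wraps a source Table and only yields the rows
// matching a predicate.
type filterTable struct {
    src  Table
    keep func(Row) bool
}

func Filter(src Table, keep func(Row) bool) Table {
    return &filterTable{src: src, keep: keep}
}

func (f *filterTable) Next() (Row, error) {
    for {
        row, err := f.src.Next()
        if err != nil {
            return nil, err // e.g. io.EOF once the source is exhausted
        }
        if f.keep(row) {
            return row, nil
        }
    }
}

// Filtering drops rows, not fields, so the Schema is unchanged.
func (f *filterTable) Schema() (*Schema, error) {
    return f.src.Schema()
}

Since Filter returns a Table, stages compose naturally: the output of one transform can feed the next one.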