GridTools Extents and Horizontal Ifs
When GridTools C++ (the code generated by gt4py) executes a statement like this (ok, technically the code mapped to from this gt4py statement):

tmp = in_field[1, 0, 0]

with tmp a temporary field and in_field an input “API” field, a number of things are happening under the hood…

Execution policy

This is how threads are mapped to the IJ points in the plane. It usually looks like a sub-divided cartesian plane into smaller “blocks”, e.g. 4x4 for a 16-core processor. Here is a compute domain that is 8x8 with 4x4 blocks.
We do not know what size GridTools will use, and in fact it will never expose that implementation detail because it is not even a constant.

Extents

There are three types of “extents” that are related, but different:
  1. Access offset: This is what fields are accessed using in the frontend e.g. [1, 0, 0] implies that it accesses one element forward in the I-direction.
  1. Stage: This is the basic compute offset for the entire block group. If tmp = in_field[1, 0, 0] has zero stage extent by itself. However, if another statement would be added after out_field = tmp[1, 0, 0] then the first stage is “extended” so that there is not an uninitialized read into the tmp field.
  1. Execution block: This is not something we have dealt with directly before, but it’s an important concept for filling temporaries across “blocks”. For tmp = in_field[1, 0, 0] the execution extent is zero, since tmp is not read with an offset. Said differently: there is no accumulated extent.

Temporary allocation

Temporary are allocated per-block. That means since there are no execution extents in the example above, that tmp is (by default without extents) a 4x4xK piece of memory for each block.

Multi-statement example with an extent


Consider another case:
tmp = in_field            # execution extent = stage extent = [1, 0, 0]
out_field = tmp[1, 0, 0]  # execution extent = stage extent = [0, 0, 0]

First let’s pretend there is no execution extent. If there was no concept of an execution extent, how would the last I-thread in each block get the value of tmp? Answer: It would read out of bounds or read a zero. That is why the execution extent is so important here: it tells GridTools that the previous statement setting tmp needs to be extended by one so that it has the memory to read later.

The execution extents are shown in the diagram below on top of the origin diagram above.
What happened here? the execution extent carried forward on all blocks implied a “stage extent” on the previous statement, since the last “block” in each row has an extent over the end of the natural compute domain in the I-direction. This is the relation between execution extents and stage extents: the former implies the latter when statements are executed everywhere.

Relation to Horizontal if’s

Let’s add a horizontal if. I’m going to use the technically not allowed gt4py syntax here because it makes the example shorter and easier to understand (in general we should use a function for this):
tmp = func(in_field) # computed with extent [1, 0, 0]
with horizontal(region[1, 3]):
  tmp = tmp[0, 1, 0] # extent = [1, 0, 0] no need to extend stage
out = tmp[1, 0, 0]  # extent = [0, 0, 0] tmp needs to be computed on [1, 0, 0]
So at the blue dot we’re writing  tmp from the red dot.
If there is no execution extent on the first line tmp = func(in_field) then there is no memory to read at the red dot from the SW block.

As a result, we add a execution extent
The most important takeaway from all this is that execution extents are applied uniformly on all blocks in the API that GridTools provides to GT4Py. There is no way to apply only to the first block, because GridTools does not have a concept of execution extent location or low-level block size details (which we would need to only apply the execution extent on the SW block).

As a result, what we are currently doing is saying this execution extent is zero, since the access does not extend the compute domain. That is wrong however because it does not trigger the extension on the block for temporaries. This is why we’ve been able to get by sometimes so far: if only API fields are involved, there is no synchronization issue.

So, let’s assume we’ve applied the execution extent to all blocks which implied a stage extent. Oops! Now we can no longer run our stencil over the same stencil compute domain, because there is a J-extent on field, from 

tmp = func(field)

so the incorrect stage extent (from the execution extent on tmp) is transferred to field. We’d have to run it on (8, 8, K) - (0, 1, 0) size (assuming origin=(0, 0, 0))

This behavior breaks other code/the gt4py model, so is unacceptable.

tmp = func(in_field)  # execution extent = [1, 1, 0] (SW block), stage extent = [0, 0, 0]
with horizontal(region[1, 3]):
  tmp = tmp[0, 1, 0]  # execution extent = [0, 0, 0], stage extent = [0, 0, 0]
out = tmp[1, 0, 0] # execution extent = [0, 0, 0], stage extent = [0, 0, 0]

tmp = func(in_field) # computed with extent [(0, 1), (-1, 0), (0, 0)]
with horizontal(region[0, 0]):
  tmp = tmp[0, -1, 0] # extent = [1, 0, 0] no need to extend stage
out = tmp[1, 0, 0]  # extent = [0, 0, 0] tmp needs to be computed on [1, 0, 0]

Possible solutions

”locate” the execution extent to a point, and have GridTools only extend the computation on that block. Something like this:

Now the stencil compute domain is intact.

Unfortunately there’s literally no way that with the existing GridTools implementation we can express this. The paths forward we have are:
  1. Ask the GridTools team to implement extensions to their library/DSL that the backends write to to have a special set of extents that are either:
  1. located at a compute point
  1. min limiter: The extent we have here is inside the domain, so it should only apply to the execution extent, not the implied stage extent at the end of the compute domain (said differently, the NW block should not extend as well to the North).
  1. Implement this hook at a lower level in one of our own backends (I recommend going for option (b) above in our custom backend).
  1. If we can control the block size, try to find a block size that never reads across a boundary.
  1. GridTools has temporaries across multistages, why not use this?

Non-solutions (possible paths forward)

  1. Roulette.
  1. Restrict use to API fields only.