Hi,

We are seeing an uneven distribution of shards across subtasks after
re-sharding. We believe our use case can be addressed by sorting the
shards and assigning them to subtasks by index, with some caveats.

The main problem is that the shard-to-subtask mapping would no longer be
deterministic, whereas the current hash-based solution is (but it causes
skew).
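
To make the trade-off concrete, here is a minimal sketch in Python. It is
not the operator's code: the function names, the shard IDs and the use of
CRC32 as the hash are all assumptions chosen for illustration. It shows
that a hash-based mapping depends only on the shard itself, while an
index-based mapping depends on the full set of shards visible at
assignment time.

```python
import zlib
from collections import Counter


def assign_by_hash(shard_ids, num_subtasks):
    """Hypothetical hash-based assignment: each shard is mapped on its own,
    so the result for a shard never depends on which other shards exist."""
    return {s: zlib.crc32(s.encode("utf-8")) % num_subtasks for s in shard_ids}


def assign_by_sorted_index(shard_ids, num_subtasks):
    """Hypothetical index-based assignment: sort the shards, then deal them
    out round-robin by position. Balanced, but position-dependent."""
    return {s: i % num_subtasks for i, s in enumerate(sorted(shard_ids))}


def load_spread(assignment, num_subtasks):
    """Difference between the most and least loaded subtask."""
    counts = Counter(assignment.values())
    loads = [counts.get(i, 0) for i in range(num_subtasks)]
    return max(loads) - min(loads)


if __name__ == "__main__":
    # Made-up shard sets before and after a re-shard.
    before = ["shard-0", "shard-1", "shard-2", "shard-3"]
    after = ["shard-0", "shard-2", "shard-3", "shard-4", "shard-5"]
    for name, assign in [("hash", assign_by_hash),
                         ("sorted index", assign_by_sorted_index)]:
        a, b = assign(before, 4), assign(after, 4)
        moved = [s for s in a if s in b and a[s] != b[s]]
        print(name, "spread:", load_spread(b, 4), "moved:", moved)
```

In this sketch the index-based spread is never more than one shard, but a
surviving shard can move to a different subtask whenever the shard set
changes, and two subtasks that see different shard sets would compute
different mappings.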

This trade-off may be difficult to overcome (for a generalized solution)
without centralizing the shard assignment, which would in turn require
something like side inputs.

Any opinions on this? Would it be acceptable to make changes to the
existing operator that make the shard assignment logic and hashing easier
to customize?

Thanks,
Thomas
