Hi,

We have noticed an uneven distribution of shards across subtasks after resharding. We believe our use case can be addressed by sorting the shards and assigning them to subtasks by index, with some caveats.
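
To make the idea concrete, here is a minimal sketch comparing the two strategies. It is not the operator's actual code: the function names, the `shard-0000` style IDs, and the use of CRC32 as the hash are all assumptions for illustration only.

```python
import zlib
from collections import Counter


def assign_by_hash(shard_ids, parallelism):
    # Stand-in for a hash-based assignment: each shard's subtask depends
    # only on its own ID. CRC32 is an assumption, not the operator's hash.
    return {s: zlib.crc32(s.encode("utf-8")) % parallelism for s in shard_ids}


def assign_by_index(shard_ids, parallelism):
    # Proposed alternative: sort the shards, then assign round-robin by
    # position in the sorted list.
    return {s: i % parallelism for i, s in enumerate(sorted(shard_ids))}


def max_skew(assignment, parallelism):
    # Difference between the most and least loaded subtask.
    counts = Counter(assignment.values())
    loads = [counts.get(i, 0) for i in range(parallelism)]
    return max(loads) - min(loads)


if __name__ == "__main__":
    shards = [f"shard-{i:04d}" for i in range(10)]  # hypothetical shard IDs
    print("hash skew: ", max_skew(assign_by_hash(shards, 4), 4))
    print("index skew:", max_skew(assign_by_index(shards, 4), 4))
    # Dropping one shard shifts the index-based assignment of the others.
    print(assign_by_index(shards, 4)["shard-0005"],
          assign_by_index(shards[1:], 4)["shard-0005"])
```

With index-based assignment the per-subtask load differs by at most one shard, but a shard's subtask depends on which other shards are in the list, so the result changes whenever the list does. With hash-based assignment each shard's subtask depends only on its own ID.
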
The main problem is that the shard-to-subtask mapping would no longer be deterministic, whereas the current hash-based solution is (though it causes skew). For a generalized solution, this trade-off may be difficult to overcome without centralizing the shard assignment, which would in turn require something like side inputs.

Any opinions on this? Would it be acceptable to change the existing operator so that the shard assignment logic and hashing are easier to customize?

Thanks,
Thomas