Hi Max, The way that Flink to assign key to which subtask is based on `KeyGroupRangeAssignment.assignKeyToParallelOperator`. Its first step is to assign key to a key group based on the max parallelism [2]. Then, assign each key group to a specific subtask based on the current parallelism [3].
The question that you asked is if the keyBy in Flink is deterministic. I think the answer is yes, but the problem is that assignment to key group is not just `obj.hashCode()`, but `murmurhash(obj.hashCode())` instead. If you can know the output from murmurhash on the each object, you can determine which subtask that operator will go to. I'm not sure if this is a good solution and I am also wondering if it can be fulfilled. Best Regards, Tony Wei [1] https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L47 [2] https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L58 [3] https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L115 2017-10-30 23:20 GMT+08:00 m@xi <makisnt...@gmail.com>: > Hi all, > > After trying to understand exactly how keyBy works internally, I did not > get > anything more than "it applies obj.hashcode() % n", where n is the number > of > tasks/processors. > > This post for example > https://stackoverflow.com/questions/45062061/why-is- > keyed-stream-on-a-keyby-creating-skewed-downstream-execution, > suggest to implement a KeySelector and write our own hashcode function. > Though none of the above is clear, especially the hashcode part. > > I am running a pc with 4 slots/processors and I would like to hash each > record based on a certain field to a specific processor. Ideally, lets say > that the 4 processors have ids: 0, 1, 2, 3. Then I would like to send the > tuples whose (key % 4) = 0 to the proc with id 0, (key % 4) = 1 to the > proc > with id 1 etc etc. > > I would like to know exactly to which processor/task each tuple goes. > Can I do that deterministically with keyBy in Flink?? > > Thanks in advance. > Max > > > > -- > Sent from: http://apache-flink-user-mailing-list-archive.2336050. > n4.nabble.com/ >