Hi Max,

Flink decides which subtask a key goes to in
`KeyGroupRangeAssignment.assignKeyToParallelOperator` [1].
It first assigns the key to a key group based on the max parallelism
[2], and then assigns each key group to a specific subtask based on the
current parallelism [3].
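
To make the two steps concrete, here is a minimal standalone sketch of how I
read the linked source. The class and method names are mine, not Flink's, and
the mixing function is passed in as a parameter because this sketch does not
reproduce Flink's murmur hash. The max parallelism of 128 and the multiplier
used in `main` are assumptions for illustration only.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the two-step assignment; not Flink's actual classes.
final class KeyGroupSketch {

    // Step 1: key -> key group. Depends only on the key and the max parallelism.
    // `mix` stands in for the murmur hash that Flink applies to hashCode().
    static int keyGroupFor(Object key, int maxParallelism, IntUnaryOperator mix) {
        // floorMod keeps the result in [0, maxParallelism) even if mix returns a negative value.
        return Math.floorMod(mix.applyAsInt(key.hashCode()), maxParallelism);
    }

    // Step 2: key group -> subtask index. Contiguous ranges of key groups
    // end up on the same subtask.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value
        int parallelism = 4;      // the four slots from the question
        // Stand-in mixing function (an assumption, NOT Flink's murmur hash).
        IntUnaryOperator mix = h -> h * 0x9E3779B9;

        for (int key = 0; key < 8; key++) {
            int keyGroup = keyGroupFor(key, maxParallelism, mix);
            int subtask = subtaskFor(keyGroup, maxParallelism, parallelism);
            System.out.println("key=" + key + " keyGroup=" + keyGroup + " subtask=" + subtask);
        }
    }
}
```

With any non-trivial mixing function the printed subtask will generally differ
from `key % 4`, which is why the naive "hashCode() % n" picture does not hold.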

You asked whether keyBy in Flink is deterministic. I think the answer is
yes, but the catch is that the key group is not derived from
`obj.hashCode()` alone, but from `murmurhash(obj.hashCode())`.
If you know the murmurhash output for each object, you can determine which
subtask that record will go to.
I'm not sure whether this is a good solution, and I also wonder whether it
can actually be achieved in practice.
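
One way this might be done, as a rough sketch only: search for one "surrogate"
key per subtask, then key each record by the surrogate for its desired
subtask. Everything named here (`KeyRemapSketch`, `surrogateKeys`, the search
limit) is hypothetical. The assignment function is passed in; in a real job
it could wrap
`KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism)`.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical helper; not part of Flink.
final class KeyRemapSketch {

    // Returns an array where surrogate[i] is an int key that the given
    // assignment function maps to subtask i.
    static int[] surrogateKeys(int parallelism, IntUnaryOperator keyToSubtask) {
        int[] surrogate = new int[parallelism];
        boolean[] found = new boolean[parallelism];
        int remaining = parallelism;
        int searchLimit = 1_000_000; // arbitrary bound so the loop always ends

        for (int candidate = 0; remaining > 0 && candidate < searchLimit; candidate++) {
            int subtask = keyToSubtask.applyAsInt(candidate);
            if (!found[subtask]) {
                found[subtask] = true;
                surrogate[subtask] = candidate;
                remaining--;
            }
        }
        if (remaining > 0) {
            throw new IllegalStateException("no surrogate key found for some subtask");
        }
        return surrogate;
    }
}
```

A KeySelector could then return `surrogate[(int) (field % 4)]`, so that records
with `field % 4 == i` land on subtask i. The table has to be recomputed
whenever the parallelism or max parallelism changes.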

Best Regards,
Tony Wei

[1]
https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L47
[2]
https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L58
[3]
https://github.com/apache/flink/blob/0182141d41be52ae0cf4a3563d4b8c6f3daca02e/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L115

2017-10-30 23:20 GMT+08:00 m@xi <makisnt...@gmail.com>:

> Hi all,
>
> After trying to understand exactly how keyBy works internally, I did not
> get anything more than "it applies obj.hashcode() % n", where n is the
> number of tasks/processors.
>
> This post, for example,
> https://stackoverflow.com/questions/45062061/why-is-keyed-stream-on-a-keyby-creating-skewed-downstream-execution
> suggests implementing a KeySelector and writing our own hashcode function.
> However, none of the above is clear, especially the hashcode part.
>
> I am running a PC with 4 slots/processors and I would like to hash each
> record based on a certain field to a specific processor. Ideally, let's say
> that the 4 processors have ids: 0, 1, 2, 3. Then I would like to send the
> tuples whose (key % 4) = 0 to the proc with id 0, (key % 4) = 1 to the
> proc with id 1, etc.
>
> I would like to know exactly to which processor/task each tuple goes.
> Can I do that deterministically with keyBy in Flink??
>
> Thanks in advance.
> Max
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
