[ 
https://issues.apache.org/jira/browse/FLINK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903423#comment-15903423
 ] 

ASF GitHub Bot commented on FLINK-5874:
---------------------------------------

GitHub user kl0u opened a pull request:

    https://github.com/apache/flink/pull/3501

    [FLINK-5874] Restrict key types in the DataStream API.

    Rejects a type from being a key in `DataStream.keyBy()` if:
    1. it is a POJO type that does not override `hashCode()` and
       relies on the default `Object.hashCode()` implementation, or
    2. it is an array of any type.
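
    For illustration only (this sketch is not part of the PR, and the class
    names are made up): a POJO that relies on `Object.hashCode()` is unsafe as
    a key because two field-equal instances hash differently, while a
    content-based `hashCode()` is deterministic:

    ```java
    import java.util.Objects;

    public class PojoKeyHash {
        // Hypothetical POJO that does NOT override hashCode():
        // Object.hashCode() is identity-based, so instances with equal
        // field values still (typically) produce different hashes.
        static class BadKey {
            int id;
            BadKey(int id) { this.id = id; }
        }

        // The same POJO with content-based equals()/hashCode().
        static class GoodKey {
            int id;
            GoodKey(int id) { this.id = id; }
            @Override public boolean equals(Object o) {
                return o instanceof GoodKey && ((GoodKey) o).id == id;
            }
            @Override public int hashCode() { return Objects.hash(id); }
        }

        public static void main(String[] args) {
            // Identity hash: almost always differs between distinct instances.
            System.out.println(new BadKey(7).hashCode() == new BadKey(7).hashCode());
            // Content hash: stable across instances (and across JVM processes).
            System.out.println(new GoodKey(7).hashCode() == new GoodKey(7).hashCode()); // true
        }
    }
    ```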
    
    This was also discussed with @fhueske 
    
    R @aljoscha 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kl0u/flink array-keys

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3501
    
----
commit bd00f9d37d5e3e1960b8bf1fff506293784b3039
Author: kl0u <kklou...@gmail.com>
Date:   2017-03-08T11:11:07Z

    [FLINK-5874] Restrict key types in the DataStream API.
    
    Reject a type from being a key in keyBy() if:
    1. it is a POJO type that does not override hashCode() and
       relies on the Object.hashCode() implementation, or
    2. it is an array of any type.

----


> Reject arrays as keys in DataStream API to avoid inconsistent hashing
> ---------------------------------------------------------------------
>
>                 Key: FLINK-5874
>                 URL: https://issues.apache.org/jira/browse/FLINK-5874
>             Project: Flink
>          Issue Type: Bug
>          Components: DataStream API
>    Affects Versions: 1.2.0, 1.1.4
>            Reporter: Robert Metzger
>            Assignee: Kostas Kloudas
>            Priority: Blocker
>
> This issue has been reported on the mailing list twice:
> - 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Previously-working-job-fails-on-Flink-1-2-0-td11741.html
> - 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Arrays-values-in-keyBy-td7530.html
> The problem is the following: we are using just Key[].hashCode() to compute 
> the hash when shuffling data. Java's default hashCode() implementation 
> doesn't take an array's contents into account, only its memory address.
> This leads to different hash codes on the sender and receiver side.
> In Flink 1.1 this means that the data is shuffled randomly instead of keyed, and 
> in Flink 1.2 the key-groups code detects a violation of the hashing.
> The proper fix for this problem would be to rely on Flink's {{TypeComparator}} 
> class, which has a type-specific hashing function. But introducing this 
> change would break compatibility with existing code.
> I'll file a JIRA for the 2.0 changes for that fix.
> For 1.2.1 and 1.3.0 we should at least reject arrays as keys.
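
As a plain-Java illustration of the inconsistency described above (an assumed
example, not Flink code): the default hashCode() of an array is identity-based,
while java.util.Arrays.hashCode() is computed from the contents and therefore
agrees on the sender and receiver side.

```java
import java.util.Arrays;

public class ArrayKeyHash {
    public static void main(String[] args) {
        int[] sender   = {1, 2, 3};
        int[] receiver = {1, 2, 3}; // same contents, e.g. after deserialization

        // Object.hashCode() reflects the object identity (memory address),
        // not the contents: the two arrays almost certainly hash differently.
        System.out.println(sender.hashCode() == receiver.hashCode());

        // A content-based hash is identical on both sides.
        System.out.println(Arrays.hashCode(sender) == Arrays.hashCode(receiver)); // true
    }
}
```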



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
