Github user s1ck commented on the pull request: https://github.com/apache/flink/pull/1075#issuecomment-136380943 There is already a test case for zipWithUniqueId() in https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/util/DataSetUtilsITCase.java#L66 However, this test is under the assumption that there is only one task running, which is why it did not fail in the first place. If there are multiple tasks, the resulting unique id is not deterministic for a single dataset element. I would implement a test, that creates a dataset, applies the `zipWithUniqueId` method, calls `distinct(0)` on the created ids and checks the number of resulting elements (must be equal to the input dataset). Would this be sufficient? Furthermore, the current test cases for `DataSetUtils` assume a resulting dataset as string and check this after each test run. My proposed test would not fit in that scheme. Should I create a new test case class for this method? @StephanEwen I wanted to do this, but static doesn't work with anonymous classes. However, I can declare the UDF as a private inner class (didn't want to change much code). @HuangWHWHW the `log2` method already existed and in the issue, I proposed to rename it. Maybe `getBitSize(long value)`? As for the "proof": if each task id is smaller than the total number of parallel tasks t, its bit representation is also smaller than the bit representation of t. Thus, when we shift the counter by the number of bits of t, there cannot be a collision for different task ids
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---