venkata91 commented on PR #26592: URL: https://github.com/apache/flink/pull/26592#issuecomment-2920638048
> > > > Thanks for the patch. It is a good catch.
> > > >
> > > > I understand it might be a little tricky to write a unit test, but we may be able to do an integration test. We can do the following:
> > > >
> > > > 1. Start a session cluster with normal-sized JM/TM (< 32 GB).
> > > > 2. When launching the job on the command line, use an additional JVM arg of `-XX:-UseCompressedOops` to force a 24-byte object header with JVM memory > 4 GB.
> > > >
> > > > This way we should be able to test it.
> > >
> > > Thanks for reviewing the patch!
> > >
> > > Yes, I'm working on an e2e test with `FlinkContainers`. In this case, we need two different JVMs (JM and TM), one with `UseCompressedOops` and one without it. Still working on the e2e test; will post it once it is ready.
> > >
> > > Last week, I was trying to reproduce the issue locally for Flink SQL with an inner join and custom datasets, but it is not reproducible locally. Note: with remote debugging I could confirm the issue still exists - basically the JM codegens `HashPartitioner` with `arrayBaseOffset` = 24 while on the TM side it is 16. What happens is the following:
> >
> > 1. Bytes [16 ... 31] are where the key is present (key length = 16).
> > 2. Bytes [32 ... 39] are filled with JVM garbage, which looks something like [1, 0, 0, 0, ... 0].
> > 3. The hash function reads bytes from [24 ... 39], where the first 8 bytes [24 ... 31] are valid bytes of the key and the remaining 8 bytes [32 ... 39] are just JVM garbage.
> > 4. The same pattern exists for both occurrences of the same key (in fact, for all the keys).
> > 5. Due to the above, the computed hashcode is consistent between both occurrences of the same key, so the records still get shuffled to the appropriate task, even though the hashcode computation is wrong.
> >
> > I'm still trying to construct a dataset where the [32 ... 39] bytes are different for the same key. Note: even in our internal environment, this issue is reproducible only when `nested fields projection pushdown` is enabled. This optimization is implemented in our internal connector.
>
> Upon investigating it further, it looks like the `json` FileSystemSource supports `nested fields projection pushdown`. With that enabled, I am able to reproduce it locally. Let me add an ITCase using that.

I think I got misled by [JsonFormatFactory.supportsNestedProjectionPushdown](https://github.com/apache/flink/blob/master/flink-formats/flink-json/src/main/java/org/apache/flink/formats/json/JsonFormatFactory.java#L113), but it seems the FileSystemTableSource that wraps the JsonFormat simply returns false. For a moment I thought I was able to reproduce it locally, but it seems I spoke too soon.
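
For reference, here is a minimal standalone sketch of the mismatch described in the steps above (the class name and the hardcoded `24` are illustrative assumptions, not code from the patch): a base offset captured at codegen time on one JVM misreads the key on a JVM whose `arrayBaseOffset(byte[].class)` is different. Run it once with `-XX:+UseCompressedOops` (offset 16) and once with `-XX:-UseCompressedOops` (offset 24) to see the two layouts.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical demo, not Flink code: shows how a byte[] base offset
// baked in on one JVM misreads the key on a JVM with a different layout.
public class BaseOffsetDemo {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long actualOffset = unsafe.arrayBaseOffset(byte[].class);
        System.out.println("arrayBaseOffset(byte[]) = " + actualOffset);

        // A 16-byte key, padded to 24 bytes so the misaligned read below
        // stays inside the allocation (the real bug reads past the array
        // into adjacent heap memory, i.e. "JVM garbage").
        byte[] key = new byte[24];
        for (int i = 0; i < 16; i++) {
            key[i] = (byte) (i + 1);
        }

        // What the local runtime reads: the first 8 bytes of the key.
        long good = unsafe.getLong(key, actualOffset);

        // What happens when an offset of 24 was captured at codegen time
        // but the local base offset is 16: the read starts 8 bytes into
        // the key and picks up trailing bytes that are not part of it.
        long hardcodedOffset = 24L; // illustrative value baked in remotely
        long bad = unsafe.getLong(key, hardcodedOffset);

        System.out.printf("read @ actual offset %d   : %016x%n", actualOffset, good);
        System.out.printf("read @ hardcoded offset %d: %016x%n", hardcodedOffset, bad);
    }
}
```

With compressed oops enabled locally, the second read starts 8 bytes into the key, which matches the [24 ... 39] read described in step 3 above.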