dmvk commented on code in PR #24079: URL: https://github.com/apache/flink/pull/24079#discussion_r1450490276
########## flink-runtime/src/main/java/org/apache/flink/runtime/state/CompressibleFSDataOutputStream.java: ########## @@ -41,6 +41,10 @@ public CompressibleFSDataOutputStream( @Override public long getPos() throws IOException { + // Underlying compression involves buffering, so the only way to report correct position is + // to flush the underlying stream. This lowers the effectivity of compression, but there is + // no other way, since the position is often used as a split point. + flush(); Review Comment: There is a general rule that compressed blocks can't be split. This is why most systems distinguish between per-file, per-block or per-record compression. In this case we're using per-record compression. It could be optimized by some form of record batching / block compression, but that would need to happen further up the stack, which goes beyond a bug fix. --- > It's somewhat strange that the result is different whether we call getPos() in the process of compressing the output, or we don't. This may later come as a big surprise for someone using CompressibleFSDataOutputStream. Not really. The position corresponds to the uncompressed stream. If you haven't flushed any data into it .. the position won't move. > IMO, a better - albeit more involved - approach may be to change the semantics of getPos() for CompressibleFSDataOutputStream. This can represent the position in the uncompressed data. `getPos()` has a strictly defined contract on the interface level, we need to hold up to it. Chaging it for a single implementation violates Liskov Substitution Principle. ``` * <p>This method must report accurately report the current position of the stream. Various * components of the high-availability and recovery logic rely on the accurate ``` --- As I said, this could clearly be optimized by switching from a per-record compression to block compression, but that goes beyond critical bugfix and should be addressed separately (I'll be happy to do a review if you have any ideas). WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org