Hi Iceberg dev As we all know, in our current apache iceberg write path, the ORC file writer cannot just roll over to a new file once its byte size reaches the expected threshold. The core reason that we don't support this before is: The lack of correct approach to estimate the byte size from an unclosed ORC writer.
In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is trying to propose an estimate approach to fix this fundamentally (Also enabled all those ORC writer unit tests that we disabled intentionally before). The approach is: If a file is still unclosed , let's estimate its size in three steps ( PR: https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107 ) 1. Size of data that has been written to stripe.The value is obtained by summing the offset and length of the last stripe of the writer. 2. Size of data that has been submitted to the writer but has not been written to the stripe. When creating OrcFileAppender, treeWriter is obtained through reflection, and uses its estimateMemory to estimate how much memory is being used. 3. Data that has not been submitted to the writer, that is, the size of the buffer. The maximum default value of the buffer is used here. My feeling is: For the file-persisted bytes , I think using the last strip's offset plus its length should be correct. For the memory encoded batch vector , the TreeWriter#estimateMemory should be okay. But for the batch vector whose rows did not flush to encoded memory, using the batch.size shouldn't be correct. Because the rows can be any data type, such as Integer, Long, Timestamp, String etc. As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually). Another thing is about the `TreeWriter#estimateMemory` method, The current `org.apache.orc.Writer` don't expose the `TreeWriter` field or `estimateMemory` method to public, I will suggest to publish a PR to apache ORC project to expose those interfaces in `org.apache.orc.Writer` ( see: https://github.com/apache/iceberg/pull/3784/files#r819238427 ) I'd like to invite the iceberg dev to evaluate the current approach. Is there any other concern from the ORC experts' side ? Thanks.