Hi OpenInx. Thanks for bringing this to our attention, and many thanks to hiliwei for their willingness to tackle problems both big and small.
I wanted to say that I think almost any estimate that is reasonably close would likely be better than the current situation, where the feature is disabled entirely. Thank you for your succinct summary of the situation. I tagged Dongjoon Hyun, one of the ORC VPs, in the PR and will reach out to him as well.

I'm inclined to agree that we need to consider the width of the types, since fields like binary or even string can be quite wide compared to int. I like your suggestion to use an "average width" together with the batch size, though subtracting the batch size from the average width seems slightly off; I would think the average width needs to be multiplied by the batch size (or a byte total divided by the batch size to obtain the width), rather than subtracted. Possibly I'm not understanding fully. How would you propose to get an "average width" for the data that has not yet been flushed to disk? And would it be an average width based on the actually observed data, or just on the types?
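To make sure I'm describing the same idea, here is a rough, untested sketch of the kind of estimator I have in mind. Every name in it (UnflushedBatchEstimator, recordFlush, schemaWidthGuess, and so on) is made up for illustration and is not taken from the PR:

    // Sketch only: track an observed average row width and use
    // (average width) * (rows still buffered in the batch) as the
    // estimate for data that has not reached the ORC writer yet.
    class UnflushedBatchEstimator {
      private long observedRows = 0;   // rows already handed to the writer
      private long observedBytes = 0;  // bytes attributed to those rows

      // Call whenever a batch is flushed to the writer, e.g. with the
      // delta of TreeWriter#estimateMemory before and after the flush.
      void recordFlush(long rowCount, long byteDelta) {
        observedRows += rowCount;
        observedBytes += byteDelta;
      }

      // Estimate bytes for rows still sitting in the VectorizedRowBatch.
      // Falls back to a schema-derived guess until we have observations.
      long estimateBufferedBytes(int rowsInBatch, long schemaWidthGuess) {
        long avgWidth = observedRows > 0
            ? observedBytes / observedRows
            : schemaWidthGuess;
        return avgWidth * rowsInBatch;
      }
    }

In other words, the row count (batch.size) would scale an average width rather than be subtracted from it, and the width could start from a schema-based guess and then be refined from what the writer actually observes.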
Again, I think any approach is better than none, and we can iterate on the statistics collection. I'm inclined to agree that points (1) and (2) seem okay, and it would be beneficial to consider the concerns raised about (3).

Thanks for bringing this to the dev list, and many thanks to hiliwei for their work so far!

- Kyle

On Thu, Mar 3, 2022 at 8:01 PM OpenInx <open...@gmail.com> wrote:

> Hi Iceberg dev,
>
> As we all know, in the current Apache Iceberg write path, the ORC file
> writer cannot roll over to a new file once its byte size reaches the
> expected threshold. The core reason we haven't supported this before is
> the lack of a correct approach to estimate the byte size of an unclosed
> ORC writer.
>
> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
> proposing an estimation approach to fix this fundamentally (and also
> re-enabling all of the ORC writer unit tests that we had intentionally
> disabled before).
>
> The approach is: if a file is still unclosed, estimate its size in three
> steps (PR:
> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> ):
>
> 1. Size of data that has been written to stripes. The value is obtained
> by summing the offset and length of the writer's last stripe.
> 2. Size of data that has been submitted to the writer but has not yet
> been written to a stripe. When creating OrcFileAppender, treeWriter is
> obtained through reflection, and its estimateMemory is used to estimate
> how much memory is being used.
> 3. Data that has not been submitted to the writer, i.e. the size of the
> buffer. The maximum default value of the buffer is used here.
>
> My feeling is:
>
> For the file-persisted bytes, I think using the last stripe's offset plus
> its length should be correct. For the memory-encoded batch vector,
> TreeWriter#estimateMemory should be okay. But for the batch vector whose
> rows have not been flushed to encoded memory, using batch.size shouldn't
> be correct, because the rows can be of any data type, such as Integer,
> Long, Timestamp, String, etc. Since their widths are not the same, I
> think we may need to use an average width minus the batch.size (which is
> actually the row count).
>
> Another thing is the `TreeWriter#estimateMemory` method. The current
> `org.apache.orc.Writer` does not expose the `TreeWriter` field or the
> `estimateMemory` method publicly, so I would suggest publishing a PR to
> the Apache ORC project to expose those interfaces in
> `org.apache.orc.Writer` (see:
> https://github.com/apache/iceberg/pull/3784/files#r819238427).
>
> I'd like to invite the Iceberg dev community to evaluate the current
> approach. Is there any other concern from the ORC experts' side?
>
> Thanks.
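P.S. For concreteness, my reading of the reflection workaround described above is roughly the sketch below. The "treeWriter" field name and the method lookup are my assumptions rather than verified details of the PR, and the whole thing would go away if `estimateMemory` were exposed publicly on `org.apache.orc.Writer`:

    // Sketch of the reflection-based access to TreeWriter#estimateMemory.
    // The "treeWriter" field name is assumed here, not verified.
    static long estimateInMemoryBytes(org.apache.orc.Writer writer) {
      try {
        java.lang.reflect.Field field =
            writer.getClass().getDeclaredField("treeWriter");
        field.setAccessible(true);
        Object treeWriter = field.get(writer);
        java.lang.reflect.Method estimateMemory =
            treeWriter.getClass().getMethod("estimateMemory");
        estimateMemory.setAccessible(true);
        return (long) estimateMemory.invoke(treeWriter);
      } catch (ReflectiveOperationException e) {
        return 0L; // no estimate rather than failing the write path
      }
    }

Exposing this through a public method on `org.apache.orc.Writer`, as suggested, would let Iceberg drop the reflection entirely.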