> As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually).
@Kyle, sorry I mistyped the word before. I meant "need an average width multiplied by the batch.size".

On Fri, Mar 4, 2022 at 1:29 PM liwei li <hilili...@gmail.com> wrote:

> Thanks to openinx for opening this discussion.
>
> One thing to note: the current approach faces a problem. Because of some optimization mechanisms, when writing a large amount of duplicate data there will be some deviation between the estimated and the actual size. However, when cached data is flushed (the amount or size exceeds the threshold), the estimate is revised. That's one of the reasons I changed this:
> https://github.com/apache/iceberg/pull/3784#discussion_r819229491
>
> On Fri, Mar 4, 2022 at 12:55 PM Kyle Bendickson <k...@tabular.io> wrote:
>
>> Hi Openinx.
>>
>> Thanks for bringing this to our attention. And many thanks to hiliwei for their willingness to tackle big problems and little problems.
>>
>> I wanted to say that I think most anything that's relatively close would most likely be better than the current situation (where the feature is disabled entirely).
>>
>> Thank you for your succinct summary of the situation. I tagged Dongjoon Hyun, one of the ORC VPs, in the PR and will reach out to him as well.
>>
>> I am inclined to agree that we need to consider the width of the types, as fields like binary or even string can be potentially quite wide compared to int.
>>
>> I like your suggestion to use an "average width" together with the batch size, though subtracting the batch size from the average width seems slightly off… I would think the average width needs to be multiplied or divided by the batch size. Possibly I'm not understanding fully.
>>
>> How would you propose to get an "average width" for the data that has not been flushed to disk yet? And would it be an average width based on the actually observed data or just on the types?
>>
>> Again, I think that any approach is better than none, and we can iterate on the statistics collection. But I am inclined to agree, points (1) and (2) seem okay. And it would be beneficial to consider the points raised regarding (3).
>>
>> Thanks for bringing this to the dev list.
>>
>> And many thanks to hiliwei for their work so far!
>>
>> - Kyle
>>
>> On Thu, Mar 3, 2022 at 8:01 PM OpenInx <open...@gmail.com> wrote:
>>
>>> Hi Iceberg dev,
>>>
>>> As we all know, in our current Apache Iceberg write path the ORC file writer cannot just roll over to a new file once its byte size reaches the expected threshold. The core reason we haven't supported this before is the lack of a correct approach to estimating the byte size of an unclosed ORC writer.
>>>
>>> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is trying to propose an estimation approach to fix this fundamentally (and also to re-enable all those ORC writer unit tests that we intentionally disabled before).
>>>
>>> The approach is: if a file is still unclosed, estimate its size in three steps (PR: https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107):
>>>
>>> 1. Size of data that has been written to a stripe. The value is obtained by summing the offset and length of the last stripe of the writer.
>>> 2. Size of data that has been submitted to the writer but has not been written to a stripe. When creating OrcFileAppender, treeWriter is obtained through reflection, and its estimateMemory is used to estimate how much memory is being used.
>>> 3. Data that has not been submitted to the writer, that is, the size of the buffer. The maximum default value of the buffer is used here.
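A minimal sketch of how the three components above could be combined into one size estimate; the class, method, and parameter names below are illustrative and are not taken from the PR:

    /**
     * Illustrative combination of the three estimate components:
     *  1. bytes already persisted in stripes,
     *  2. bytes encoded in memory by the TreeWriter but not yet flushed,
     *  3. rows still buffered in the current row batch.
     */
    final class OrcSizeEstimateSketch {
      private OrcSizeEstimateSketch() {
      }

      static long estimateUnclosedFileSize(
          long lastStripeOffset,     // offset of the writer's last flushed stripe
          long lastStripeLength,     // length of that stripe
          long treeWriterMemory,     // result of TreeWriter#estimateMemory
          int bufferedRowCount,      // rows in the current batch (batch.size)
          double avgRowWidthBytes) { // average row width in bytes
        // 1. Bytes already written to the file.
        long persistedBytes = lastStripeOffset + lastStripeLength;

        // 2. Bytes submitted to the writer but not yet written to a stripe.
        long encodedBytes = treeWriterMemory;

        // 3. Buffered rows: average row width multiplied by the row count,
        //    as proposed in the correction at the top of this thread.
        long batchBytes = (long) Math.ceil(avgRowWidthBytes * bufferedRowCount);

        return persistedBytes + encodedBytes + batchBytes;
      }
    }

An open question in the thread is how avgRowWidthBytes should be obtained: from the declared column types or from the data observed so far.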
>>> My feeling is:
>>>
>>> For the file-persisted bytes, I think using the last stripe's offset plus its length should be correct. For the in-memory encoded batch vector, TreeWriter#estimateMemory should be okay.
>>> But for the batch vector whose rows have not yet been flushed to encoded memory, using the batch.size alone shouldn't be correct, because the rows can be of any data type, such as Integer, Long, Timestamp, String, etc. As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually).
>>>
>>> Another thing is the `TreeWriter#estimateMemory` method. The current `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the `estimateMemory` method publicly, so I would suggest publishing a PR to the Apache ORC project to expose those interfaces in `org.apache.orc.Writer` (see: https://github.com/apache/iceberg/pull/3784/files#r819238427).
>>>
>>> I'd like to invite the Iceberg dev community to evaluate the current approach. Is there any other concern from the ORC experts' side?
>>>
>>> Thanks.
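For reference, a rough sketch of the reflection-based access to TreeWriter#estimateMemory mentioned in step (2) and in the closing paragraph above. The "treeWriter" field name and the estimateMemory() method are ORC implementation details rather than public API (which is exactly why the thread proposes exposing them on org.apache.orc.Writer), so treat this as an assumption that may not hold across ORC versions:

    import java.lang.reflect.Field;
    import java.lang.reflect.Method;

    import org.apache.orc.Writer;

    /**
     * Sketch of reading TreeWriter#estimateMemory reflectively from an open ORC writer.
     * Relies on ORC internals (a private "treeWriter" field on the writer implementation),
     * so it is best-effort only.
     */
    final class TreeWriterMemorySketch {
      private TreeWriterMemorySketch() {
      }

      static long estimateMemory(Writer writer) {
        try {
          // The concrete writer implementation holds a private TreeWriter field.
          Field treeWriterField = writer.getClass().getDeclaredField("treeWriter");
          treeWriterField.setAccessible(true);
          Object treeWriter = treeWriterField.get(writer);

          // TreeWriter#estimateMemory reports the memory used by data buffered for the stripe.
          Method estimateMemory = treeWriter.getClass().getMethod("estimateMemory");
          estimateMemory.setAccessible(true);
          return (long) estimateMemory.invoke(treeWriter);
        } catch (ReflectiveOperationException e) {
          // If ORC internals change, fall back to 0 and let callers treat the estimate as partial.
          return 0L;
        }
      }
    }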