> As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually).
@Kyle, sorry I mistyped the word before. I meant "need an average width multiplied by the batch.size".

On Fri, Mar 4, 2022 at 1:29 PM liwei li <hilili...@gmail.com> wrote:

> Thanks to openinx for opening this discussion.
>
> One thing to note: the current approach faces a problem. Because of some optimization mechanisms, when writing a large amount of duplicate data there will be some deviation between the estimated and the actual size. However, when cached data is flushed (the amount or size exceeds the threshold), the estimate is revised. That's one of the reasons I changed this:
> https://github.com/apache/iceberg/pull/3784#discussion_r819229491
>
> On Fri, Mar 4, 2022 at 12:55 PM Kyle Bendickson <k...@tabular.io> wrote:
>
>> Hi Openinx.
>>
>> Thanks for bringing this to our attention. And many thanks to hiliwei for their willingness to tackle big problems and little problems.
>>
>> I wanted to say that I think most anything that's relatively close would most likely be better than the current situation (where the feature is disabled entirely).
>>
>> Thank you for your succinct summary of the situation. I tagged Dongjoon Hyun, one of the ORC VPs, in the PR and will reach out to him as well.
>>
>> I am inclined to agree that we need to consider the width of the types, as fields like binary or even string can be potentially quite wide compared to int.
>>
>> I like your suggestion to use an "average width" together with the batch size, though subtracting the batch size from the average width seems slightly off… I would think the average width needs to be multiplied or divided by the batch size. Possibly I'm not understanding fully.
>>
>> How would you propose to get an "average width" for the data that has not been flushed to disk yet? And would it be an average width based on the actually observed data or just on the types?
>>
>> Again, I think that any approach is better than none, and we can iterate on the statistics collection. But I am inclined to agree, points (1) and (2) seem okay. And it would be beneficial to consider the points raised regarding (3).
>>
>> Thanks for bringing this to the dev list.
>>
>> And many thanks to hiliwei for their work so far!
>>
>> - Kyle
>>
>> On Thu, Mar 3, 2022 at 8:01 PM OpenInx <open...@gmail.com> wrote:
>>
>>> Hi Iceberg dev,
>>>
>>> As we all know, in our current Apache Iceberg write path the ORC file writer cannot just roll over to a new file once its byte size reaches the expected threshold. The core reason we haven't supported this before is the lack of a correct approach to estimating the byte size of an unclosed ORC writer.
>>>
>>> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is trying to propose an estimation approach to fix this fundamentally (and also to re-enable all those ORC writer unit tests that we intentionally disabled before).
>>>
>>> The approach is: if a file is still unclosed, estimate its size in three steps (PR: https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107):
>>>
>>> 1. Size of data that has been written to a stripe. The value is obtained by summing the offset and length of the last stripe of the writer.
>>> 2. Size of data that has been submitted to the writer but has not been written to a stripe. When creating OrcFileAppender, treeWriter is obtained through reflection, and its estimateMemory is used to estimate how much memory is being used.
>>> 3. Data that has not been submitted to the writer, that is, the size of the buffer. The maximum default value of the buffer is used here.
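A minimal sketch of how the three components above could be combined into one size estimate; the class, method, and parameter names below are illustrative and are not taken from the PR:

    /**
     * Illustrative combination of the three estimate components:
     *  1. bytes already persisted in stripes,
     *  2. bytes encoded in memory by the TreeWriter but not yet flushed,
     *  3. rows still buffered in the current row batch.
     */
    final class OrcSizeEstimateSketch {
      private OrcSizeEstimateSketch() {
      }

      static long estimateUnclosedFileSize(
          long lastStripeOffset,     // offset of the writer's last flushed stripe
          long lastStripeLength,     // length of that stripe
          long treeWriterMemory,     // result of TreeWriter#estimateMemory
          int bufferedRowCount,      // rows in the current batch (batch.size)
          double avgRowWidthBytes) { // average row width in bytes
        // 1. Bytes already written to the file.
        long persistedBytes = lastStripeOffset + lastStripeLength;

        // 2. Bytes submitted to the writer but not yet written to a stripe.
        long encodedBytes = treeWriterMemory;

        // 3. Buffered rows: average row width multiplied by the row count,
        //    as proposed in the correction at the top of this thread.
        long batchBytes = (long) Math.ceil(avgRowWidthBytes * bufferedRowCount);

        return persistedBytes + encodedBytes + batchBytes;
      }
    }

An open question in the thread is how avgRowWidthBytes should be obtained: from the declared column types or from the data observed so far.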
>>> My feeling is:
>>>
>>> For the file-persisted bytes, I think using the last stripe's offset plus its length should be correct. For the in-memory encoded batch vector, TreeWriter#estimateMemory should be okay.
>>> But for the batch vector whose rows have not yet been flushed to encoded memory, using the batch.size alone shouldn't be correct, because the rows can be of any data type, such as Integer, Long, Timestamp, String, etc. As their widths are not the same, I think we may need to use an average width minus the batch.size (which is row count actually).
>>>
>>> Another thing is the `TreeWriter#estimateMemory` method. The current `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the `estimateMemory` method publicly, so I would suggest publishing a PR to the Apache ORC project to expose those interfaces in `org.apache.orc.Writer` (see: https://github.com/apache/iceberg/pull/3784/files#r819238427).
>>>
>>> I'd like to invite the Iceberg dev community to evaluate the current approach. Is there any other concern from the ORC experts' side?
>>>
>>> Thanks.
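For reference, a rough sketch of the reflection-based access to TreeWriter#estimateMemory mentioned in step (2) and in the closing paragraph above. The "treeWriter" field name and the estimateMemory() method are ORC implementation details rather than public API (which is exactly why the thread proposes exposing them on org.apache.orc.Writer), so treat this as an assumption that may not hold across ORC versions:

    import java.lang.reflect.Field;
    import java.lang.reflect.Method;

    import org.apache.orc.Writer;

    /**
     * Sketch of reading TreeWriter#estimateMemory reflectively from an open ORC writer.
     * Relies on ORC internals (a private "treeWriter" field on the writer implementation),
     * so it is best-effort only.
     */
    final class TreeWriterMemorySketch {
      private TreeWriterMemorySketch() {
      }

      static long estimateMemory(Writer writer) {
        try {
          // The concrete writer implementation holds a private TreeWriter field.
          Field treeWriterField = writer.getClass().getDeclaredField("treeWriter");
          treeWriterField.setAccessible(true);
          Object treeWriter = treeWriterField.get(writer);

          // TreeWriter#estimateMemory reports the memory used by data buffered for the stripe.
          Method estimateMemory = treeWriter.getClass().getMethod("estimateMemory");
          estimateMemory.setAccessible(true);
          return (long) estimateMemory.invoke(treeWriter);
        } catch (ReflectiveOperationException e) {
          // If ORC internals change, fall back to 0 and let callers treat the estimate as partial.
          return 0L;
        }
      }
    }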