Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

OpenInx Mon, 07 Mar 2022 00:38:27 -0800

Thanks Dongjoon & Yiqun for the quick PR for adding the `estimateMemory`
API.


Thanks also to Yiqun & Owen for your points; I think you are right. A
more accurate estimate would be to multiply batch.size by the average
width of the data types, and then multiply the result by the compression
ratio, which is usually an empirical value.
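
To make that arithmetic concrete, here is a rough sketch. The class name,
the parameter names, and the idea of passing in a precomputed average row
width are all hypothetical; none of this is an existing Iceberg or ORC API,
and the compression ratio is an empirical guess that the caller would supply.

```java
// Hypothetical helper for illustration only; not part of Iceberg or ORC.
final class UnclosedOrcSizeEstimate {
  private UnclosedOrcSizeEstimate() {}

  /**
   * Combines the three parts discussed in this thread:
   *   1. bytes already flushed to stripes (offset + length of the last stripe),
   *   2. the writer's in-memory estimate (an over-estimate, per Owen's note),
   *   3. rows still in the batch: row count * average row width * compression ratio.
   */
  static long estimateBytes(long flushedStripeBytes,
                            long encoderMemoryBytes,
                            int bufferedRows,
                            long avgRowWidthBytes,
                            double compressionRatio) {
    if (flushedStripeBytes < 0 || encoderMemoryBytes < 0
        || bufferedRows < 0 || avgRowWidthBytes < 0 || compressionRatio < 0) {
      throw new IllegalArgumentException("inputs must be non-negative");
    }
    // Widen to long before multiplying so a large batch cannot overflow int.
    long rawBatchBytes = (long) bufferedRows * avgRowWidthBytes;
    long batchBytes = Math.round(rawBatchBytes * compressionRatio);
    return flushedStripeBytes + encoderMemoryBytes + batchBytes;
  }
}
```

For example, with 1000 bytes already in stripes, a 500-byte memory estimate,
and 100 buffered rows of average width 16 bytes at an assumed ratio of 0.5,
the sketch returns 1000 + 500 + 800 = 2300 bytes.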



On Sat, Mar 5, 2022 at 2:28 AM Owen O'Malley <[email protected]> wrote:

> At the stripe boundaries, the bytes-on-disk statistics are accurate. The
> size of a stripe that is in flight is going to be an estimate, because the
> dictionaries can't be compressed until the stripe is flushed. The memory
> usage will be a significant overestimate, because it includes buffers that
> are allocated but not used yet.
>
> .. Owen
>
> On Fri, Mar 4, 2022 at 5:23 PM Dongjoon Hyun <[email protected]> wrote:
>
>> The following is merged for Apache ORC 1.7.4.
>>
>> ORC-1123 Add `estimationMemory` method for writer
>>
>> According to the Apache ORC milestone, it will be released on May 15th.
>>
>> https://github.com/apache/orc/milestones
>>
>> Bests,
>> Dongjoon.
>>
>> On 2022/03/04 13:11:15 Yiqun Zhang wrote:
>> > Hi Openinx
>> >
>> > Thank you for initiating this discussion. I think we can get the
>> > `TypeDescription` from the writer, and from the `TypeDescription` we know
>> > the column types and, more precisely, the maximum length of each
>> > varchar/char. This will help us estimate the average width.
>> >
>> > Also, I agree with your suggestion; I will make a PR later to add the
>> > `estimateMemory` public method for Writer.
>> >
>> > On 2022/03/04 04:01:04 OpenInx wrote:
>> > > Hi Iceberg dev
>> > >
>> > > As we all know, in our current Apache Iceberg write path, the ORC file
>> > > writer cannot roll over to a new file once its byte size reaches the
>> > > expected threshold. The core reason that we have not supported this
>> > > before is the lack of a correct approach to estimate the byte size of an
>> > > unclosed ORC writer.
>> > >
>> > > In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
>> > > proposing an estimation approach to fix this fundamentally (it also
>> > > enables all the ORC writer unit tests that we intentionally disabled
>> > > before).
>> > >
>> > > The approach is: if a file is still unclosed, let's estimate its size in
>> > > three steps (PR:
>> > > https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
>> > > )
>> > >
>> > > 1. Size of data that has been written to stripes. The value is obtained
>> > > by summing the offset and length of the writer's last stripe.
>> > > 2. Size of data that has been submitted to the writer but has not yet been
>> > > written to a stripe. When creating OrcFileAppender, the treeWriter is
>> > > obtained through reflection, and its estimateMemory is used to estimate
>> > > how much memory is being used.
>> > > 3. Data that has not been submitted to the writer, that is, the size of
>> > > the buffer. The maximum default value of the buffer is used here.
>> > >
>> > > My feeling is:
>> > >
>> > > For the file-persisted bytes, I think using the last stripe's offset plus
>> > > its length should be correct. For the in-memory encoded batch vector, the
>> > > TreeWriter#estimateMemory should be okay.
>> > > But for the batch vector whose rows have not been flushed to encoded
>> > > memory, using the batch.size shouldn't be correct, because the rows can
>> > > be of any data type, such as Integer, Long, Timestamp, String, etc. As
>> > > their widths are not the same, I think we may need to use an average
>> > > width multiplied by the batch.size (which is actually the row count).
>> > >
>> > > Another thing is about the `TreeWriter#estimateMemory` method. The
>> > > current `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or
>> > > the `estimateMemory` method publicly, so I suggest opening a PR against
>> > > the Apache ORC project to expose those interfaces in
>> > > `org.apache.orc.Writer` (
>> > > see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
>> > >
>> > > I'd like to invite the Iceberg devs to evaluate the current approach. Is
>> > > there any other concern from the ORC experts' side?
>> > >
>> > > Thanks.
>> > >
>> >
>>
>
