Hi OpenInx, thank you for initiating this discussion. I think we can get the `TypeDescription` from the writer, and from the `TypeDescription` we know which types are present and, more precisely, the maximum length of the varchar/char columns. This will help us estimate the average width.
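To make that concrete, here is a minimal sketch of walking a `TypeDescription` to derive a per-row width estimate. The class name and the per-type byte widths are my own illustrative assumptions (notably the fallback width for STRING/BINARY), not the PR's implementation:

```java
import org.apache.orc.TypeDescription;

public class OrcRowWidthEstimator {

  // Fallback width (in bytes) for variable-length types that carry no
  // declared maximum length, e.g. STRING/BINARY. Purely an assumption.
  private static final int DEFAULT_VAR_WIDTH = 128;

  // Recursively estimate the width of one row of this schema in bytes.
  public static long estimateRowWidth(TypeDescription schema) {
    switch (schema.getCategory()) {
      case BOOLEAN:
      case BYTE:
        return 1;
      case SHORT:
        return 2;
      case INT:
      case FLOAT:
      case DATE:
        return 4;
      case LONG:
      case DOUBLE:
        return 8;
      case TIMESTAMP:
      case DECIMAL:
        return 16;
      case CHAR:
      case VARCHAR:
        // The declared max length bounds the average width from above.
        return schema.getMaxLength();
      case STRING:
      case BINARY:
        return DEFAULT_VAR_WIDTH;
      case STRUCT:
      case LIST:
      case MAP:
      case UNION:
        // Sum the children; for LIST/MAP this is a per-entry width,
        // which is only a rough proxy.
        long sum = 0;
        for (TypeDescription child : schema.getChildren()) {
          sum += estimateRowWidth(child);
        }
        return sum;
      default:
        return DEFAULT_VAR_WIDTH;
    }
  }
}
```

With such an estimate, the buffered-row contribution becomes `estimateRowWidth(schema) * batch.size` rather than `batch.size` alone.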
Also, I agree with your suggestion; I will open a PR later to add a public `estimateMemory` method to the Writer.

On 2022/03/04 04:01:04 OpenInx wrote:
> Hi Iceberg dev
>
> As we all know, in our current Apache Iceberg write path, the ORC file
> writer cannot simply roll over to a new file once its byte size reaches
> the expected threshold. The core reason we haven't supported this before
> is the lack of a correct approach to estimate the byte size of an
> unclosed ORC writer.
>
> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
> proposing an estimation approach to fix this fundamentally (it also
> re-enables all the ORC writer unit tests that we intentionally disabled
> before).
>
> The approach is: if a file is still unclosed, estimate its size in three
> steps (PR:
> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> ):
>
> 1. The size of the data that has been written to stripes. This value is
> obtained by summing the offset and length of the writer's last stripe.
> 2. The size of the data that has been submitted to the writer but not yet
> written to a stripe. When creating OrcFileAppender, the treeWriter is
> obtained through reflection, and its estimateMemory method is used to
> estimate how much memory is in use.
> 3. The data that has not been submitted to the writer, that is, the size
> of the buffer. The maximum default value of the buffer is used here.
>
> My feeling is:
>
> For the file-persisted bytes, I think using the last stripe's offset plus
> its length should be correct. For the memory-encoded batch vector,
> TreeWriter#estimateMemory should be okay.
> But for the batch vector whose rows have not yet been flushed to encoded
> memory, using batch.size alone isn't correct, because the rows can be of
> any data type, such as Integer, Long, Timestamp, String, etc. As their
> widths differ, I think we need to multiply batch.size (which is actually
> the row count) by an average width.
>
> Another thing is the `TreeWriter#estimateMemory` method: the current
> `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the
> `estimateMemory` method publicly, so I will suggest publishing a PR to
> the Apache ORC project to expose those interfaces in
> `org.apache.orc.Writer` (see:
> https://github.com/apache/iceberg/pull/3784/files#r819238427 )
>
> I'd like to invite the Iceberg dev community to evaluate the current
> approach. Are there any other concerns from the ORC experts' side?
>
> Thanks.
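To round out the discussion, here is a rough sketch of how the three parts of the estimate could be combined while `estimateMemory` is still not public. The reflection against the `treeWriter` field mirrors what the PR describes for `org.apache.orc.impl.WriterImpl`; the class and parameter names are hypothetical, and the last-stripe bytes are passed in rather than read from any particular API:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import org.apache.orc.Writer;

public class UnclosedOrcSizeEstimator {

  // Part 2: call TreeWriter#estimateMemory via reflection, since the
  // treeWriter field on WriterImpl is not publicly exposed yet.
  static long estimateEncodedMemory(Writer writer) throws Exception {
    Field treeWriterField = writer.getClass().getDeclaredField("treeWriter");
    treeWriterField.setAccessible(true);
    Object treeWriter = treeWriterField.get(writer);
    Method estimateMemory = treeWriter.getClass().getMethod("estimateMemory");
    estimateMemory.setAccessible(true);
    return (long) estimateMemory.invoke(treeWriter);
  }

  // Combine the three parts of the estimate:
  //   persistedBytes - offset + length of the last flushed stripe (part 1)
  //   writer         - the unclosed ORC writer, for the encoded-in-memory
  //                    bytes (part 2)
  //   bufferedRows   - rows added to the current batch but not yet
  //                    submitted to the writer (part 3)
  //   avgRowWidth    - average row width, e.g. from the TypeDescription
  //                    sketch earlier in this thread
  static long estimateFileSize(long persistedBytes, Writer writer,
                               long bufferedRows, long avgRowWidth)
      throws Exception {
    return persistedBytes
        + estimateEncodedMemory(writer)
        + bufferedRows * avgRowWidth; // width times row count, not row count alone
  }
}
```

Once ORC exposes `estimateMemory` on `org.apache.orc.Writer`, the reflection can be replaced by a direct call.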