[Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-03 Thread Micah Kornfield
Hi Iceberg Dev, I tried searching the specification but couldn't find anything explicit: 1. Is it assumed that all data files and delete files will always have globally unique names in a table? 2. Is it expected that the pairwise intersection of all manifest files in a snapshot is empty

[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
Hi Iceberg dev, As we all know, in our current Apache Iceberg write path, the ORC file writer cannot simply roll over to a new file once its byte size reaches the expected threshold. The core reason we haven't supported this before is the lack of a correct approach to estimate the byte size from an unclosed ORC writer.

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread Kyle Bendickson
Hi OpenInx. Thanks for bringing this to our attention. And many thanks to hiliwei for their willingness to tackle big problems and little problems. I wanted to say that I think almost anything that's relatively close would likely be better than the current situation (where the feature is disabled

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread liwei li
Thanks to openinx for opening this discussion. One thing to note: the current approach faces a problem. Because of some optimization mechanisms, when writing a large amount of duplicate data, there will be some deviation between the estimated and the actual size. However, when cached data is flushed
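The deviation described above can be illustrated with a small sketch. This is not Iceberg or ORC code; the function names and the 4-byte-per-row dictionary id are illustrative assumptions. It shows why a rows-times-average-width estimate overshoots once the writer dictionary-encodes highly repetitive data:

```python
def naive_estimate(rows: int, avg_row_width: int) -> int:
    """Estimate buffered bytes as row count times average row width."""
    return rows * avg_row_width

def dictionary_encoded_size(values: list, width: int) -> int:
    """Rough size after dictionary encoding: one copy of each distinct
    value plus a small (assumed 4-byte) integer id per row."""
    distinct = set(values)
    return len(distinct) * width + len(values) * 4

values = ["same"] * 100_000   # heavily duplicated column
width = 64                    # assumed per-value width in bytes

est = naive_estimate(len(values), width)
actual = dictionary_encoded_size(values, width)
print(est, actual)  # the naive estimate far exceeds the encoded size
```

With only one distinct value, the encoded size stays near the per-row id overhead, while the naive estimate grows linearly with the raw value width, which matches the estimated-versus-actual gap reported in the thread.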

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
> As their widths are not the same, I think we may need to use an average width minus the batch.size (which is actually row count). @Kyle, sorry, I mistyped the word before. I meant "need an average width multiplied by the batch.size".
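The correction above (average width multiplied by batch.size, not minus) can be sketched as follows. This is a hypothetical illustration, not the ORC or Iceberg API; `estimate_unclosed_writer_bytes` and its parameters are invented names for the quantities discussed in the thread:

```python
def estimate_unclosed_writer_bytes(flushed_bytes: int,
                                   buffered_rows: int,
                                   avg_row_width: int) -> int:
    """Estimate the total size of a writer whose current batch has not
    yet been flushed: bytes already written to the file, plus the
    buffered row count multiplied by the average row width."""
    return flushed_bytes + buffered_rows * avg_row_width

# e.g. 8 MB already flushed, 1024 buffered rows of roughly 200 bytes each
print(estimate_unclosed_writer_bytes(8 * 1024 * 1024, 1024, 200))
```

An estimate like this would let the write path roll over to a new file once the projected size crosses the target threshold, rather than waiting for the writer to close.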