+1 non-binding from me.

I love this idea. Even though S3 supports reading from the tail, this value can
still be useful when the guessed tail size is too small, which forces an
additional read to fetch the rest of the Parquet footer.
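
To make the mis-hinted case concrete, here is a minimal Python sketch, not taken
from any Iceberg or Parquet library. It assumes the standard unencrypted Parquet
trailer (footer bytes, then a 4-byte little-endian footer length, then the
4-byte magic PAR1). The function names and the returned read counter are
hypothetical and only there for illustration.

```python
import io
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer_with_hint(f, file_size, footer_len):
    """Footer length already known (e.g. from manifest metadata): one read."""
    f.seek(file_size - footer_len - TRAILER_LEN)
    return f.read(footer_len), 1


def read_footer_by_guess(f, file_size, guess):
    """Tail read with a guessed size; returns (footer_bytes, number_of_reads)."""
    tail_len = min(max(guess, TRAILER_LEN), file_size)
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + TRAILER_LEN <= tail_len:
        # The guess covered the whole footer: no further I/O needed.
        return tail[-(footer_len + TRAILER_LEN):-TRAILER_LEN], 1
    # The guess was too small: a second read is required.
    f.seek(file_size - footer_len - TRAILER_LEN)
    return f.read(footer_len), 2
```

With a known footer length the footer always costs one read. With a guess, any
footer larger than the guess costs two.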

On Mon, Feb 10, 2025, at 09:21, Anton Okolnychyi wrote:
> +1 from me, I'd love to see this implemented (maybe even in V3 if anyone is 
> willing to pick it up?).
> 
> Eduard and I were discussing DV file compaction where we need to know the 
> ratio of live vs orphan DVs in a particular DV file. Manifests contain sizes 
> of individual DV blobs as well as the total DV file size. In order to compute 
> the ratio of live DVs accurately, we have to subtract the footer size from 
> the total file size. Doing this without an extra read would be great.
> 
> - Anton
> 
> On Sun, Feb 9, 2025 at 13:51 Daniel Weeks <dwe...@apache.org> wrote:
>> Hey Sreeram,
>> 
>> Sounds like there's a fair amount of interest/support for this. Anton also
>> mentioned that having this information would help estimate orphaned DVs, so
>> there are multiple cases where this would be beneficial.
>> 
>> We might want to tie this change to a format version release (even if just 
>> an optional field) because any metadata rewrites may result in dropping the 
>> value.
>> 
>> Did you want to put together a proposal for the changes?
>> 
>> Best,
>> -Dan
>> 
>> On Sat, Feb 8, 2025 at 11:31 AM Micah Kornfield <emkornfi...@gmail.com> 
>> wrote:
>>> +1  I think this is probably useful for wider schemas that typically have 
>>> larger footers that go past the heuristic.  
>>> 
>>> It would be good to have some concrete numbers on how much this impacts 
>>> workloads before committing to it.
>>> 
>>> Cheers,
>>> Micah
>>> 
>>> On Thu, Jan 30, 2025 at 7:10 AM Steve Loughran 
>>> <ste...@cloudera.com.invalid> wrote:
>>>> 
>>>> Knowing the footer offset would be really useful if passed down to 
>>>> whatever is implementing the input stream, along with the actual file size.
>>>> 
>>>> This can be used for prefetching the footer, as well as caching it (Azure
>>>> ABFS, Google GCS connectors): right now they guess that about 1MB is all
>>>> they need.
>>>> 
>>>> While readTail() can get bytes off the end, it doesn't pass that
>>>> information down to the stream so that the stream can do its own thing.
>>>> 
>>>> The Analytics stream which the AWS S3 team are getting into the s3a code 
>>>> (https://issues.apache.org/jira/browse/HADOOP-19363) goes one step further 
>>>> than the others: it parses that footer itself and tries to predict where 
>>>> application code is going to read next: as you read one rowgroup it
>>>> speculatively fetches the next one, even as the first one is downloaded.
>>>> 
>>>> Again, it guesses on footer size: pass that in and they will know what to 
>>>> fetch and store. Ideally this should be accompanied by file type (parquet, 
>>>> avro) and your actual read plans (vectored, random, sequential, 
>>>> whole-file). With this information you can cut out a number of 
>>>> wasted/inefficient S3 calls, and tune fetching/caching policy 
>>>> appropriately.
>>>> 
>>>> Anyway:
>>>> +1 to footer length, and if already known, file length should come down
>>>> too, along with that read plan. Saying "parquet, vectored, random" will be
>>>> enough, which is what a draft PR I have for Hadoop FileIO does.
>>>> 
>>>> On Wed, 22 Jan 2025 at 03:39, Sreeram Garlapati <gsreeramku...@gmail.com> 
>>>> wrote:
>>>>> Thanks for the nice idea/suggestion, Dan. 
>>>>> Yes, we have been employing a technique similar to the one you noted
>>>>> below, and arrived at the conclusion that there is no deterministic way
>>>>> to achieve the optimal case, i.e., a single I/O call to S3 to read the
>>>>> Parquet footer.
>>>>> 
>>>>> Best,
>>>>> Sreeram
>>>>> 
>>>>> On Tue, Jan 21, 2025 at 4:20 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>> Hey Sreeram,
>>>>>> 
>>>>>> I think it's worthwhile to consider what value would be added by 
>>>>>> tracking the footer size in metadata, but there are other options to 
>>>>>> address these optimization use cases.
>>>>>> 
>>>>>> For example, if you take a look at the RangeReadable 
>>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/RangeReadable.java#L68>
>>>>>>  interface for FileIO implementations, there's a readTail method so that 
>>>>>> you can optimistically read from the tail end of the file to try to 
>>>>>> fetch the full footer in a single read.  This is even optimized in some 
>>>>>> of the implementations (like S3InputStream) to leverage backward reads 
>>>>>> as opposed to seek operations which might have overhead.
>>>>>> 
>>>>>> Depending on the size of the file, you may want to load just the tail or 
>>>>>> the whole file to avoid all reads.  Having the exact value definitely 
>>>>>> will make this more exact, but I feel like using the above approach can 
>>>>>> approximate the same performance benefits.
>>>>>> 
>>>>>> Just a thought,
>>>>>> -Dan
>>>>>> 
>>>>>> On Tue, Jan 21, 2025 at 12:17 PM Sreeram Garlapati 
>>>>>> <gsreeramku...@gmail.com> wrote:
>>>>>>> Hello Team!
>>>>>>> 
>>>>>>> This is a small improvement proposal to store the _*parquet footer 
>>>>>>> size*_ as part of the *data_file* metadata in the iceberg manifest 
>>>>>>> <https://iceberg.apache.org/spec/#manifests>. 
>>>>>>> **manifest_entry   >   (2) data_file  >  (146 Optional) 
>>>>>>> footer_size_in_bytes**
>>>>>>> 
>>>>>>> _Motivation_:
>>>>>>>  • We have several sub-second read use cases on Iceberg tables. We
>>>>>>> store Iceberg metadata and Parquet files on S3. Every hop to S3 is very
>>>>>>> expensive (P99 of >200 milliseconds). Hence we are trying to see if we
>>>>>>> can optimize by cutting down any of these hops. One such hop occurs
>>>>>>> during the Parquet file read: the first read fetches the last 8 bytes
>>>>>>> of the file, which hold the footer size and the PAR1 magic sequence.
>>>>>>>  • Iceberg metadata already includes the file_size_in_bytes. Including
>>>>>>> the footer size benefits all the readers, i.e., readers can directly
>>>>>>> issue 1 I/O call to read the footer - *read_parquet_footer(filehandle,
>>>>>>> offset=file_size_in_bytes-footer_size_in_bytes-8)*
>>>>>>>  • This is similar to what we have in the iceberg specification in the 
>>>>>>> case of storing Table statistics 
>>>>>>> <https://iceberg.apache.org/spec/#table-statistics>, puffins > 
>>>>>>> `*file-footer-size-in-bytes*`.
>>>>>>>  • This can be easily extended to ORC as needed too. Perhaps, in the 
>>>>>>> ORC case, an additional property to store the postscript length is also 
>>>>>>> needed.
>>>>>>> Truly appreciate your thoughts,
>>>>>>> Sreeram <https://www.linkedin.com/in/sreeramgarlapati> 
>>>>>>> 
Xuanwo

https://xuanwo.io/
