Hi Ajantha,

Thank you for spending the time to look into this.

re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
of data, and some stats sketches or indices can be bigger than others.
Also, Parquet's row-oriented logical model with columnar storage doesn't
give as much benefit for what's closer to key-value storage.

re b:
this is still TBD, see e.g.
https://github.com/apache/iceberg/pull/4945
https://github.com/apache/iceberg/pull/5021

re c, e:
for partition-level stats, it's not decided yet how they will be handled

re d:
yes, ANALYZE can be a separate operation; see
https://github.com/trinodb/trino/pull/12317 for a POC

Best regards,
PF



On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Thank you Piotr for all of the work you’ve put into this.
>
> I just checked the spec. I have a few newbie questions.
>
> a. Instead of using an existing columnar format like parquet (one file for
> one type of stats) to store indexes, any reason why we have developed our
> own format and any benchmarks taken against Puffin vs other formats?
>
> b. How these Puffin files are linked to Iceberg's metadata files is still
> a missing link for me. As the Puffin spec says, these stats are table level
> (updated per snapshots). So, do we need an Iceberg spec change to store the
> file names of these Puffin files so that remove_orphan_files will not
> clean it up accidentally? (also needed for expire_snapshots)
>
> c. NDVs are column-level stats. So, I expect the latest Puffin file of
> that snapshot will have one row of stats representing stats for each
> column. But if we are to implement secondary indexes or table-level partition
> stats, there can be many rows (millions) in Puffin based on the dataset.
> So, for every commit, do we need to read the previous snapshot's Puffin
> file and write back a new file with updated stats? (The file might become
> huge as the data grows.) I think it will affect the commit time. Any
> thoughts on this?
>
> d. Slightly related to the above point, do we plan to asynchronously
> support collecting the stats like "ANALYZE table" and modify the table
> metadata with the stats file names? (might need an Iceberg commit to write
> new table metadata)
>
> e. Even though table-level partition stats are available from the _partition
> metadata table (along with filter push down support), computing the metadata
> table per query will be expensive.
> Hence, we are looking forward to storing them in the Puffin format. But
> I'm not sure about storing them as a single file with millions of rows.
> I would like to collaborate and discuss more on this.
>
> Thanks,
> Ajantha
>
> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang <miw...@adobe.com.invalid>
> wrote:
>
>> +1 on the format! It looks great!
>>
>>
>>
>> Thanks for materializing the initial design idea.
>>
>>
>>
>> Miao
>>
>> *From: *Kyle Bendickson <kjbendick...@gmail.com>
>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>> statistics and indexes
>>
>> +1 [non-binding]
>>
>>
>>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>>
>>
>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>> be used in many novel ways due to its well thought out generic design and
>> incorporation of the ability to extend with new sketches.
>>
>>
>>
>> Looking forward to the improvements this will bring.
>>
>>
>>
>> - Kyle
>>
>>
>>
>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <alex...@starburstdata.com>
>> wrote:
>>
>> +1, let's do it!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <jzh...@apache.org> wrote:
>>
>> +1  Looking forward to the features it enables.
>>
>>
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>> +1. Looking forward to the partition stats.
>>
>> Best,
>>
>>
>>
>> Yufei
>>
>>
>>
>>
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>> +1 as well.  Excited about the progress here.
>>
>>
>>
>> -Dan
>>
>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <chenjunjied...@gmail.com>
>> wrote:
>>
>> +1, really nice! Indexes are coming!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>>
>>
>> Had a few questions of what the reference to Puffin file(s) will be in
>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>
>>
>>
>> Thanks,
>>
>> Szehon
>>
>>
>>
>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <b...@tabular.io> wrote:
>>
>> +1 from me!
>>
>>
>>
>> There may also be people that haven't followed the design discussions and
>> we can start a DISCUSS thread if needed. But if everyone is comfortable
>> with the design and implementation, I think it's ready for a vote as well.
>>
>>
>>
>> Huge thanks to Piotr for getting this ready! I think the format is going
>> to be really useful for both stats and indexes in Iceberg.
>>
>>
>>
>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>> Hi Everyone,
>>
>> I propose that we adopt Puffin file format as a file format for
>> statistics and indexes in Iceberg tables.
>>
>>
>>
>> Puffin file format specification:
>>
>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>
>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>> https://github.com/apache/iceberg-docs/pull/69)
>>
>>
>>
>> Intended use:
>>
>> * statistics in Iceberg tables (see
>> https://github.com/apache/iceberg/pull/4945
>> and associated proposed implementation
>> https://github.com/apache/iceberg/pull/4741)
>>
>> * in the future: storage for secondary indexes
>>
>>
>>
>> Puffin file reader and writer implementation:
>>
>> https://github.com/apache/iceberg/pull/4537
>>
>>
>>
>> Thanks,
>>
>> PF
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Tabular
>>
>>
>>
>>
>> --
>>
>> Best Regards
>>
>>
>>
>>
>> --
>>
>> John Zhuge
>>
>>
