Hi Ajantha,

Thank you for spending the time to look into this.

re a: I think I remember Ryan saying Parquet isn't good for bigger pieces of
data, and some stats sketches or indices can be bigger than others. Also,
Parquet's logical-row / columnar storage format doesn't give as much benefit
for data that is closer to key-value storage.

re b: this is still TBD, see e.g. https://github.com/apache/iceberg/pull/4945
and https://github.com/apache/iceberg/pull/5021

re c, e: for partition-level stats, it's not decided yet how they will be
handled.

re d: yes, ANALYZE can be a separate operation; see
https://github.com/trinodb/trino/pull/12317 for a POC.

Best regards,
PF

On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Thank you Piotr for all of the work you’ve put into this.
>
> I just checked the spec. I have a few newbie questions.
>
> a. Instead of using an existing columnar format like Parquet (one file for
> one type of stats) to store indexes, is there a reason why we have developed
> our own format, and were any benchmarks taken of Puffin vs other formats?
>
> b. How these Puffin files are linked to Iceberg's metadata files is still
> a missing link for me. As the Puffin spec says, these stats are table level
> (updated per snapshot). So, do we need an Iceberg spec change to store the
> file names of these Puffin files so that remove_orphan_files will not
> clean them up accidentally? (also needed for expire_snapshots)
>
> c. NDVs are column level stats. So, I expect the latest Puffin file of
> that snapshot will have one row of stats representing stats for each
> column. But if we are to implement secondary indexes or table level partition
> stats, there can be many rows (millions) in Puffin based on the dataset.
> So, for every commit, do we need to read the previous snapshot's Puffin
> file and write back a new file with updated stats? (The file might be very
> large when data grows.) I think it will affect the commit time. Any
> thoughts on this?
>
> d. Slightly related to the above point, do we plan to support collecting
> the stats asynchronously, like "ANALYZE table", and then modify the table
> metadata with the stats file names? (might need an Iceberg commit to write
> new table metadata)
>
> e. Even though table level partition stats are available from the _partition
> metadata table (along with filter push down support), computing the metadata
> table per query will be expensive.
> Hence, we are looking forward to storing them in the Puffin format. But
> I'm not sure about storing them as a single file with millions of rows.
> I would like to collaborate and discuss more on this.
>
> Thanks,
> Ajantha
>
> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang <miw...@adobe.com.invalid>
> wrote:
>
>> +1 on the format! It looks great!
>>
>> Thanks for materializing the initial design idea.
>>
>> Miao
>>
>> *From: *Kyle Bendickson <kjbendick...@gmail.com>
>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>> statistics and indexes
>>
>> +1 [non-binding]
>>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>> be used in many novel ways due to its well thought out generic design and
>> incorporation of the ability to extend with new sketches.
>>
>> Looking forward to the improvements this will bring.
>>
>> - Kyle
>>
>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <alex...@starburstdata.com>
>> wrote:
>>
>> +1, let's do it!
>>
>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <jzh...@apache.org> wrote:
>>
>> +1 Looking forward to the features it enables.
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>> +1. Looking forward to the partition stats.
>>
>> Best,
>>
>> Yufei
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>> +1 as well. Excited about the progress here.
>>
>> -Dan
>>
>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <chenjunjied...@gmail.com>
>> wrote:
>>
>> +1, really nice! Indexes are coming!
>>
>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>> Had a few questions about what the reference to Puffin file(s) will be in
>> the Iceberg spec, but it's orthogonal to the Puffin file format itself.
>>
>> Thanks,
>>
>> Szehon
>>
>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <b...@tabular.io> wrote:
>>
>> +1 from me!
>>
>> There may also be people that haven't followed the design discussions and
>> we can start a DISCUSS thread if needed. But if everyone is comfortable
>> with the design and implementation, I think it's ready for a vote as well.
>>
>> Huge thanks to Piotr for getting this ready! I think the format is going
>> to be really useful for both stats and indexes in Iceberg.
>>
>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>> Hi Everyone,
>>
>> I propose that we adopt Puffin file format as a file format for
>> statistics and indexes in Iceberg tables.
>>
>> Puffin file format specification:
>>
>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>
>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>> https://github.com/apache/iceberg-docs/pull/69)
>>
>> Intended use:
>>
>> * statistics in Iceberg tables (see
>> https://github.com/apache/iceberg/pull/4945
>> and associated proposed implementation
>> https://github.com/apache/iceberg/pull/4741)
>>
>> * in the future: storage for secondary indexes
>>
>> Puffin file reader and writer implementation:
>>
>> https://github.com/apache/iceberg/pull/4537
>>
>> Thanks,
>>
>> PF
>>
>> --
>>
>> Ryan Blue
>>
>> Tabular
>>
>> --
>>
>> Best Regards
>>
>> --
>>
>> John Zhuge
>>