Re: [VOTE] Update partition stats spec for V3

2025-02-04 Thread Jean-Baptiste Onofré
+1 (non-binding) Regards JB On Sat, Feb 1, 2025 at 3:01 AM Anton Okolnychyi wrote: > > Hi all, > > I propose the following updates to our partition stats spec in V3: > > - Modify `position_delete_record_count` to include a sum of position deletes > across position delete files and DVs > - Keep

Re: [VOTE] Update partition stats spec for V3

2025-02-04 Thread xianjin
+1, the spec change makes sense. > Make delete counts required to avoid ambiguity w.r.t NULL vs unknown. If we want to make this change, I think we need to unlink all the partition stats files in old snapshots (if they were already calculated with optional delete counts) when upgrading to V3 table fr

Re: FileRewrite API refactor

2025-02-04 Thread Steven Wu
At a high level, it makes sense to separate out the planning and execution to promote reusing the planning code across engines. Just to add a 4th class to Russell's list 1) RewriteGroup: A Container that holds all the files that are meant to be compacted along with information about them 2) Rewriter:

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
Thanks All for the reactions. I wanted to emphasize that Hive's StatsObject was shared as an example with the suggestion to adapt it for Iceberg - `PartitionColumnStats` (i.e. use column ids and drop name/type, etc). As was mentioned by Ryan, column upper/lower bounds, counts, null value and

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Piotr Findeisen
Thanks Denys for starting this discussion! Thanks Ryan, I agree it would be better to have engine agnostic data structures in the Blobs we maintain in the Iceberg project. At least for the "standard blob types". Note however that the Puffin format is intentionally open-ended. An application can put a

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread rdb...@gmail.com
Thanks for proposing this. My main concern is that this doesn't seem to be aimed at standardizing this metadata, but rather a way to pass existing Hive structures in a different way. I commented on the PR, but I'll carry it over here for this discussion. Iceberg already supports tracking column l

Re: [VOTE] Update partition stats spec for V3

2025-02-04 Thread rdb...@gmail.com
+1 On Tue, Feb 4, 2025 at 12:46 AM Honah J. wrote: > +1 > > On Mon, Feb 3, 2025 at 11:42 PM Ajantha Bhat > wrote: > >> +1 >> >> On Tue, Feb 4, 2025 at 11:30 AM Eduard Tudenhöfner < >> etudenhoef...@apache.org> wrote: >> >>> +1 >>> >>> On Mon, Feb 3, 2025 at 8:33 PM Dongjoon Hyun >>> wrote: >>>

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
Hi Gabor, Thanks for your feedback! > In that use case however, we'd lose the stats we got previously from HMS For Iceberg tables Hive computes and stores the same stats object in a puffin file, previously persisted to HMS. So, there shouldn't be any changes for Impala other than changing the

Re: Very strange (AI generated) issues

2025-02-04 Thread Jarek Potiuk
Here is the article in The New Stack: https://thenewstack.io/ai-is-spamming-open-source-repos-with-fake-issues/ On Fri, Jan 31, 2025 at 12:39 PM Jarek Potiuk wrote: > Hey, > > I am at FOSDEM now but I have some progress and more information about the > whole stuff: > > Some facts first (without a

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Gabor Kaszab
Hi Denys, Thanks for raising this! I think extending the Puffin spec with additional column stats would make sense. I saw the PR for the Puffin spec at some point late last year and I also had it in my plans to revive it in a way. My motivation is that Impala currently uses a lot of stats from H

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
There is an option to standardize Hive's ColStatistics object schema and use Iceberg: class ColStatistics { static class Range { Number minValue; Number maxValue; } String colName; String colType; long countDistinct; long numNulls; double avgColLen; long numTrues; lo

Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
sorry, valid Doc PR link: https://github.com/apache/iceberg-docs/pull/269

[DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
Hi Everyone, We'd like to discuss an extension to the supported blob types in puffin spec. Hive-4 uses statistics auto-generation to optimize Iceberg query performance. Column statistics are written to puffin files per snapshot. The statistics calculated by Hive include histograms, NDV (Number o

Re: [VOTE] Update partition stats spec for V3

2025-02-04 Thread Honah J.
+1 On Mon, Feb 3, 2025 at 11:42 PM Ajantha Bhat wrote: > +1 > > On Tue, Feb 4, 2025 at 11:30 AM Eduard Tudenhöfner < > etudenhoef...@apache.org> wrote: > >> +1 >> >> On Mon, Feb 3, 2025 at 8:33 PM Dongjoon Hyun wrote: >> >>> +1 for the proposal. >>> >>> Dongjoon >>> >>> > On Mon, Feb 3, 2025 at