Speaking for Dremio, I checked and we're not using distinct_counts anywhere. We interact with manifests exclusively through the Iceberg Java API, which, as mentioned, doesn't support this field. I'm in favor of removing it. I didn't even know it existed, as I tend to look at the Java DataFile/ContentFile interfaces when browsing the metadata structure rather than going to the spec 😂

On Mon, Feb 24, 2025 at 3:00 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I can provide some context here. The field is very old and when we
> realized that it was not only unused but also difficult to produce and use
> in practice (can't be combined) we deprecated the field. However, some
> folks from Dremio wanted to bring it back because they said they could
> store values there and had a way to use them.
>
> +1, but it would be good to check in with some Dremio engineers and see if
> they are using it. I assume they aren't since this thread hasn't gotten
> much attention. Thanks for bringing this up!
>
> On Thu, Feb 13, 2025 at 8:02 AM Jacob Marble
> <jacobmar...@firetiger.com.invalid> wrote:
>
>> Xuanwo, do you favor deprecating or removing `distinct_count`?
>>
>> Due to lack of any real implementation, I myself favor removal (PR 12183).
>>
>> Jacob Marble
>> 🔥🐅
>>
>>
>> On Tue, Feb 11, 2025 at 10:25 PM Xuanwo <xua...@apache.org> wrote:
>>
>>> Here is my +1 binding.
>>>
>>> The current status of `distinct_count` is quite confusing, which has
>>> also led to additional discussions in `iceberg-rust` about whether we need
>>> to add it and how to maintain it.
>>>
>>> Removing it seems reasonable to me, as there are no known use cases for
>>> `distinct_count` in a single data file.
>>>
>>> On Tue, Feb 11, 2025, at 23:05, Fokko Driesprong wrote:
>>>
>>> My mistake, I suggested sending out an email with a quick vote on the
>>> PR. I like the suggestion to use this thread for discussion since the
>>> number of options is limited.
>>>
>>> I'm in favor of deprecating the field, to avoid that we re-use the
>>> field-id in the future.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Tue, Feb 11, 2025 at 05:46, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>
>>> Hi Jacob,
>>>
>>> Thanks for initiating the vote.
>>> Typically, we would first have a DISCUSSION thread to reach a consensus
>>> on the preferred option and then follow it up with a VOTE thread for
>>> confirmation.
>>>
>>> Maybe we can take this as a DISCUSSION thread?
>>>
>>> Best,
>>> Manu
>>>
>>>
>>> On Tue, Feb 11, 2025 at 7:20 AM Jacob Marble
>>> <jacobmar...@firetiger.com.invalid> wrote:
>>>
>>> This vote will be open for at least 72 hours.
>>>
>>> I propose that distinct_counts be either deprecated (#12182
>>> <https://github.com/apache/iceberg/pull/12182>) or removed (#12183
>>> <https://github.com/apache/iceberg/pull/12183>) from the spec.
>>>
>>> According to #767 <https://github.com/apache/iceberg/issues/767>
>>> data_file.distinct_counts was deprecated about four years ago. Furthermore,
>>> it is not implemented in the canonical Java and Python implementations.
>>>
>>> Please share your thoughts, and vote one of the following:
>>> - remove
>>> - deprecate
>>> - no-op
>>>
>>> Jacob Marble
>>> 🔥🐅
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/