The motivation is that some query engines want to at least estimate a min/max range for distinct value counts. Even if these bounds are imperfect, they are better than no information.
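To illustrate what such a range could look like, here is a minimal sketch of the bound-merging rule Jack describes further down the thread: when two files are combined without a sort, the merged upper bound is the sum of the per-file upper bounds, and the merged lower bound is the larger of the per-file lower bounds. The DistinctCountRange helper below is hypothetical, not an existing Iceberg class.

// Hypothetical helper, not an Iceberg API: tracks a [min, max] range for a file's distinct value count.
public final class DistinctCountRange {
  private final long min; // lowest possible number of distinct values
  private final long max; // highest possible number of distinct values

  public DistinctCountRange(long min, long max) {
    if (min < 0 || max < min) {
      throw new IllegalArgumentException("invalid bounds: " + min + ".." + max);
    }
    this.min = min;
    this.max = max;
  }

  // A file whose exact distinct count is known, e.g. computed while writing sorted, deduplicated data.
  public static DistinctCountRange exact(long distinctCount) {
    return new DistinctCountRange(distinctCount, distinctCount);
  }

  // Merging two files without sorting: at least the larger per-file count survives,
  // and at most no values are shared between the files, so the upper bounds add up.
  public DistinctCountRange mergeUnsorted(DistinctCountRange other) {
    return new DistinctCountRange(Math.max(this.min, other.min), this.max + other.max);
  }

  public long min() {
    return min;
  }

  public long max() {
    return max;
  }
}

For example, DistinctCountRange.exact(100).mergeUnsorted(DistinctCountRange.exact(40)) gives the range [100, 140]; rewriting the merged file with a sort and recomputing the count collapses the range back to a single value.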
On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> I am OK returning the metric back as long as it is based on writing data and is an approximation (to avoid too big a performance and space overhead on write).
>
> It seems the biggest problem is that a per-file metric is not useful unless we query a single file. That's why we should have an idea of how this per-file metric is going to be used by query engines.
>
> Piotr, what kind of information will be useful for the Trino optimizer? Is it per split or per partition?
>
> Also a question to Iceberg folks: do we expect to use the per-file metric to build partition-level numbers? Or will it be a separate process? If query engines would rather benefit from partition-level stats and file-level metrics will not help in building partition-level stats, I am not sure adding the per-file column back will give us much.
>
> - Anton
>
> On 23 Jul 2021, at 08:58, Ryan Blue <b...@tabular.io> wrote:
>
> Yeah, like Ryan said, we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that has stats that are useful for job planning and pointers to indexes and sketches.
>
> As for the sketches you suggest, I was thinking more about using Theta sketches instead because they support the set intersection that would be helpful for joins.
>
> On Fri, Jul 23, 2021 at 2:35 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Piotr,
>>
>> There are a few proposals around secondary indexes floating around [1][2]. The current thinking is that this would be the best place for sketches to live.
>>
>> Best,
>> Ryan
>>
>> [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
>> [2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit
>>
>> On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>
>>> Hi,
>>>
>>> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>>>
>>> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them mentioned here too. Do you have any design plans for that already? Did you consider making them part of file metadata?
>>>
>>> Of course, for this to be useful, we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>>>
>>> Best,
>>> PF
>>>
>>> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I feel it’s better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>
>>>> I agree with this, but I’m wondering where the line is for “as much correctness … as possible”.
>>>>
>>>> It hadn’t occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require that the distinct count be based on the data and not on a calculation from other distinct counts.
>>>>
>>>> But I think it is probably important that these can be set from sketches that don’t require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>>>
>>>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
>>>>
>>>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper- and lower-bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count, so that it can be omitted if we cannot determine the true value. Or it could be defined directly as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>>>
>>>>> -Jack Ye
>>>>>
>>>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>>>
>>>>>> One is on the scan side: how do we merge the information that we have? I think what you're describing is something we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>>>
>>>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>>>
>>>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>>>
>>>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel it's better to require that they be dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> What about, instead of a distinct count, we introduce min and max possible distinct counts? In the best case, min and max are equal, we know exactly how many distinct values there are, and we can directly update the new distinct count.
>>>>>>> In the worst case, when merging two unsorted files, without the need for complicated estimation, we know the max for file1 and file2 combined is (max_file1 + max_file2) and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less and less useful information to planning. But when we perform a sort for the rows in the files, we can update the min and max to the same value, reducing this gap and improving planning.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>
>>>>>>>> I would agree with including distinct counts.
>>>>>>>>
>>>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the deltas between lower and upper bounds are narrow overall, you might even be able to choose the max value (at least for whole numbers).
>>>>>>>>
>>>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>>>
>>>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column), and in others you'd want to use the max across files. I thought that the idea of having counts was misguided, so we removed the column.
>>>>>>>>>
>>>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it, because even distinct counts that are hard to work with are better than nothing.
>>>>>>>>>
>>>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>>>
>>>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>
> --
> Ryan Blue
> Tabular

--
Ryan Blue
Tabular
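For reference, the bounds-overlap heuristic discussed in the thread above (non-overlapping column bounds like 0-10, 11-20, 21-30 suggest the column is sorted across files, so per-file distinct counts can be summed; overlapping bounds call for a more conservative combination) could look roughly like the sketch below. The FileColumnStats record and the estimateDistinct helper are hypothetical and simplified to long-typed bounds; they are not part of Iceberg or any engine.

import java.util.List;

// Hypothetical per-file column stats, roughly what a planner could read from file metadata.
record FileColumnStats(long lowerBound, long upperBound, long distinctCount) {}

final class NdvEstimate {
  private NdvEstimate() {}

  // Rough cross-file estimate: if no two files' value ranges overlap, the column looks sorted
  // across files and the per-file counts can be summed exactly. Otherwise the true count lies
  // somewhere between the largest per-file count and the sum; the midpoint used here is an
  // arbitrary placeholder that an engine would refine with more information.
  static long estimateDistinct(List<FileColumnStats> files) {
    boolean overlapping = false;
    for (int i = 0; i < files.size() && !overlapping; i++) {
      for (int j = i + 1; j < files.size(); j++) {
        FileColumnStats a = files.get(i);
        FileColumnStats b = files.get(j);
        if (a.lowerBound() <= b.upperBound() && b.lowerBound() <= a.upperBound()) {
          overlapping = true;
          break;
        }
      }
    }

    long sum = files.stream().mapToLong(FileColumnStats::distinctCount).sum();
    if (!overlapping) {
      return sum; // disjoint value ranges: a distinct value cannot appear in more than one file
    }
    long largest = files.stream().mapToLong(FileColumnStats::distinctCount).max().orElse(0);
    return (largest + sum) / 2;
  }
}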