Hey Piotr,

There are a few proposals for secondary indexes floating around [1][2]. The current thinking is that this would be the best place for sketches to live.
Best,
Ryan

[1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
[2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit

On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

> Hi,
>
> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>
> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them being mentioned here too. Do you have any design plans for that already? Did you consider making them part of file metadata?
>
> Of course, for this to be useful we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>
> Best,
> PF
>
> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>
>> I agree with this, but I'm wondering where the line is for "as much correctness … as possible".
>>
>> It hadn't occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require that the distinct count be based on the data and not on a calculation from other distinct counts.
>>
>> But I think it is probably important that these can be set from sketches that don't require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>
>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
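A minimal sketch of what that per-file sketch idea could look like, assuming the Apache DataSketches HLL library; the thread does not settle on a library, hash function, or serialization format, so all names here are illustrative. A writer builds a sketch while writing a data file, stores the compact bytes alongside the file metadata, and engines merge the sketches at plan time instead of merging plain counts.

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class NdvSketchExample {
  // Build a per-file sketch while writing a data file; no raw values are retained.
  static byte[] sketchForFile(Iterable<String> columnValues) {
    HllSketch sketch = new HllSketch(12); // lgK = 12, roughly 1-2% relative error
    for (String value : columnValues) {
      sketch.update(value);
    }
    return sketch.toCompactByteArray(); // portable bytes that could be stored as file metadata
  }

  // Merge the per-file sketches at plan time; unlike plain counts, the merged estimate stays valid.
  static double mergedDistinctEstimate(Iterable<byte[]> serializedSketches) {
    Union union = new Union(12);
    for (byte[] bytes : serializedSketches) {
      union.update(HllSketch.heapify(bytes));
    }
    return union.getResult().getEstimate();
  }
}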
>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper and lower bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count such that it can be omitted when we cannot determine the true value. Or it could directly be defined as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>
>>> -Jack Ye
>>>
>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>
>>>> One is the scan side, which is how we merge the information that we have, and I think that what you're describing is something we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>
>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>
>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>
>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel like it's better to require that they're dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> What about introducing a min and max possible distinct count instead of a single distinct count? In the best case, min and max are equal, we know exactly how many distinct values there are, and we can directly update the new distinct count. In the worst case, when merging 2 unsorted files, without the need for complicated estimation we know the max for file1 and file2 is (max_file1 + max_file2), and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less useful information for planning. But when we sort the rows in the files, we can update the min and max to the same value, closing the gap and improving planning.
>>>>>
>>>>> -Jack Ye
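A small worked example of the bound merge Jack describes above; the class and field names are hypothetical, only for illustration:

// Distinct-count bounds for one file. Merging two unsorted files keeps the bound correct
// without any estimation: the merged set has at least as many distinct values as the
// larger of the two minimums, and no more than the sum of the two maximums.
class DistinctBound {
  final long min; // at least this many distinct values
  final long max; // at most this many distinct values

  DistinctBound(long min, long max) {
    this.min = min;
    this.max = max;
  }

  static DistinctBound mergeUnsorted(DistinctBound file1, DistinctBound file2) {
    long mergedMin = Math.max(file1.min, file2.min); // min is max(min_file1, min_file2)
    long mergedMax = file1.max + file2.max;          // max is max_file1 + max_file2
    return new DistinctBound(mergedMin, mergedMax);
  }
}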
>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> I would agree with including distinct counts.
>>>>>>
>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the delta between the lower and upper bounds overall is narrow, you might even be able to choose the max value (at least for whole numbers).
>>>>>>
>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>
>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column), and in others you'd want to use the max across files. I thought the idea of having counts was misguided, so we removed the column.
>>>>>>>
>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it, because even distinct counts that are hard to work with are better than nothing.
>>>>>>>
>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>
>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
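For reference, a rough sketch of the bound-overlap heuristic Ryan describes in the original message, with hypothetical stats fields: a planner could add per-file counts when the files' value ranges are disjoint (suggesting a sorted column) and fall back to a conservative estimate, such as the max across files or Dan's averaging idea, when they overlap.

import java.util.Comparator;
import java.util.List;

// Hypothetical per-file stats an engine would read from file metadata.
class FileStats {
  final long lowerBound;    // column lower bound
  final long upperBound;    // column upper bound
  final long distinctCount; // per-file distinct value count

  FileStats(long lowerBound, long upperBound, long distinctCount) {
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
    this.distinctCount = distinctCount;
  }
}

class DistinctCountEstimator {
  static long estimate(List<FileStats> files) {
    files.sort(Comparator.comparingLong((FileStats f) -> f.lowerBound));
    boolean disjoint = true;
    for (int i = 1; i < files.size(); i++) {
      if (files.get(i).lowerBound <= files.get(i - 1).upperBound) {
        disjoint = false; // value ranges overlap, so files may share distinct values
        break;
      }
    }
    if (disjoint) {
      // Non-overlapping bounds (e.g. 0-10, 11-20, 21-30) suggest the column is sorted
      // across files, so the per-file counts can simply be added.
      return files.stream().mapToLong(f -> f.distinctCount).sum();
    }
    // Overlapping bounds: fall back to a conservative estimate, e.g. the max across files.
    return files.stream().mapToLong(f -> f.distinctCount).max().orElse(0L);
  }
}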