Yeah, like Ryan said, we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that holds stats useful for job planning, along with pointers to indexes and sketches.

As for the sketches you suggest, I was thinking more about using Theta sketches instead because they support set intersection, which would be helpful for joins.
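To make the join case a bit more concrete, here is a minimal sketch of how an engine could combine two per-partition Theta sketches of a join key using the Apache DataSketches Java library. The ThetaJoinEstimate class and the synthetic data are made up for illustration, and method names can vary a little between datasketches-java releases, so treat this as a sketch of the idea rather than a concrete proposal:

    import org.apache.datasketches.theta.CompactSketch;
    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.UpdateSketch;

    public class ThetaJoinEstimate {
      public static void main(String[] args) {
        // Sketch the join key on each side; in practice these would be
        // loaded from partition-level metadata instead of rebuilt here.
        UpdateSketch left = UpdateSketch.builder().build();
        UpdateSketch right = UpdateSketch.builder().build();
        for (long key = 0; key < 1_000_000; key++) {
          left.update(key);
          if (key % 10 == 0) {
            right.update(key);  // right side only has every 10th key
          }
        }

        // Theta sketches support set intersection, so we can estimate how
        // many join keys the two sides share, which HLL alone does not
        // support directly.
        Intersection intersection = SetOperation.builder().buildIntersection();
        intersection.intersect(left);
        intersection.intersect(right);
        CompactSketch common = intersection.getResult();

        System.out.printf("left ~%.0f, right ~%.0f, common ~%.0f distinct keys%n",
            left.getEstimate(), right.getEstimate(), common.getEstimate());
      }
    }

Intersection also works on compact, serialized sketches, which is why the well-defined hash function and portable representation that Piotr mentions below would matter.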
On Fri, Jul 23, 2021 at 2:35 AM Ryan Murray <rym...@gmail.com> wrote:

> Hey Piotr,
>
> There are a few proposals around secondary indexes floating around[1][2]. The current thinking is that this would be the best place for sketches to live.
>
> Best,
> Ryan
>
> [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
> [2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit
>
> On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>
>> Hi,
>>
>> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>>
>> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them mentioned here too. Do you have any design plans for that already? Did you consider making them part of the file metadata?
>>
>> Of course, for this to be useful we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>>
>> Best,
>> PF
>>
>> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I feel it’s better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>
>>> I agree with this, but I’m wondering where the line is for “as much correctness … as possible”.
>>>
>>> It hadn’t occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require the distinct count to be based on the data and not on a calculation from other distinct counts.
>>>
>>> But I think it is probably important that these can be set from sketches that don’t require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>>
>>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
>>>
>>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper and lower bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count, such that if we cannot determine the true value it can be omitted. Or it could directly be defined as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>>
>>>> -Jack Ye
>>>>
>>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>>
>>>>> One is on the scan side: how do we merge the information that we have? I think what you're describing is something we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>>
>>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>>
>>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>>
>>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel like it's better to require that they be dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> What about, instead of a distinct count, we introduce a min and max possible distinct count? In the best case, min and max are equal, we know exactly how many distinct values there are, and we can directly update the new distinct count. In the worst case, when merging two unsorted files, without the need for complicated estimation, we know the max for file1 and file2 is (max_file1 + max_file2) and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less useful information to planning. But when we sort the rows in the files, we can set the min and max to the same value, closing the gap and improving planning.
>>>>>>
>>>>>> -Jack Ye
>>>>>>
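Spelling out the bound merging Jack describes above, here is a minimal sketch; the NdvBounds class and its field names are made up purely for illustration and are not part of any proposal:

    // Illustrative only: hypothetical per-file bounds on the number of
    // distinct values (NDV) of one column, merged the way Jack describes.
    final class NdvBounds {
      final long ndvMin;  // fewest distinct values the file can contain
      final long ndvMax;  // most distinct values the file can contain

      NdvBounds(long ndvMin, long ndvMax) {
        this.ndvMin = ndvMin;
        this.ndvMax = ndvMax;
      }

      // Merging two unsorted files: their keys could overlap completely
      // (keep the larger min) or not at all (sum the maxes), so the merged
      // bounds are the widest possibility and remain correct.
      static NdvBounds merge(NdvBounds a, NdvBounds b) {
        return new NdvBounds(Math.max(a.ndvMin, b.ndvMin), a.ndvMax + b.ndvMax);
      }

      public static void main(String[] args) {
        NdvBounds merged = merge(new NdvBounds(80, 100), new NdvBounds(50, 60));
        System.out.println(merged.ndvMin + " - " + merged.ndvMax);  // prints 80 - 160
      }
    }

As Jack says, a sort-based rewrite can then count distinct values exactly and collapse min and max back to a single value.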
>>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>
>>>>>>> I would agree with including distinct counts.
>>>>>>>
>>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the delta between the lower and upper bounds overall is narrow, you might even be able to choose the max value (at least for whole numbers).
>>>>>>>
>>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>>
>>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column) and in others you'd want to use the max across files. I thought that the idea of having counts was misguided, so we removed the column.
>>>>>>>>
>>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it because even distinct counts that are hard to work with are better than nothing.
>>>>>>>>
>>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>>
>>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>
>>> --
>>> Ryan Blue
>>> Tabular

--
Ryan Blue
Tabular