Re: [DISCUSS] Distinct count map

2021-07-27 Thread Piotr Findeisen
Hi, Ryan, i agree on the NDV estimate (approximated number of distinct values) goal. Anton, re your question about Trino optimizer. I don't want to be overly focused on supporting current state to the extent that we make the future extensions impossible. On the other hand, it's worth knowing what

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Blue
The motivation is that some query engines want to at least estimate a min/max range for distinct value counts. Even if these are imperfect, at least it is better than no information. On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi wrote: > I am OK returning the metric back as long as it is base

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Anton Okolnychyi
I am OK returning the metric back as long as it is based on writing data and is an approximation (to avoid too big performance and space overhead on write). It seems the biggest problem is that metric per file is not useful unless we query a single file. That’s why we should have an idea how th

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Blue
Yeah, like Ryan said we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that has stats that are useful for job planning and pointers to indexes and sketches. As for the sketches you

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Murray
Hey Piotr, There are a few proposals around secondary indexes floating around[1][2]. The current thinking is that this would be the best place for sketches to live. Best, Ryan [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7 [2] http

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Piotr Findeisen
Hi, File level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases, Trino optimizer wouldn't be able to make an educated use of that. Internally, Łukasz and I we were talking about sketches

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
I feel it’s better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information. I agree with this, but I’m wondering where the line is for “as much correctness … as possible”. It hadn’t occurred to

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Jack Ye
Yes I think Dan has a good point here that I was trying to get to, the correctness aspect of it is the major reason that led me to consider the upper and lower bound approach, otherwise as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Daniel Weeks
Jack, that's the same thought I had initially but I think we can actually break this down into two separate issues. One is on the scan side which is how do we merge the information that we have and I think that would you're describing is something that we can do even without storing the lower and

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
Jack, I don't think that it is very valuable to keep a range of distinct values for a single file. The count that we store will be close enough for planning purposes. The ranges you suggest to estimate the min and max distinct counts is exactly what I'm getting at: we can estimate the range based o

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
I should be more clear about why I didn't suggest a sketch as an option here. Sketches like HLL that can be merged are too large to go into data file metadata. For sketches, I think the partition-level stats and secondary index structure is the right place for those. The question I want to think th

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Jack Ye
What about instead of distinct count, we introduce min and max possible distinct count? In the best case scenario, min and max equals, and we know exactly how many distinct values there are, and we can directly update the new distinct count. In the worst case, when merging 2 unsorted files, without

Re: [DISCUSS] Distinct count map

2021-07-01 Thread Daniel Weeks
I would agree with including distinct counts. As you point out there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly if the bounds overlap almost entirely, you might be able to assume an even