Re: [DISCUSS] Distinct count map

2021-07-02 Thread Jack Ye
Instead of a single distinct count, what about introducing min and max possible distinct counts? In the best case, min and max are equal, so we know exactly how many distinct values there are and can directly update the new distinct count. In the worst case, when merging 2 unsorted files, without
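
To make the min/max idea concrete, here is a minimal sketch of how two per-file bounds might be combined. It is not taken from the thread or from any existing API: `NdvBounds`, `merge_ndv_bounds`, and the `ranges_disjoint` flag are hypothetical names, and the sketch assumes the caller can tell from the column lower/upper bounds whether the two files' value ranges overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NdvBounds:
    """Hypothetical per-file bounds on a column's number of distinct values."""
    lo: int  # smallest possible distinct count
    hi: int  # largest possible distinct count


def merge_ndv_bounds(a: NdvBounds, b: NdvBounds, ranges_disjoint: bool = False) -> NdvBounds:
    """Combine the bounds of two files into bounds for their union.

    ranges_disjoint is an assumption supplied by the caller, e.g. derived
    from non-overlapping column lower/upper bounds.
    """
    if ranges_disjoint:
        # No value can appear in both files, so the counts simply add up.
        return NdvBounds(a.lo + b.lo, a.hi + b.hi)
    # Values may overlap: the union has at least as many distinct values as
    # the larger input and at most the sum of both inputs.
    return NdvBounds(max(a.lo, b.lo), a.hi + b.hi)


exact_a = NdvBounds(10, 10)
exact_b = NdvBounds(4, 4)
overlapping = merge_ndv_bounds(exact_a, exact_b)
disjoint = merge_ndv_bounds(exact_a, exact_b, ranges_disjoint=True)
```

With two exact inputs, the disjoint case stays exact (min equals max), while the overlapping case widens into a range.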

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
I should be clearer about why I didn't suggest a sketch as an option here. Mergeable sketches like HLL are too large to go into data file metadata. For sketches, I think the partition-level stats and secondary index structures are the right place. The question I want to think th

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
Jack, I don't think that it is very valuable to keep a range of distinct values for a single file. The count that we store will be close enough for planning purposes. The ranges you suggest for estimating the min and max distinct counts are exactly what I'm getting at: we can estimate the range based o

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Daniel Weeks
Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues. One is on the scan side, which is how we merge the information that we have, and I think that what you're describing is something that we can do even without storing the lower and

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Jack Ye
Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper and lower bound approach. Otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at

Re: rowGroup:File = 1:1

2021-07-02 Thread Daniel Weeks
Hey Sreeram, I feel like some of your points about why there are row groups are valid, but there are some really good reasons why you might want to have multiple row groups in a file (and I can share some situations where it can be valuable). If you think historically about how distributed proces

Re: rowGroup:File = 1:1

2021-07-02 Thread Ryan Blue
Dan's email gives a lot of great background on why row groups exist and how they are still useful. I'd add that there are a few considerations for choosing a row group size that can affect the choice of how many row groups to target in a file. The first consideration is what Dan pointed out: larger

Re: [DISCUSS] Distinct count map

2021-07-02 Thread Ryan Blue
> I feel it’s better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
I agree with this, but I’m wondering where the line is for “as much correctness … as possible”. It hadn’t occurred to