Re: [DISCUSS] Adopting the v2 spec changes

2021-07-23 Thread Ryan Blue
I'm going to revert the change to NaN tracking that makes that field required. I think we can make other fields required in table metadata.json files and manifests, but that one in the manifest list isn't a good idea. I'll open a PR to update it this weekend and I'll update the distinct counts PR f

Re: [DISCUSS] Adopting the v2 spec changes

2021-07-23 Thread Anton Okolnychyi
For the last month, I’ve been actively working on using the v2 spec in Spark. Specifically, my focus is to implement merge-on-read using the proposed API in Spark [1]. That’s why I support adopting v2: the current design is sufficient to implement the use cases under consideration. I expe

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Blue
The motivation is that some query engines want to at least estimate a min/max range for distinct value counts. Even if these estimates are imperfect, they are better than no information. On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi wrote: > I am OK returning the metric back as long as it is base

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Anton Okolnychyi
I am OK with bringing the metric back as long as it is computed while writing data and is an approximation (to avoid excessive performance and space overhead on write). It seems the biggest problem is that a per-file metric is not useful unless we query a single file. That’s why we should have an idea how th
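A minimal sketch of why per-file distinct counts are hard to combine, and of the kind of min/max range an engine could still derive from them. It assumes the per-file counts are exact; the function name `distinct_count_bounds` and its input are hypothetical and not part of any Iceberg API.

```python
from typing import Sequence, Tuple


def distinct_count_bounds(per_file_ndv: Sequence[int]) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the distinct count across all files.

    Hypothetical helper: assumes each entry is an exact distinct count
    for one data file. Plain counts cannot be merged exactly because
    the overlap between files is unknown.
    """
    if not per_file_ndv:
        return (0, 0)
    if any(n < 0 for n in per_file_ndv):
        raise ValueError("distinct counts must be non-negative")
    # Lower bound: every other file repeats values already seen in the
    # file with the most distinct values.
    lower = max(per_file_ndv)
    # Upper bound: no value is shared between any two files.
    upper = sum(per_file_ndv)
    return (lower, upper)


if __name__ == "__main__":
    print(distinct_count_bounds([3, 5, 2]))
```

With a single file the two bounds coincide, which matches the observation that the metric is only precise when one file is queried. With many files the range widens quickly.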

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Blue
Yeah, like Ryan said, we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that has stats that are useful for job planning and pointers to indexes and sketches. As for the sketches you

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Ryan Murray
Hey Piotr, There are a few proposals for secondary indexes floating around [1][2]. The current thinking is that this would be the best place for sketches to live. Best, Ryan [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7 [2] http

Re: [DISCUSS] Distinct count map

2021-07-23 Thread Piotr Findeisen
Hi, A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases, the Trino optimizer wouldn't be able to make educated use of it. Internally, Łukasz and I were talking about sketches