I'm going to revert the change to NaN tracking that makes that field required. I think we can make other fields required in table metadata.json files and manifests, but that one in the manifest list isn't a good idea. I'll open a PR to update it this weekend and I'll update the distinct counts PR from the discussion there. After that, I think we'll be ready to go.
On Fri, Jul 23, 2021 at 4:29 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote: > For the last month, I’ve been actively working on using the v2 spec in > Spark. Specifically, my focus is to implement merge-on-read using the > proposed API in Spark [1]. That’s why I would support the idea of adopting > v2 as the current design is sufficient to implement considered use cases. I > expect to find bugs and hit performance issues but I think we will be able > to solve them without breaking the forward compatibility. > > I’ll need a bit more time to dig into the NaN counts issue that Dan > mentioned but I overall support the idea of adopting the v2 spec without > making it default (until we prove its correctness and performance). > > - Anton > > [1] - https://github.com/apache/spark/pull/33008 > > On 19 Jul 2021, at 17:01, Ryan Blue <b...@tabular.io> wrote: > > I'll reply inline: > > On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org> wrote: > >> Overall, I'm in favor of what Ryan has proposed for v2 above. There are >> a few points that I'm not entirely clear on, which may warrant discussion: >> >> 1) The discussion on resurfacing distinct_count may be something we want >> to include since there is some value. I think it warrants keeping and >> dropping the deprecation (see distinct count discussion in dev list). This >> actually isn't a change proposal to v2, but rather signifying that it will >> be carried forward. >> > > I don't think that this affects v2 because it is a non-breaking change to > forward-compatibility. I think of this as adding it back to v1 because we > can add optional fields without any problem. Older writers won't produce > it, but they never did before it was deprecated anyway. Older readers will > simply ignore it when reading files. > > >> 2) NaN counts would now be required for field_summary and I'm not clear >> on what that implies for tables being promoted from v1 to v2 since missing >> this information would require evaluating all of the partition values and >> rewriting all of the manifest lists where this info is missing (it's still >> optional for data files). I wasn't able to find the discussion on making >> this required, but I assume it is so that we can accurately apply filters >> at this level. Is this case handled? >> > > You might be right about not making this required. > > The intent was to change just the writer requirements. While that isn't > strictly necessary for some fields, it helps ensure that writers follow the > format as it evolves. But the problem is that the caller may not have this > information (whether any partition tuple contained NaN for that field) when > writing a new manifest list file because the old (v1) manifest list file > didn't include it. That would be known if the manifest was rewritten, but > may not be if the manifest is carried forward into a new manifest list. > > Making this required would mean that we need to set it to something when > writing. Setting it to `true` would be fine because if the field is > missing, we assume there may be a NaN value in a partition tuple and scan > the manifest when looking for NaNs. But it does seem strange to require > that default so I'd support removing this. > > I should also mention that even though this is optional for data files, we > should always know when writing a manifest. That's because this is about > the partition tuple, which is always present when writing a manifest. > > >> 3) If we're signaling the formal adoption of v2 by changing the default >> table created, do we want to do that in the next release (0.12 assuming the >> discussion/vote closes in time) or should we separate that with a second >> release (0.13 or 1.0) to separate all of the other changes going into the >> next release from the official format adoption? >> > > I don't think that we are signalling formal adoption by changing the > default. We are adopting by voting and will change the default when it > makes sense to do that based on real-world use. I think that adopting the > proposed v2 format means that we're confident that although there may be > implementation bugs, we are ready to support the new v2 format as we do v1. > > The Java release and v2 format adoption are somewhat orthogonal. We could > release 0.12.0 and when v2 is released later state that they are > compatible. I think the only question is whether we find anything that we > need to change in 0.12.0 in order to use v2 tables. That's why I think it > is good to adopt the spec without changing the default. People can opt into > using the new tables for the new use cases and we can get some real-world > use before changing the default. The main consequence of adoption is that > we commit to supporting tables written for the v2 spec (as opposed to > changing v2 in incompatible ways) and we start to guarantee > forward-compatibility (as opposed to making more optional -> required > changes). > > Ryan > > -- > Ryan Blue > Tabular > > > -- Ryan Blue Tabular