I'll reply inline: On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org> wrote:
> Overall, I'm in favor of what Ryan has proposed for v2 above. There are a > few points that I'm not entirely clear on, which may warrant discussion: > > 1) The discussion on resurfacing distinct_count may be something we want > to include since there is some value. I think it warrants keeping and > dropping the deprecation (see distinct count discussion in dev list). This > actually isn't a change proposal to v2, but rather signifying that it will > be carried forward. > I don't think that this affects v2 because it is a non-breaking change to forward-compatibility. I think of this as adding it back to v1 because we can add optional fields without any problem. Older writers won't produce it, but they never did before it was deprecated anyway. Older readers will simply ignore it when reading files. > 2) NaN counts would now be required for field_summary and I'm not clear > on what that implies for tables being promoted from v1 to v2 since missing > this information would require evaluating all of the partition values and > rewriting all of the manifest lists where this info is missing (it's still > optional for data files). I wasn't able to find the discussion on making > this required, but I assume it is so that we can accurately apply filters > at this level. Is this case handled? > You might be right about not making this required. The intent was to change just the writer requirements. While that isn't strictly necessary for some fields, it helps ensure that writers follow the format as it evolves. But the problem is that the caller may not have this information (whether any partition tuple contained NaN for that field) when writing a new manifest list file because the old (v1) manifest list file didn't include it. That would be known if the manifest was rewritten, but may not be if the manifest is carried forward into a new manifest list. Making this required would mean that we need to set it to something when writing. Setting it to `true` would be fine because if the field is missing, we assume there may be a NaN value in a partition tuple and scan the manifest when looking for NaNs. But it does seem strange to require that default so I'd support removing this. I should also mention that even though this is optional for data files, we should always know when writing a manifest. That's because this is about the partition tuple, which is always present when writing a manifest. > 3) If we're signaling the formal adoption of v2 by changing the default > table created, do we want to do that in the next release (0.12 assuming the > discussion/vote closes in time) or should we separate that with a second > release (0.13 or 1.0) to separate all of the other changes going into the > next release from the official format adoption? > I don't think that we are signalling formal adoption by changing the default. We are adopting by voting and will change the default when it makes sense to do that based on real-world use. I think that adopting the proposed v2 format means that we're confident that although there may be implementation bugs, we are ready to support the new v2 format as we do v1. The Java release and v2 format adoption are somewhat orthogonal. We could release 0.12.0 and when v2 is released later state that they are compatible. I think the only question is whether we find anything that we need to change in 0.12.0 in order to use v2 tables. That's why I think it is good to adopt the spec without changing the default. People can opt into using the new tables for the new use cases and we can get some real-world use before changing the default. The main consequence of adoption is that we commit to supporting tables written for the v2 spec (as opposed to changing v2 in incompatible ways) and we start to guarantee forward-compatibility (as opposed to making more optional -> required changes). Ryan -- Ryan Blue Tabular