Re: [DISCUSS] Adopting the v2 spec changes

Ryan Blue Fri, 23 Jul 2021 16:37:41 -0700

I'm going to revert the change to NaN tracking that makes that field
required. I think we can make other fields required in table metadata.json
files and manifests, but that one in the manifest list isn't a good idea.
I'll open a PR to update it this weekend and I'll update the distinct
counts PR from the discussion there. After that, I think we'll be ready to
go.


On Fri, Jul 23, 2021 at 4:29 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> For the last month, I’ve been actively working on using the v2 spec in
> Spark. Specifically, my focus is to implement merge-on-read using the
> proposed API in Spark [1]. That’s why I would support the idea of adopting
> v2 as the current design is sufficient to implement considered use cases. I
> expect to find bugs and hit performance issues but I think we will be able
> to solve them without breaking the forward compatibility.
>
> I’ll need a bit more time to dig into the NaN counts issue that Dan
> mentioned but I overall support the idea of adopting the v2 spec without
> making it default (until we prove its correctness and performance).
>
> - Anton
>
> [1] - https://github.com/apache/spark/pull/33008
>
> On 19 Jul 2021, at 17:01, Ryan Blue <b...@tabular.io> wrote:
>
> I'll reply inline:
>
> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are
>> a few points that I'm not entirely clear on, which may warrant discussion:
>>
>> 1)  The discussion on resurfacing distinct_count may be something we want
>> to include since there is some value.  I think it warrants keeping and
>> dropping the deprecation (see distinct count discussion in dev list).  This
>> actually isn't a change proposal to v2, but rather signifying that it will
>> be carried forward.
>>
>
> I don't think that this affects v2 because it is a non-breaking change to
> forward-compatibility. I think of this as adding it back to v1 because we
> can add optional fields without any problem. Older writers won't produce
> it, but they never did before it was deprecated anyway. Older readers will
> simply ignore it when reading files.
>
>
>> 2)  NaN counts would now be required for field_summary and I'm not clear
>> on what that implies for tables being promoted from v1 to v2 since missing
>> this information would require evaluating all of the partition values and
>> rewriting all of the manifest lists where this info is missing (it's still
>> optional for data files).  I wasn't able to find the discussion on making
>> this required, but I assume it is so that we can accurately apply filters
>> at this level.  Is this case handled?
>>
>
> You might be right about not making this required.
>
> The intent was to change just the writer requirements. While that isn't
> strictly necessary for some fields, it helps ensure that writers follow the
> format as it evolves. But the problem is that the caller may not have this
> information (whether any partition tuple contained NaN for that field) when
> writing a new manifest list file because the old (v1) manifest list file
> didn't include it. That would be known if the manifest was rewritten, but
> may not be if the manifest is carried forward into a new manifest list.
>
> Making this required would mean that we need to set it to something when
> writing. Setting it to `true` would be fine because if the field is
> missing, we assume there may be a NaN value in a partition tuple and scan
> the manifest when looking for NaNs. But it does seem strange to require
> that default so I'd support removing this.
>
> I should also mention that even though this is optional for data files, we
> should always know when writing a manifest. That's because this is about
> the partition tuple, which is always present when writing a manifest.
>
>
>> 3)  If we're signaling the formal adoption of v2 by changing the default
>> table created, do we want to do that in the next release (0.12 assuming the
>> discussion/vote closes in time) or should we separate that with a second
>> release (0.13 or 1.0) to separate all of the other changes going into the
>> next release from the official format adoption?
>>
>
> I don't think that we are signalling formal adoption by changing the
> default. We are adopting by voting and will change the default when it
> makes sense to do that based on real-world use. I think that adopting the
> proposed v2 format means that we're confident that although there may be
> implementation bugs, we are ready to support the new v2 format as we do v1.
>
> The Java release and v2 format adoption are somewhat orthogonal. We could
> release 0.12.0 and when v2 is released later state that they are
> compatible. I think the only question is whether we find anything that we
> need to change in 0.12.0 in order to use v2 tables. That's why I think it
> is good to adopt the spec without changing the default. People can opt into
> using the new tables for the new use cases and we can get some real-world
> use before changing the default. The main consequence of adoption is that
> we commit to supporting tables written for the v2 spec (as opposed to
> changing v2 in incompatible ways) and we start to guarantee
> forward-compatibility (as opposed to making more optional -> required
> changes).
>
> Ryan
>
> --
> Ryan Blue
> Tabular
>
>
>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Reply via email to