Re: [DISCUSS] Adopting the v2 spec changes

Ryan Blue Mon, 19 Jul 2021 17:02:16 -0700

I'll reply inline:

On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <[email protected]> wrote:


> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a
> few points that I'm not entirely clear on, which may warrant discussion:
>
> 1)  The discussion on resurfacing distinct_count may be something we want
> to include since there is some value.  I think it warrants keeping and
> dropping the deprecation (see distinct count discussion in dev list).  This
> actually isn't a change proposal to v2, but rather signifying that it will
> be carried forward.
>

I don't think that this affects v2 because it is a non-breaking change to
forward-compatibility. I think of this as adding it back to v1 because we
can add optional fields without any problem. Older writers won't produce
it, but they never did before it was deprecated anyway. Older readers will
simply ignore it when reading files.


> 2)  NaN counts would now be required for field_summary and I'm not clear
> on what that implies for tables being promoted from v1 to v2 since missing
> this information would require evaluating all of the partition values and
> rewriting all of the manifest lists where this info is missing (it's still
> optional for data files).  I wasn't able to find the discussion on making
> this required, but I assume it is so that we can accurately apply filters
> at this level.  Is this case handled?
>

You might be right about not making this required.

The intent was to change just the writer requirements. While that isn't
strictly necessary for some fields, it helps ensure that writers follow the
format as it evolves. But the problem is that the caller may not have this
information (whether any partition tuple contained NaN for that field) when
writing a new manifest list file because the old (v1) manifest list file
didn't include it. That would be known if the manifest was rewritten, but
may not be if the manifest is carried forward into a new manifest list.

Making this required would mean that we need to set it to something when
writing. Setting it to `true` would be fine because if the field is
missing, we assume there may be a NaN value in a partition tuple and scan
the manifest when looking for NaNs. But it does seem strange to require
that default so I'd support removing this.

I should also mention that even though this is optional for data files, we
should always know when writing a manifest. That's because this is about
the partition tuple, which is always present when writing a manifest.


> 3)  If we're signaling the formal adoption of v2 by changing the default
> table created, do we want to do that in the next release (0.12 assuming the
> discussion/vote closes in time) or should we separate that with a second
> release (0.13 or 1.0) to separate all of the other changes going into the
> next release from the official format adoption?
>

I don't think that we are signalling formal adoption by changing the
default. We are adopting by voting and will change the default when it
makes sense to do that based on real-world use. I think that adopting the
proposed v2 format means that we're confident that although there may be
implementation bugs, we are ready to support the new v2 format as we do v1.

The Java release and v2 format adoption are somewhat orthogonal. We could
release 0.12.0 and when v2 is released later state that they are
compatible. I think the only question is whether we find anything that we
need to change in 0.12.0 in order to use v2 tables. That's why I think it
is good to adopt the spec without changing the default. People can opt into
using the new tables for the new use cases and we can get some real-world
use before changing the default. The main consequence of adoption is that
we commit to supporting tables written for the v2 spec (as opposed to
changing v2 in incompatible ways) and we start to guarantee
forward-compatibility (as opposed to making more optional -> required
changes).

Ryan

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Reply via email to