> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null

It is this type of thing that I'm worried about (Parquet just happens to be
where I'm working right now).  Having corner cases that even experts in
Arrow have a hard time getting right doesn't seem great, but I don't have
solutions either.

Another example of this is the assumption that null lists have zero length
(at least in the Arrow-to-Parquet code, I'm not seeing where we handle the
case where that assumption is false).
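
Here is a minimal pyarrow sketch (hypothetical data) of an array that is
valid per the spec but violates that assumption: the validity bitmap marks
slot 1 null while the offsets still give it a non-zero extent, so a writer
that derives lengths or def levels from only one of the two could go out of
sync.

    import pyarrow as pa

    # list<int64> of length 3; slot 1 is null per the bitmap (0b101), but its
    # offsets (2..5) still span three child values, which the spec permits.
    arr = pa.Array.from_buffers(
        pa.list_(pa.int64()), 3,
        [pa.py_buffer(bytes([0b00000101])),
         pa.array([0, 2, 5, 6], type=pa.int32()).buffers()[1]],
        children=[pa.array([1, 2, 3, 4, 5, 6], type=pa.int64())])

    print(arr.is_valid())  # [true, false, true]
    print(arr.offsets)     # [0, 2, 5, 6] -- the null slot still spans offsets 2..5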

> * some other systems with dictionary encoding allow null as a
> dictionary value [1]. Sometimes this is used to determine whether to
> include the "null group" in tabulations / group aggregations

This seems like a good reason to keep it.  Does R support the same double
representation of nullness (a null index vs. a null dictionary value) that
Arrow allows in its spec?  I wonder if an update to the spec saying only one
mechanism should be used for nullness would be helpful, or if even that is
too big a change.
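
For reference, the redundancy I mean, as a pyarrow sketch with made-up
values: both arrays below should decode to the same logical ["a", null], one
marking the null through the indices' bitmap and the other through a null
dictionary value, which is why recommending a single mechanism in the spec
seems attractive.

    import pyarrow as pa

    # Null expressed via the indices' validity bitmap.
    mask_null = pa.DictionaryArray.from_arrays(
        pa.array([0, None], type=pa.int32()), pa.array(["a"]))
    # Null expressed via a null entry in the dictionary values.
    dict_null = pa.DictionaryArray.from_arrays(
        pa.array([0, 1], type=pa.int32()), pa.array(["a", None]))

    # Both should decode to ["a", null].
    print(mask_null.cast(pa.string()).equals(dict_null.cast(pa.string())))  # expected: True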

Thanks,
Micah


On Tue, Feb 11, 2020 at 2:36 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> It seems like the null and nested issues really come up when trying to
> translate from one dictionary encoding scheme to another. That we are
> able to directly write dictionary-encoded data to Parquet format is
> beneficial, but it doesn't seem like we should let the constraints of
> Parquet's dictionary compression push back into the Arrow
> specification. I agree that it does add complexity to implementing the
> specification.
>
> Some other thoughts about the null issue
>
> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null
> * there may be situations where you want to consider null to be a value
> * some other systems with dictionary encoding allow null as a
> dictionary value [1]. Sometimes this is used to determine whether to
> include the "null group" in tabulations / group aggregations
>
> Others might have more opinions
>
> [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
>
> On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Wes and Brian,
> >
> > Thanks for the feedback.  I raised these issues because they make the
> > spec harder to work with/implement (i.e. we have existing bugs, etc.).
> > I'm wondering if we should take the opportunity to simplify before
> > things are set in stone. If we think things are already set, then I'm OK
> > with that as well.
> >
> > Thanks,
> > Micah
> >
> > On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>
> >>
> >>
> >> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >> >
> >> > I'd like to understand if anyone is making use of the following
> >> > features and if we should revisit them before 1.0.
> >> >
> >> > 1. Dictionaries can encode null values.
> >> > - This becomes error prone for things like Parquet.  We seem to be
> >> > calculating the definition level solely based on the null bitmap.
> >> >
> >> > I might have missed something, but it appears that we only check if a
> >> > dictionary contains nulls on the optimized path [1], not when
> >> > converting the dictionary array back to dense, so I think the values
> >> > written could get out of sync with the rep/def levels?
> >> >
> >> > It seems we should potentially disallow dictionaries from containing
> >> > null values?
> >>
> >> Are you talking about the direct DictionaryArray encoding path?
> >>
> >> Since both nulls and duplicates are allowed in Arrow dictionaries, I
> >> think that Parquet will need to check for the null-in-dictionary case
> >> in the equivalent of ArrowColumnWriter::Write [1]. You could start by
> >> checking and erroring out.
> >>
> >> [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
> >>
> >> >
> >> > 2.  Dictionaries can encode nested columns which are in turn
> >> > dictionary-encoded columns.
> >> >
> >> > - Again we aren't handling this in Parquet today, and I'm wondering
> >> > if it is worth the effort.
> >> > There was a PR merged a while ago [2] to add a "skipped" integration
> >> > test, but it doesn't look like anyone has done follow-up work to
> >> > enable this/make it pass.
> >>
> >> I started looking at this the other day (on the integration tests); I'd
> >> like to get the C++ side fully working with integration tests.
> >>
> >> As far as Parquet goes, it doesn't seem worth the effort to try to
> >> implement a faithful roundtrip. Since we are serializing the Arrow
> >> types into the file metadata, we could at least cast back but perhaps
> >> lose the ordering.
> >>
> >> >
> >> > It seems simpler to keep dictionary encoding at the leaves of the
> >> > schema.
> >> >
> >> > Of the two, I'm a little more worried that Option #1 will break
> >> > people if we decide to disallow it.
> >>
> >> I think the ship has sailed on disallowing this at the format level.
> >>
> >> >
> >> > Thoughts?
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> >> > [1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> >> > [2] https://github.com/apache/arrow/pull/1848
>
