Hi Wes and Brian,

Thanks for the feedback. My intent in raising these issues is that they
make the spec harder to work with and implement (i.e., we have existing
bugs, etc.). I'm wondering if we should take the opportunity to simplify
before things are set in stone. If we think things are already set, then
I'm OK with that as well.
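To make the first issue concrete, here is a rough sketch of the "check
and error out" approach Wes suggests below. The helper name and error
message are hypothetical, not code that exists in the writer today; only
the DictionaryArray and Status APIs are real: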
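  #include <arrow/api.h>

  // Hypothetical pre-write check, roughly what the equivalent of
  // ArrowColumnWriter::Write could do before encoding. Nulls masked
  // through the indices are fine (they drive the def levels); null
  // entries in the dictionary values are the case that can get out of
  // sync with the rep/def levels.
  arrow::Status CheckDictionaryValuesHaveNoNulls(const arrow::Array& array) {
    if (array.type_id() == arrow::Type::DICTIONARY) {
      const auto& dict_array =
          static_cast<const arrow::DictionaryArray&>(array);
      if (dict_array.dictionary()->null_count() > 0) {
        return arrow::Status::NotImplemented(
            "Writing dictionaries containing null values to Parquet");
      }
    }
    return arrow::Status::OK();
  }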
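And for reference on the second issue, the kind of type the spec
currently permits can be built directly with the C++ type factories
(again just a sketch; the function name is mine):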
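  #include <arrow/api.h>

  // dictionary<int32, list<dictionary<int32, utf8>>>: a dictionary
  // whose values are lists that themselves contain dictionary-encoded
  // strings. Legal per the spec, but hard to round-trip through Parquet.
  std::shared_ptr<arrow::DataType> MakeNestedDictionaryType() {
    auto inner = arrow::dictionary(arrow::int32(), arrow::utf8());
    return arrow::dictionary(arrow::int32(), arrow::list(inner));
  }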
Thanks,
Micah

On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > I'd like to understand if anyone is making use of the following
> > features and if we should revisit them before 1.0.
> >
> > 1. Dictionaries can encode null values.
> >
> > - This becomes error-prone for things like Parquet. We seem to be
> > calculating the definition level solely based on the null bitmap.
> >
> > I might have missed something, but it appears that we only check if a
> > dictionary contains nulls on the optimized path [1], but not when
> > converting the dictionary array back to dense, so I think the values
> > written could get out of sync with the rep/def levels?
> >
> > It seems we should potentially disallow dictionaries to contain null
> > values?
>
> Are you talking about the direct DictionaryArray encoding path?
>
> Since both nulls and duplicates are allowed in Arrow dictionaries, I
> think that Parquet will need to check for the null-in-dictionary case
> in the equivalent of ArrowColumnWriter::Write [1]. You could start by
> checking and erroring out.
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
>
> > 2. Dictionaries can contain nested columns which are in turn
> > dictionary-encoded columns.
> >
> > - Again, we aren't handling this in Parquet today, and I'm wondering
> > if it's worth the effort. There was a PR merged a while ago [2] to add
> > a "skipped" integration test, but it doesn't look like anyone has done
> > follow-up work to enable this/make it pass.
>
> I started looking at this the other day (on the integration tests);
> I'd like to get the C++ side fully working with integration tests.
>
> As far as Parquet goes, it doesn't seem worth the effort to try to
> implement a faithful roundtrip. Since we are serializing the Arrow
> types into the file metadata, we could at least cast back but perhaps
> lose the ordering.
>
> > It seems simpler to keep dictionary encoding at the leaves of the
> > schema.
> >
> > Of the two, I'm a little more worried that Option #1 will break
> > people if we decide to disallow it.
>
> I think the ship has sailed on disallowing this at the format level.
>
> > Thoughts?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> > [2] https://github.com/apache/arrow/pull/1848