On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
>
> I'd like to understand if any one is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This become error prone for things like parquet.  We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when
converting
> the dictionary array back to dense, so I think the values written could
get
> out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries to contain null
> values?

Are you talking about the direct DictionaryArray encoding path?

Since both nulls and duplicates are allowed in Arrow dictionaries, I think
that Parquet will need to check for the null-in-dictionary case in the
equivalent of ArrowColumnWriter::Write [1]. You could start by checking and
erroring out.

[1]:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324

>
> 2.  Dictionaries can nested columns which are in turn dictionary encoded
> columns.
>
> - Again we aren't handling this in Parquet today, and I'm wondering if it
> worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test
> but it doesn't look like anyone has done follow-up work to make enable
> this/make it pass.

I started looking at this the other day (on the integration tests), I'd
like to get the C++ side fully working with integration tests.

As far as Parquet it doesn't seem worth the effort to try to implement a
faithful roundtrip. Since we are serializing the Arrow types into the file
metadata we could at least cast back but perhaps lose the ordering.

>
> It seems simpler to keep dictionary encoding at the leafs of the schema.
>
> Of the two I'm a little more worried that Option #1 will break people if
we
> decide to disallow it.

I think the ship has sailed on disallowing this at the format level.

>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
>
https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848

Reply via email to