[Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Micah Kornfield Sat, 08 Feb 2020 22:54:00 -0800

I'd like to understand if anyone is making use of the following features
and if we should revisit them before 1.0.


1. Dictionaries can encode null values.
- This becomes error-prone for things like Parquet.  We seem to be
calculating the definition level solely based on the null bitmap.

I might have missed something but it appears that we only check if a
dictionary contains nulls on the optimized path [1] but not when converting
the dictionary array back to dense, so I think the values written could get
out of sync with the rep/def levels?

It seems we should potentially disallow dictionaries from containing null
values?
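
To make the concern concrete, here is a minimal pure-Python sketch of the
mismatch I have in mind. It is a simplified model, not the actual Arrow or
Parquet code: the helper names (def_levels_from_bitmap, decode_dense) and the
flat, non-nested column are assumptions made purely for illustration.

```python
from typing import List, Optional


def def_levels_from_bitmap(validity: List[bool]) -> List[int]:
    """Definition levels for a flat optional column, derived ONLY from the
    validity bitmap of the dictionary indices (1 = present, 0 = null)."""
    return [1 if valid else 0 for valid in validity]


def decode_dense(indices: List[int], validity: List[bool],
                 dictionary: List[Optional[str]]) -> List[Optional[str]]:
    """Expand a dictionary-encoded column to dense values. A slot ends up
    null if the index is null OR the dictionary entry it points to is null."""
    return [dictionary[i] if valid else None
            for i, valid in zip(indices, validity)]


# Hypothetical column: slot 2 is null via the bitmap, while slots 1 and 3
# point at a null entry stored inside the dictionary itself.
indices = [0, 1, 0, 1]
validity = [True, True, False, True]
dictionary = ["a", None]

levels = def_levels_from_bitmap(validity)
dense = decode_dense(indices, validity, dictionary)
non_null_values = [v for v in dense if v is not None]

# The levels claim three present values, but only one non-null value exists.
print(sum(levels), len(non_null_values))
```

In this model the bitmap-only levels announce three values while the dense
data holds one, which is the kind of drift between values and def levels I am
worried about.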

2.  Dictionaries can contain nested columns which are in turn
dictionary-encoded columns.

- Again, we aren't handling this in Parquet today, and I'm wondering if it
is worth the effort.
There was a PR merged a while ago [2] to add a "skipped" integration test,
but it doesn't look like anyone has done follow-up work to enable
this/make it pass.

It seems simpler to keep dictionary encoding at the leaves of the schema.
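
As a rough illustration of what "nested" means here, the sketch below models
a dictionary whose values are a struct with a child that is itself
dictionary-encoded. It is pure Python with hypothetical types (DictEncoded,
Struct) and does not reflect Arrow's actual in-memory layout; nulls are
ignored to keep it short.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DictEncoded:
    indices: List[int]
    dictionary: Any  # a plain list, a Struct, or another DictEncoded


@dataclass
class Struct:
    children: Dict[str, Any]  # field name -> column of equal length


def decode(col: Any) -> List[Any]:
    """Recursively expand a column to plain Python values."""
    if isinstance(col, DictEncoded):
        values = decode(col.dictionary)  # the dictionary may itself be nested
        return [values[i] for i in col.indices]
    if isinstance(col, Struct):
        decoded = {name: decode(child) for name, child in col.children.items()}
        length = len(next(iter(decoded.values())))
        return [{name: vals[row] for name, vals in decoded.items()}
                for row in range(length)]
    return list(col)  # leaf: already dense


# Inner dictionary-encoded leaf, wrapped in a struct, used as the values of
# an outer dictionary.
city = DictEncoded(indices=[0, 1], dictionary=["NYC", "SF"])
places = Struct(children={"city": city, "zip": [10001, 94105]})
column = DictEncoded(indices=[1, 0, 1], dictionary=places)

rows = decode(column)
print(rows)
```

Every reader or writer has to handle this recursion; with dictionaries only
at the leaves, the DictEncoded branch would never need to recurse into
another encoded column.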

Of the two, I'm a little more worried that item #1 will break people if we
decide to disallow it.

Thoughts?

Thanks,
Micah


[1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
[2] https://github.com/apache/arrow/pull/1848
