On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > > I'd like to understand if any one is making use of the following features > and if we should revisit them before 1.0. > > 1. Dictionaries can encode null values. > - This become error prone for things like parquet. We seem to be > calculating the definition level solely based on the null bitmap. > > I might have missed something but it appears that we only check if a > dictionary contains nulls on the optimized path [1] but not when converting > the dictionary array back to dense, so I think the values written could get > out of sync with the rep/def levels? > > It seems we should potentially disallow dictionaries to contain null > values?
Are you talking about the direct DictionaryArray encoding path? Since both nulls and duplicates are allowed in Arrow dictionaries, I think that Parquet will need to check for the null-in-dictionary case in the equivalent of ArrowColumnWriter::Write [1]. You could start by checking and erroring out. [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324 > > 2. Dictionaries can nested columns which are in turn dictionary encoded > columns. > > - Again we aren't handling this in Parquet today, and I'm wondering if it > worth the effort. > There was a PR merged a while ago [2] to add a "skipped" integration test > but it doesn't look like anyone has done follow-up work to make enable > this/make it pass. I started looking at this the other day (on the integration tests), I'd like to get the C++ side fully working with integration tests. As far as Parquet it doesn't seem worth the effort to try to implement a faithful roundtrip. Since we are serializing the Arrow types into the file metadata we could at least cast back but perhaps lose the ordering. > > It seems simpler to keep dictionary encoding at the leafs of the schema. > > Of the two I'm a little more worried that Option #1 will break people if we > decide to disallow it. I think the ship has sailed on disallowing this at the format level. > > Thoughts? > > Thanks, > Micah > > > [1] > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765 > [2] https://github.com/apache/arrow/pull/1848