On Tue, Feb 18, 2020 at 2:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>
> It is this type of thing that I'm worried about (Parquet just happens
> to be where I'm working now). Having corner cases that even experts in
> Arrow have a hard time getting right seems not great, but I don't have
> solutions either.
>
> Another example of this is the assumption that null lists take zero
> length (at least in the Arrow-to-Parquet code, I'm not seeing where we
> handle the condition where the assumption is false).
>
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>
> This seems like a good reason to keep it. Does R support the same
> double nullness that Arrow has in its spec? I wonder if an update to
> the spec saying only one mechanism should be used for nullness would
> be helpful, or if even that is too big of a change?

It seems that when R sees NA in the factor levels, it is treated as a
separate value and not considered null/NA (FWIW, NA and NULL in R are
not the same thing, but that's a different topic that we can avoid for
our purposes):

> a <- factor(c("a", "a", "a", "b", "b"))
> a <- factor(c("a", "a", "a", "b", "b", NA))
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> table(a)
a
a b
3 2
> a <- addNA(a)
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> a
[1] a    a    a    b    b    <NA>
Levels: a b <NA>
> table(a)
a
   a    b <NA>
   3    2    1

If you look at the implementation of functions like "table" you can see
a good amount of code having to do with handling this case (the factor
levels / dictionary values contain a null, so an <NA> bucket needs to
be appended to the result).

In practice it seems that the semantics may be application-determined --
a null in the dictionary values versus a null in the dictionary indices
could have different semantic meanings (or not, it's hard to say).
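To make the two mechanisms concrete, here is a minimal sketch using
pyarrow. This is illustrative only; it assumes (consistent with my
reading of the C++ implementation) that DictionaryArray.null_count is
computed from the validity bitmap of the indices:

import pyarrow as pa

# Mechanism 1: the null lives in the indices (a masked index slot).
via_indices = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, None], type=pa.int8()),  # slot 2 is null
    pa.array(["a", "b"]))                    # dictionary has no nulls

# Mechanism 2: the null lives in the dictionary values.
via_values = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 2], type=pa.int8()),     # all indices valid
    pa.array(["a", "b", None]))              # index 2 points at a null

# Both arrays logically encode ["a", "b", null], but:
print(via_indices.null_count)  # 1
print(via_values.null_count)   # 0 -- the null "hides" in the dictionary

This is exactly why an expression like SUM(ISNULL($field)) has to check
both places.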
> Thanks,
> Micah
>
> On Tue, Feb 11, 2020 at 2:36 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Micah,
>>
>> It seems like the null and nested issues really come up when trying
>> to translate from one dictionary encoding scheme to another. That we
>> are able to directly write dictionary-encoded data to Parquet format
>> is beneficial, but it doesn't seem like we should let the constraints
>> of Parquet's dictionary compression push back into the Arrow
>> specification. I agree that it does add complexity to implementing
>> the specification.
>>
>> Some other thoughts about the null issue:
>>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>> * there may be situations where you want to consider null to be a
>> value
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>>
>> Others might have more opinions.
>>
>> [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
>>
>> On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Wes and Brian,
>> >
>> > Thanks for the feedback. My intent in raising these issues is that
>> > they make the spec harder to work with/implement (i.e., we have
>> > existing bugs, etc.). I'm wondering if we should take the
>> > opportunity to simplify before things are set in stone. If we think
>> > things are already set, then I'm OK with that as well.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> > I'd like to understand if anyone is making use of the following
>> >> > features and if we should revisit them before 1.0.
>> >> >
>> >> > 1. Dictionaries can encode null values.
>> >> > - This becomes error-prone for things like Parquet. We seem to
>> >> > be calculating the definition level solely based on the null
>> >> > bitmap.
>> >> >
>> >> > I might have missed something, but it appears that we only check
>> >> > whether a dictionary contains nulls on the optimized path [1],
>> >> > not when converting the dictionary array back to dense, so I
>> >> > think the values written could get out of sync with the rep/def
>> >> > levels. It seems we should potentially disallow dictionaries
>> >> > from containing null values?
>> >>
>> >> Are you talking about the direct DictionaryArray encoding path?
>> >>
>> >> Since both nulls and duplicates are allowed in Arrow dictionaries,
>> >> I think that Parquet will need to check for the null-in-dictionary
>> >> case in the equivalent of ArrowColumnWriter::Write [1]. You could
>> >> start by checking and erroring out.
>> >>
>> >> [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
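For what it's worth, a rough sketch of that kind of pre-write guard, in
Python for brevity (check_dictionary_nulls is a hypothetical helper, not
an existing API; the real check would presumably live in the C++ write
path linked above):

import pyarrow as pa

def check_dictionary_nulls(table):
    # Hypothetical guard: error out when a dictionary column's *values*
    # contain nulls, rather than writing data whose def levels could
    # disagree with the decoded values.
    for name in table.column_names:
        for chunk in table.column(name).chunks:
            if (pa.types.is_dictionary(chunk.type)
                    and chunk.dictionary.null_count > 0):
                raise ValueError(
                    "dictionary values of column %r contain nulls" % name)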
>> >> > 2. Dictionaries can contain nested columns which are in turn
>> >> > dictionary-encoded.
>> >> > - Again, we aren't handling this in Parquet today, and I'm
>> >> > wondering if it is worth the effort. There was a PR merged a
>> >> > while ago [2] to add a "skipped" integration test, but it
>> >> > doesn't look like anyone has done follow-up work to enable
>> >> > this/make it pass.
>> >>
>> >> I started looking at this the other day (on the integration
>> >> tests); I'd like to get the C++ side fully working with
>> >> integration tests.
>> >>
>> >> As far as Parquet goes, it doesn't seem worth the effort to try
>> >> to implement a faithful roundtrip. Since we are serializing the
>> >> Arrow types into the file metadata, we could at least cast back,
>> >> but perhaps lose the ordering.
>> >>
>> >> > It seems simpler to keep dictionary encoding at the leaves of
>> >> > the schema.
>> >> >
>> >> > Of the two, I'm a little more worried that Option #1 will break
>> >> > people if we decide to disallow it.
>> >>
>> >> I think the ship has sailed on disallowing this at the format
>> >> level.
>> >>
>> >> > Thoughts?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > [1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> >> > [2] https://github.com/apache/arrow/pull/1848
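To picture the nested case in point 2 above: the columnar spec permits a
dictionary type whose value type is itself nested and dictionary-encoded.
A sketch of constructing such a type with pyarrow (type construction
only; round-tripping arrays of this type through Parquet is exactly the
open question):

import pyarrow as pa

# A dictionary whose values are lists of dictionary-encoded strings --
# legal per the columnar format, but awkward to round-trip via Parquet.
inner = pa.dictionary(pa.int8(), pa.string())
nested = pa.dictionary(pa.int16(), pa.list_(inner))
print(nested)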