On Tue, Feb 18, 2020 at 2:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>
> It is this type of thing that I'm worried about (Parquet just happens
> to be where I'm working now). Having corner cases that even experts in
> Arrow have a hard time getting right seems not great, but I don't have
> solutions either.
>
> Another example of this is the assumption that null lists take zero
> length (at least in the Arrow-to-Parquet code, I'm not seeing where we
> handle the condition where the assumption is false).
>
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>
> This seems like a good reason to keep it. Does R support the same
> double nullness that Arrow has in its spec? I wonder if an update to
> the spec saying only one mechanism should be used for nullness would
> be helpful, or if even that is too big of a change?

It seems that when R sees NA in the factor levels, it is treated as a
separate value and not considered null/NA (FWIW, NA and NULL in R are
not the same thing, but that's a different topic that we can avoid for
our purposes):

> a <- factor(c("a", "a", "a", "b", "b"))
> a <- factor(c("a", "a", "a", "b", "b", NA))
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> table(a)
a
a b
3 2
> a <- addNA(a)
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> a
[1] a    a    a    b    b    <NA>
Levels: a b <NA>
> table(a)
a
   a    b <NA>
   3    2    1

If you look at the implementation of functions like "table" you can see
a good amount of code having to do with handling this case (the factor
levels / dictionary values contain a null, so an <NA> bucket needs to
be appended to the result).

In practice it seems that the semantics may be application-determined --
a null in the dictionary values versus a null in the dictionary indices
could have different semantic meanings (or not, it's hard to say).
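To make the two mechanisms concrete, here is a minimal sketch using
pyarrow. This is illustrative only; it assumes (consistent with my
reading of the C++ implementation) that DictionaryArray.null_count is
computed from the validity bitmap of the indices:

import pyarrow as pa

# Mechanism 1: the null lives in the indices (a masked index slot).
via_indices = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, None], type=pa.int8()),  # slot 2 is null
    pa.array(["a", "b"]))                    # dictionary has no nulls

# Mechanism 2: the null lives in the dictionary values.
via_values = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 2], type=pa.int8()),     # all indices valid
    pa.array(["a", "b", None]))              # index 2 points at a null

# Both arrays logically encode ["a", "b", null], but:
print(via_indices.null_count)  # 1
print(via_values.null_count)   # 0 -- the null "hides" in the dictionary

This is exactly why an expression like SUM(ISNULL($field)) has to check
both places.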
> Thanks,
> Micah
>
> On Tue, Feb 11, 2020 at 2:36 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Micah,
>>
>> It seems like the null and nested issues really come up when trying
>> to translate from one dictionary encoding scheme to another. That we
>> are able to directly write dictionary-encoded data to Parquet format
>> is beneficial, but it doesn't seem like we should let the constraints
>> of Parquet's dictionary compression push back into the Arrow
>> specification. I agree that it does add complexity to implementing
>> the specification.
>>
>> Some other thoughts about the null issue:
>>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>> * there may be situations where you want to consider null to be a
>> value
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>>
>> Others might have more opinions.
>>
>> [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
>>
>> On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Wes and Brian,
>> >
>> > Thanks for the feedback. My intent in raising these issues is that
>> > they make the spec harder to work with/implement (i.e., we have
>> > existing bugs, etc.). I'm wondering if we should take the
>> > opportunity to simplify before things are set in stone. If we think
>> > things are already set, then I'm OK with that as well.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> > I'd like to understand if anyone is making use of the following
>> >> > features and if we should revisit them before 1.0.
>> >> >
>> >> > 1. Dictionaries can encode null values.
>> >> > - This becomes error-prone for things like Parquet. We seem to
>> >> > be calculating the definition level solely based on the null
>> >> > bitmap.
>> >> >
>> >> > I might have missed something, but it appears that we only check
>> >> > whether a dictionary contains nulls on the optimized path [1],
>> >> > not when converting the dictionary array back to dense, so I
>> >> > think the values written could get out of sync with the rep/def
>> >> > levels. It seems we should potentially disallow dictionaries
>> >> > from containing null values?
>> >>
>> >> Are you talking about the direct DictionaryArray encoding path?
>> >>
>> >> Since both nulls and duplicates are allowed in Arrow dictionaries,
>> >> I think that Parquet will need to check for the null-in-dictionary
>> >> case in the equivalent of ArrowColumnWriter::Write [1]. You could
>> >> start by checking and erroring out.
>> >>
>> >> [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
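For what it's worth, a rough sketch of that kind of pre-write guard, in
Python for brevity (check_dictionary_nulls is a hypothetical helper, not
an existing API; the real check would presumably live in the C++ write
path linked above):

import pyarrow as pa

def check_dictionary_nulls(table):
    # Hypothetical guard: error out when a dictionary column's *values*
    # contain nulls, rather than writing data whose def levels could
    # disagree with the decoded values.
    for name in table.column_names:
        for chunk in table.column(name).chunks:
            if (pa.types.is_dictionary(chunk.type)
                    and chunk.dictionary.null_count > 0):
                raise ValueError(
                    "dictionary values of column %r contain nulls" % name)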
>> >> > 2. Dictionaries can contain nested columns which are in turn
>> >> > dictionary-encoded.
>> >> > - Again, we aren't handling this in Parquet today, and I'm
>> >> > wondering if it is worth the effort. There was a PR merged a
>> >> > while ago [2] to add a "skipped" integration test, but it
>> >> > doesn't look like anyone has done follow-up work to enable
>> >> > this/make it pass.
>> >>
>> >> I started looking at this the other day (on the integration
>> >> tests); I'd like to get the C++ side fully working with
>> >> integration tests.
>> >>
>> >> As far as Parquet goes, it doesn't seem worth the effort to try
>> >> to implement a faithful roundtrip. Since we are serializing the
>> >> Arrow types into the file metadata, we could at least cast back,
>> >> but perhaps lose the ordering.
>> >>
>> >> > It seems simpler to keep dictionary encoding at the leaves of
>> >> > the schema.
>> >> >
>> >> > Of the two, I'm a little more worried that Option #1 will break
>> >> > people if we decide to disallow it.
>> >>
>> >> I think the ship has sailed on disallowing this at the format
>> >> level.
>> >>
>> >> > Thoughts?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > [1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> >> > [2] https://github.com/apache/arrow/pull/1848
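To picture the nested case in point 2 above: the columnar spec permits a
dictionary type whose value type is itself nested and dictionary-encoded.
A sketch of constructing such a type with pyarrow (type construction
only; round-tripping arrays of this type through Parquet is exactly the
open question):

import pyarrow as pa

# A dictionary whose values are lists of dictionary-encoded strings --
# legal per the columnar format, but awkward to round-trip via Parquet.
inner = pa.dictionary(pa.int8(), pa.string())
nested = pa.dictionary(pa.int16(), pa.list_(inner))
print(nested)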