I have started working on this to assess what is involved.

My present plan is to:

* Add FixedDictionaryType and FixedDictionaryArray
* Add VariableDictionaryType and VariableDictionaryArray
* Deprecate (?) the current DictionaryType/DictionaryArray names, for
  clarity (thoughts about this would be welcome -- this will make the
  patch diff much larger)

Given that dictionaries can change in IPC streams, I believe the
correct approach is to change IPC read/write paths to deal only in
variable dictionary arrays.

It has occurred to me to question whether it is worth maintaining two
variants versus having only the single general purpose variable
dictionary form. I'm not totally sure -- in the fixed/static case you
can assume that the dictionary is a fixed quantity and avoid any
checking when working with multiple arrays. On the flip side, if you
have multiple arrays all having the same dictionary, then verifying
this fact is cheap (if the dictionary in each case is always _the same
object_, so dict_a->Equals(dict_b) is cheap). If I could start the
project over, I think that I would have preferred to only have the
variable form and wait for more use cases for the less flexible fixed
case -- in the case of interop with tools like R and Python pandas
that have built-in categorical (factor) types, generally only a single
piece of array data is being worked with, and so fixed and variable
are equivalent when you only have one array.
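
To make the "verifying this is cheap" point concrete, here is a minimal
sketch. Dictionary, DictArray, and AllSameDictionary are hypothetical
stand-ins written for this email, not Arrow classes, and the sketch assumes
that Equals checks object identity before comparing values.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary of string values (not an Arrow class).
struct Dictionary {
  std::vector<std::string> values;

  // Identity fast path first, then an element-wise comparison.
  bool Equals(const Dictionary& other) const {
    if (this == &other) return true;  // same object: constant time
    return values == other.values;    // distinct objects: linear in the size
  }
};

// Hypothetical stand-in for a dictionary-encoded array: indices plus a
// shared pointer to the dictionary that the indices refer to.
struct DictArray {
  std::vector<int> indices;
  std::shared_ptr<const Dictionary> dictionary;
};

// Returns true if every array refers to a dictionary equal to the first one.
// When all arrays share one dictionary object, each check is a pointer compare.
bool AllSameDictionary(const std::vector<DictArray>& arrays) {
  if (arrays.empty()) return true;
  const std::shared_ptr<const Dictionary>& first = arrays.front().dictionary;
  assert(first != nullptr);
  for (const DictArray& arr : arrays) {
    assert(arr.dictionary != nullptr);
    if (!first->Equals(*arr.dictionary)) return false;
  }
  return true;
}
```

With N arrays sharing one dictionary object, this does N pointer
comparisons; the element-wise comparison only runs when two arrays hold
distinct dictionary objects.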

In any case, I will at least endeavor to disentangle logic that makes
assumptions about whether the dictionary is knowable from the type
object and put up a patch for discussion, probably later this week or
first thing next week (since I am speaking at a conference later this
week).
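
For discussion, here is a rough sketch of the distinction I mean by "knowable
from the type object". The type and array names are the ones proposed above,
but every member and the Decode helper are assumptions made up for
illustration; the actual patch may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Assumption: a dictionary is modeled as a plain vector of strings.
using Dictionary = std::vector<std::string>;

// Fixed variant: the dictionary is part of the type, so any code holding the
// type object can see the dictionary without looking at an array.
struct FixedDictionaryType {
  std::shared_ptr<const Dictionary> dictionary;
};

struct FixedDictionaryArray {
  std::shared_ptr<const FixedDictionaryType> type;
  std::vector<std::size_t> indices;
  const Dictionary& dictionary() const { return *type->dictionary; }
};

// Variable variant: the type says only "dictionary-encoded"; each array
// carries its own dictionary, which may differ from array to array.
struct VariableDictionaryType {};

struct VariableDictionaryArray {
  std::shared_ptr<const VariableDictionaryType> type;
  std::shared_ptr<const Dictionary> dictionary;
  std::vector<std::size_t> indices;
};

// Hypothetical helper: expand indices into dense values. It takes the
// dictionary explicitly, so it works the same way for both variants.
std::vector<std::string> Decode(const Dictionary& dict,
                                const std::vector<std::size_t>& indices) {
  std::vector<std::string> out;
  out.reserve(indices.size());
  for (std::size_t i : indices) {
    assert(i < dict.size());
    out.push_back(dict[i]);
  }
  return out;
}
```

Code that only ever receives the dictionary from the array, as Decode does
here, makes no assumption about which variant it is working with.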

- Wes

On Wed, May 1, 2019 at 11:38 AM Hatem Helal <hhe...@mathworks.com> wrote:
>
> Thanks Wes, your proposed additional data type makes more sense to me.
>
> >     As a first use case for this I would be personally looking to address
> >     reads of encoded data from Parquet format without an intermediate pass
> >     through dense format (which can be slow and wasteful for heavily
> >     compressed string data)
>
> Feel free to grab ARROW-3772 off of me...I had hoped to work on it after 
> finishing ARROW-3769 but it seems that introducing this additional data type 
> will be necessary to make progress on that issue.
>
>
>
