I have started working on this some to assess what is involved. My present plan is to have
FixedDictionaryType and FixedDictionaryArray VariableDictionaryType and VariableDictionaryArray deprecate (?) current DictionaryType/DictionaryArray names, for clarity (thoughts about this would be welcome -- this will make the patch diff much larger) Given that dictionaries can change in IPC streams, I believe the correct approach is to change IPC read/write paths to deal only in variable dictionary arrays. It has occurred to me to question whether it is worth maintaining two variants versus having only the single general purpose variable dictionary form. I'm not totally sure -- in the fixed/static case you can assume that the dictionary is a fixed quantity and avoid any checking when working with multiple arrays. On the flip side, if you have multiple arrays all having the same dictionary, then verifying this fact is cheap (if the dictionary in each case is always _the same object_, so dict_a->Equals(dict_b) is cheap). If I could start the project over, I think that I would have preferred to only have the variable form and wait for more use cases for the less flexible fixed case -- in the case of interop with tools like R and Python pandas that have built-in categorical (factor) types, generally only a single piece of array data is being worked with, and so fixed and variable are equivalent when you only have one array. In any case, I will at least endeavor to disentangle logic that makes assumptions about whether the dictionary is knowable from the type object and put up a patch for discussion, probably later this week or first thing next week (since I am speaking at a conference later this week) - Wes On Wed, May 1, 2019 at 11:38 AM Hatem Helal <hhe...@mathworks.com> wrote: > > Thanks Wes, your proposed additional data type makes more sense to me. > > > As a first use case for this I would be personally looking to address > > reads of encoded data from > > Parquet format without an intermediate pass through dense format > > (which can be slow and wasteful for heavily compressed string data) > > Feel free to grab ARROW-3772 off of me...I had hoped to work on it after > finishing ARROW-3769 but it seems that introducing this additional data type > will be necessary to make progress on that issue. > > >