As I've ventured further in working on this I've realized that it's not practical (or even a good idea) to continue to maintain the "fixed dictionary" path. Since the IPC protocol can have evolving dictionaries, nearly all code paths in the codebase have to change to work for the variable case, which leaves little to no code left that is specialized for the "fixed" dictionary case. Additionally, verifying that a collection of arrays all have the same dictionary is an inexpensive operation.
Taking a step back, though, the fundamental design flaw with the current DictionaryType is that it puts "data in the schema". This means that the schema can have essentially unbounded size on the wire, and we've seen issues with Flight already where potentially large schemas with dictionaries can be serialized more than that need to be. In general, I believe we should not put any data-dependent data in the schema -- data belongs to the Array / RecordBatch data structures Unfortunately, the consequences of this will be an unavoidable hard API break in 0.14 (and in the bindings also) because constructors for DictionaryArray and DictionaryType have to change. I wouldn't propose a disruptive change like this unless I strongly believed it to be the correct way forward. Otherwise behaviors (e.g. conversions to/from pandas, etc.) should remain unchanged In any case, I hope to have a patch up with the C++ and Python changes later today or sometime tomorrow for further discussion Thanks Wes On Tue, May 7, 2019 at 9:38 PM Wes McKinney <wesmck...@gmail.com> wrote: > > I have started working on this some to assess what is involved. > > My present plan is to have > > FixedDictionaryType and FixedDictionaryArray > VariableDictionaryType and VariableDictionaryArray > deprecate (?) current DictionaryType/DictionaryArray names, for > clarity (thoughts about this would be welcome -- this will make the > patch diff much larger) > > Given that dictionaries can change in IPC streams, I believe the > correct approach is to change IPC read/write paths to deal only in > variable dictionary arrays. > > It has occurred to me to question whether it is worth maintaining two > variants versus having only the single general purpose variable > dictionary form. I'm not totally sure -- in the fixed/static case you > can assume that the dictionary is a fixed quantity and avoid any > checking when working with multiple arrays. On the flip side, if you > have multiple arrays all having the same dictionary, then verifying > this fact is cheap (if the dictionary in each case is always _the same > object_, so dict_a->Equals(dict_b) is cheap). If I could start the > project over, I think that I would have preferred to only have the > variable form and wait for more use cases for the less flexible fixed > case -- in the case of interop with tools like R and Python pandas > that have built-in categorical (factor) types, generally only a single > piece of array data is being worked with, and so fixed and variable > are equivalent when you only have one array. > > In any case, I will at least endeavor to disentangle logic that makes > assumptions about whether the dictionary is knowable from the type > object and put up a patch for discussion, probably later this week or > first thing next week (since I am speaking at a conference later this > week) > > - Wes > > On Wed, May 1, 2019 at 11:38 AM Hatem Helal <hhe...@mathworks.com> wrote: > > > > Thanks Wes, your proposed additional data type makes more sense to me. > > > > > As a first use case for this I would be personally looking to address > > > reads of encoded data from > > > Parquet format without an intermediate pass through dense format > > > (which can be slow and wasteful for heavily compressed string data) > > > > Feel free to grab ARROW-3772 off of me...I had hoped to work on it after > > finishing ARROW-3769 but it seems that introducing this additional data > > type will be necessary to make progress on that issue. > > > > > >