As I've ventured further in working on this I've realized that it's
not practical (or even a good idea) to continue to maintain the "fixed
dictionary" path. Since the IPC protocol can have evolving
dictionaries, nearly all code paths in the codebase have to change to
work for the variable case, which leaves little to no code left that
is specialized for the "fixed" dictionary case. Additionally,
verifying that a collection of arrays all have the same dictionary is
an inexpensive operation.

Taking a step back, though, the fundamental design flaw with the
current DictionaryType is that it puts "data in the schema". This
means that the schema can have essentially unbounded size on the wire,
and we've seen issues with Flight already where potentially large
schemas with dictionaries can be serialized more than that need to be.
In general, I believe we should not put any data-dependent data in the
schema -- data belongs to the Array / RecordBatch data structures

Unfortunately, the consequences of this will be an unavoidable hard
API break in 0.14 (and in the bindings also) because constructors for
DictionaryArray and DictionaryType have to change. I wouldn't propose
a disruptive change like this unless I strongly believed it to be the
correct way forward. Otherwise behaviors (e.g. conversions to/from
pandas, etc.) should remain unchanged

In any case, I hope to have a patch up with the C++ and Python changes
later today or sometime tomorrow for further discussion

Thanks
Wes

On Tue, May 7, 2019 at 9:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I have started working on this some to assess what is involved.
>
> My present plan is to have
>
> FixedDictionaryType and FixedDictionaryArray
> VariableDictionaryType and VariableDictionaryArray
> deprecate (?) current DictionaryType/DictionaryArray names, for
> clarity (thoughts about this would be welcome -- this will make the
> patch diff much larger)
>
> Given that dictionaries can change in IPC streams, I believe the
> correct approach is to change IPC read/write paths to deal only in
> variable dictionary arrays.
>
> It has occurred to me to question whether it is worth maintaining two
> variants versus having only the single general purpose variable
> dictionary form. I'm not totally sure -- in the fixed/static case you
> can assume that the dictionary is a fixed quantity and avoid any
> checking when working with multiple arrays. On the flip side, if you
> have multiple arrays all having the same dictionary, then verifying
> this fact is cheap (if the dictionary in each case is always _the same
> object_, so dict_a->Equals(dict_b) is cheap). If I could start the
> project over, I think that I would have preferred to only have the
> variable form and wait for more use cases for the less flexible fixed
> case -- in the case of interop with tools like R and Python pandas
> that have built-in categorical (factor) types, generally only a single
> piece of array data is being worked with, and so fixed and variable
> are equivalent when you only have one array.
>
> In any case, I will at least endeavor to disentangle logic that makes
> assumptions about whether the dictionary is knowable from the type
> object and put up a patch for discussion, probably later this week or
> first thing next week (since I am speaking at a conference later this
> week)
>
> - Wes
>
> On Wed, May 1, 2019 at 11:38 AM Hatem Helal <hhe...@mathworks.com> wrote:
> >
> > Thanks Wes, your proposed additional data type makes more sense to me.
> >
> > >     As a first use case for this I would be personally looking to address 
> > > reads of encoded data from
> > >     Parquet format without an intermediate pass through dense format
> > >     (which can be slow and wasteful for heavily compressed string data)
> >
> > Feel free to grab ARROW-3772 off of me...I had hoped to work on it after 
> > finishing ARROW-3769 but it seems that introducing this additional data 
> > type will be necessary to make progress on that issue.
> >
> >
> >

Reply via email to