Hi Wes,
Le 29/04/2019 à 20:10, Wes McKinney a écrit : > > * Receiving a record batch schema without the dictionaries attached > (e.g. in Arrow Flight), see also experimental patch [2] Note that this was finally done in a separate PR, and only required changes in the IPC implementation. > Here is my proposal to reconcile these issues in C++ > > * Add a new "synthetic" data type called "variable dictionary" to be > used alongside the existing "static dictionary" type. An instance of > VariableDictionaryType (name TBD) will not know what the dictionary > is, only the data type of the dictionary (e.g. utf8()) and the index > type (e.g. int32()) Interesting idea. I'm curious to see a PR. > * Define common abstract API for instances of static vs variable > dictionary arrays. Mainly this means making > DictionaryArray::dictionary [3] virtual I'm not sure this is required, especially if the following is implemented: > * The _actual_ dictionary values for a particular Array must be stored > somewhere and lifetime managed. I propose to put these as a single > entry in ArrayData::child_data [4]. An alternative to this would be to > modify ArrayData to have a dictionary field that would be unused > except for encoded datasets `child_data` is supposed to mirror more or less the order of buffers in an IPC stream, right? Therefore I would favour a dedicated dictionary field (also makes fetching the dictionary trivial). > This proposal does create some ongoing implementation and maintenance > burden, but to that I would make these points: > > * Many algorithms will dispatch from one type to the other (probably > static dispatching to the variable path), so there will not be a need > to implement multiple times in most cases Sounds believable indeed. Regards Antoine.