> > > * The _actual_ dictionary values for a particular Array must be stored > > somewhere and lifetime managed. I propose to put these as a single > > entry in ArrayData::child_data [4]. An alternative to this would be to > > modify ArrayData to have a dictionary field that would be unused > > except for encoded datasets > `child_data` is supposed to mirror more or less the order of buffers in > an IPC stream, right? Therefore I would favour a dedicated dictionary > field (also makes fetching the dictionary trivial)
Without seeing the code, I'd also be in favor a separate field. Was there a reason you were in favor of adding another element to child_data? On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou <anto...@python.org> wrote: > > Hi Wes, > > Le 29/04/2019 à 20:10, Wes McKinney a écrit : > > > > * Receiving a record batch schema without the dictionaries attached > > (e.g. in Arrow Flight), see also experimental patch [2] > > Note that this was finally done in a separate PR, and only required > changes in the IPC implementation. > > > Here is my proposal to reconcile these issues in C++ > > > > * Add a new "synthetic" data type called "variable dictionary" to be > > used alongside the existing "static dictionary" type. An instance of > > VariableDictionaryType (name TBD) will not know what the dictionary > > is, only the data type of the dictionary (e.g. utf8()) and the index > > type (e.g. int32()) > > Interesting idea. I'm curious to see a PR. > > > * Define common abstract API for instances of static vs variable > > dictionary arrays. Mainly this means making > > DictionaryArray::dictionary [3] virtual > > I'm not sure this is required, especially if the following is implemented: > > > * The _actual_ dictionary values for a particular Array must be stored > > somewhere and lifetime managed. I propose to put these as a single > > entry in ArrayData::child_data [4]. An alternative to this would be to > > modify ArrayData to have a dictionary field that would be unused > > except for encoded datasets > > `child_data` is supposed to mirror more or less the order of buffers in > an IPC stream, right? Therefore I would favour a dedicated dictionary > field (also makes fetching the dictionary trivial). > > > This proposal does create some ongoing implementation and maintenance > > burden, but to that I would make these points: > > > > * Many algorithms will dispatch from one type to the other (probably > > static dispatching to the variable path), so there will not be a need > > to implement multiple times in most cases > > Sounds believable indeed. > > Regards > > Antoine. >