On Mon, Apr 29, 2019 at 2:59 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > > > > > > * The _actual_ dictionary values for a particular Array must be stored > > > somewhere and lifetime managed. I propose to put these as a single > > > entry in ArrayData::child_data [4]. An alternative to this would be to > > > modify ArrayData to have a dictionary field that would be unused > > > except for encoded datasets > > `child_data` is supposed to mirror more or less the order of buffers in > > an IPC stream, right? Therefore I would favour a dedicated dictionary > > field (also makes fetching the dictionary trivial) > > > Without seeing the code, I'd also be in favor a separate field. Was there > a reason you were in favor of adding another element to child_data? >
I guess I should write some microbenchmarks to gauge the impact (if any) of adding another non-trivial struct member. That was the main reason (for the cost to non-dictionary arrays to be zero) > On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > Hi Wes, > > > > Le 29/04/2019 à 20:10, Wes McKinney a écrit : > > > > > > * Receiving a record batch schema without the dictionaries attached > > > (e.g. in Arrow Flight), see also experimental patch [2] > > > > Note that this was finally done in a separate PR, and only required > > changes in the IPC implementation. > > > > > Here is my proposal to reconcile these issues in C++ > > > > > > * Add a new "synthetic" data type called "variable dictionary" to be > > > used alongside the existing "static dictionary" type. An instance of > > > VariableDictionaryType (name TBD) will not know what the dictionary > > > is, only the data type of the dictionary (e.g. utf8()) and the index > > > type (e.g. int32()) > > > > Interesting idea. I'm curious to see a PR. > > > > > * Define common abstract API for instances of static vs variable > > > dictionary arrays. Mainly this means making > > > DictionaryArray::dictionary [3] virtual > > > > I'm not sure this is required, especially if the following is implemented: > > > > > * The _actual_ dictionary values for a particular Array must be stored > > > somewhere and lifetime managed. I propose to put these as a single > > > entry in ArrayData::child_data [4]. An alternative to this would be to > > > modify ArrayData to have a dictionary field that would be unused > > > except for encoded datasets > > > > `child_data` is supposed to mirror more or less the order of buffers in > > an IPC stream, right? Therefore I would favour a dedicated dictionary > > field (also makes fetching the dictionary trivial). > > > > > This proposal does create some ongoing implementation and maintenance > > > burden, but to that I would make these points: > > > > > > * Many algorithms will dispatch from one type to the other (probably > > > static dispatching to the variable path), so there will not be a need > > > to implement multiple times in most cases > > > > Sounds believable indeed. > > > > Regards > > > > Antoine. > >