>
> > * The _actual_ dictionary values for a particular Array must be stored
> > somewhere and lifetime managed. I propose to put these as a single
> > entry in ArrayData::child_data [4]. An alternative to this would be to
> > modify ArrayData to have a dictionary field that would be unused
> > except for encoded datasets
> `child_data` is supposed to mirror more or less the order of buffers in
> an IPC stream, right?  Therefore I would favour a dedicated dictionary
> field (also makes fetching the dictionary trivial)


Without seeing the code, I'd also be in favor a separate field.  Was there
a reason you were in favor of adding another element to child_data?

On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Wes,
>
> Le 29/04/2019 à 20:10, Wes McKinney a écrit :
> >
> > * Receiving a record batch schema without the dictionaries attached
> > (e.g. in Arrow Flight), see also experimental patch [2]
>
> Note that this was finally done in a separate PR, and only required
> changes in the IPC implementation.
>
> > Here is my proposal to reconcile these issues in C++
> >
> > * Add a new "synthetic" data type called "variable dictionary" to be
> > used alongside the existing "static dictionary" type. An instance of
> > VariableDictionaryType (name TBD) will not know what the dictionary
> > is, only the data type of the dictionary (e.g. utf8()) and the index
> > type (e.g. int32())
>
> Interesting idea.  I'm curious to see a PR.
>
> > * Define common abstract API for instances of static vs variable
> > dictionary arrays. Mainly this means making
> > DictionaryArray::dictionary [3] virtual
>
> I'm not sure this is required, especially if the following is implemented:
>
> > * The _actual_ dictionary values for a particular Array must be stored
> > somewhere and lifetime managed. I propose to put these as a single
> > entry in ArrayData::child_data [4]. An alternative to this would be to
> > modify ArrayData to have a dictionary field that would be unused
> > except for encoded datasets
>
> `child_data` is supposed to mirror more or less the order of buffers in
> an IPC stream, right?  Therefore I would favour a dedicated dictionary
> field (also makes fetching the dictionary trivial).
>
> > This proposal does create some ongoing implementation and maintenance
> > burden, but to that I would make these points:
> >
> > * Many algorithms will dispatch from one type to the other (probably
> > static dispatching to the variable path), so there will not be a need
> > to implement multiple times in most cases
>
> Sounds believable indeed.
>
> Regards
>
> Antoine.
>

Reply via email to