On Mon, Apr 29, 2019 at 2:59 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> >
> > > * The _actual_ dictionary values for a particular Array must be stored
> > > somewhere and lifetime managed. I propose to put these as a single
> > > entry in ArrayData::child_data [4]. An alternative to this would be to
> > > modify ArrayData to have a dictionary field that would be unused
> > > except for encoded datasets
> > `child_data` is supposed to mirror more or less the order of buffers in
> > an IPC stream, right?  Therefore I would favour a dedicated dictionary
> > field (also makes fetching the dictionary trivial)
>
>
> Without seeing the code, I'd also be in favor a separate field.  Was there
> a reason you were in favor of adding another element to child_data?
>

I guess I should write some microbenchmarks to gauge the impact (if
any) of adding another non-trivial struct member. That was the main
reason (for the cost to non-dictionary arrays to be zero)

> On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Wes,
> >
> > Le 29/04/2019 à 20:10, Wes McKinney a écrit :
> > >
> > > * Receiving a record batch schema without the dictionaries attached
> > > (e.g. in Arrow Flight), see also experimental patch [2]
> >
> > Note that this was finally done in a separate PR, and only required
> > changes in the IPC implementation.
> >
> > > Here is my proposal to reconcile these issues in C++
> > >
> > > * Add a new "synthetic" data type called "variable dictionary" to be
> > > used alongside the existing "static dictionary" type. An instance of
> > > VariableDictionaryType (name TBD) will not know what the dictionary
> > > is, only the data type of the dictionary (e.g. utf8()) and the index
> > > type (e.g. int32())
> >
> > Interesting idea.  I'm curious to see a PR.
> >
> > > * Define common abstract API for instances of static vs variable
> > > dictionary arrays. Mainly this means making
> > > DictionaryArray::dictionary [3] virtual
> >
> > I'm not sure this is required, especially if the following is implemented:
> >
> > > * The _actual_ dictionary values for a particular Array must be stored
> > > somewhere and lifetime managed. I propose to put these as a single
> > > entry in ArrayData::child_data [4]. An alternative to this would be to
> > > modify ArrayData to have a dictionary field that would be unused
> > > except for encoded datasets
> >
> > `child_data` is supposed to mirror more or less the order of buffers in
> > an IPC stream, right?  Therefore I would favour a dedicated dictionary
> > field (also makes fetching the dictionary trivial).
> >
> > > This proposal does create some ongoing implementation and maintenance
> > > burden, but to that I would make these points:
> > >
> > > * Many algorithms will dispatch from one type to the other (probably
> > > static dispatching to the variable path), so there will not be a need
> > > to implement multiple times in most cases
> >
> > Sounds believable indeed.
> >
> > Regards
> >
> > Antoine.
> >

Reply via email to