Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-15 Thread Wes McKinney
My patch for this is finally up:
https://github.com/apache/arrow/pull/4316
It was kind of a bloodbath, but I think this puts us on a sustainable path and unlocks a lot of efforts that we've been blocked on.
On Mon, May 13, 2019 at 10:01 AM Wes McKinney wrote:
>
> As I've ventured further in work

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-13 Thread Wes McKinney
As I've ventured further in working on this I've realized that it's not practical (or even a good idea) to continue to maintain the "fixed dictionary" path. Since the IPC protocol can have evolving dictionaries, nearly all code paths in the codebase have to change to work for the variable case, whi
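
One way to picture the "variable case" mentioned here is a consumer that receives chunks whose dictionaries differ and has to reconcile them. The following is only a rough sketch under assumptions: `ToyDictChunk` and `ConcatenateUnified` are hypothetical names invented for illustration, not Arrow C++ APIs, and dictionary unification is just one possible way to handle differing dictionaries, not necessarily what the patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy type (not an Arrow class): one dictionary-encoded chunk
// that carries its own dictionary, as a record batch in a stream might.
struct ToyDictChunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Concatenate chunks whose dictionaries may differ by building one unified
// dictionary (first-seen order) and remapping every index into it.
inline ToyDictChunk ConcatenateUnified(const std::vector<ToyDictChunk>& chunks) {
  ToyDictChunk out;
  std::unordered_map<std::string, int32_t> position;  // value -> unified index
  for (const ToyDictChunk& chunk : chunks) {
    // remap[i] is the unified index of this chunk's dictionary entry i.
    std::vector<int32_t> remap(chunk.dictionary.size());
    for (std::size_t i = 0; i < chunk.dictionary.size(); ++i) {
      auto inserted = position.emplace(
          chunk.dictionary[i], static_cast<int32_t>(out.dictionary.size()));
      if (inserted.second) {
        out.dictionary.push_back(chunk.dictionary[i]);
      }
      remap[i] = inserted.first->second;
    }
    for (int32_t idx : chunk.indices) {
      assert(idx >= 0 && static_cast<std::size_t>(idx) < remap.size());
      out.indices.push_back(remap[static_cast<std::size_t>(idx)]);
    }
  }
  return out;
}
```

The point of the sketch is that any code path touching dictionary-encoded data has to be prepared for index remapping of this kind once dictionaries are allowed to change between batches.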

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-07 Thread Wes McKinney
I have started working on this some to assess what is involved. My present plan is to have
* FixedDictionaryType and FixedDictionaryArray
* VariableDictionaryType and VariableDictionaryArray
* deprecate (?) current DictionaryType/DictionaryArray names, for clarity (thoughts about this would be welcome

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-01 Thread Hatem Helal
Thanks Wes, your proposed additional data type makes more sense to me.
> As a first use case for this I would be personally looking to address
> reads of encoded data from
> Parquet format without an intermediate pass through dense format
> (which can be slow and wasteful for heavily

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-30 Thread Wes McKinney
hi Hatem, Thanks for commenting. I am not sure your solution will work reliably because code is written against arrow::DictionaryType with the presumption that the dictionary is known and static, and can be obtained by invoking DictionaryType::dictionary. In the variable dictionary case, the dict
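
The presumption described here can be illustrated with a small toy model. This is a sketch only: `StaticDictType`, `StaticDictArray`, `Decode`, and `SameType` are hypothetical stand-ins made up for this example and are not the Arrow C++ classes. It assumes a design where the dictionary is stored on the type and every consumer reads it from there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical toy stand-ins; these are NOT the Arrow C++ classes.
// "Static" design: the dictionary is a member of the type itself.
struct StaticDictType {
  std::vector<std::string> dictionary;
};

struct StaticDictArray {
  std::shared_ptr<StaticDictType> type;
  std::vector<int32_t> indices;
};

// Decoding goes through the type, so it presumes the dictionary is known
// and fixed at the time the type is constructed.
inline std::vector<std::string> Decode(const StaticDictArray& arr) {
  std::vector<std::string> out;
  out.reserve(arr.indices.size());
  for (int32_t i : arr.indices) {
    assert(i >= 0 &&
           static_cast<std::size_t>(i) < arr.type->dictionary.size());
    out.push_back(arr.type->dictionary[static_cast<std::size_t>(i)]);
  }
  return out;
}

// In this toy model two arrays share a type only if their dictionaries are
// identical, so chunks with different dictionaries cannot share one schema.
inline bool SameType(const StaticDictArray& a, const StaticDictArray& b) {
  return a.type->dictionary == b.type->dictionary;
}
```

Under this model, two chunks of the same logical column that happen to use different dictionaries end up with different types, which is the kind of friction a variable-dictionary design would need to remove.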

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-30 Thread Hatem Helal
Hi Wes, Thanks for the detailed writeup and I think this is an important problem to solve. I spent some time thinking about this when working on ARROW-3769 and came to a similar conclusion that the current dictionary type was limiting when doing partial reads of Parquet files. I'm not sure if th

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
On Mon, Apr 29, 2019 at 2:59 PM Micah Kornfield wrote:
> > >
> > > * The _actual_ dictionary values for a particular Array must be stored
> > > somewhere and lifetime managed. I propose to put these as a single
> > > entry in ArrayData::child_data [4]. An alternative to this would be to
> > > modi

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Micah Kornfield
>
> > * The _actual_ dictionary values for a particular Array must be stored
> > somewhere and lifetime managed. I propose to put these as a single
> > entry in ArrayData::child_data [4]. An alternative to this would be to
> > modify ArrayData to have a dictionary field that would be unused
> > exc
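
The two storage options being discussed can be contrasted with a toy model. This is a sketch under assumptions: `ToyArrayData`, `MakeValues`, `MakeWithChild`, `MakeWithField`, `DictionaryOf`, and `ValueAt` are hypothetical names invented here and do not reflect the real `arrow::ArrayData` layout. It only shows that in either option the array holds a shared reference to its dictionary, so the dictionary's lifetime follows the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model; NOT arrow::ArrayData.
struct ToyArrayData {
  std::vector<int32_t> indices;     // payload of a dictionary-encoded array
  std::vector<std::string> values;  // payload of a plain string array
  std::vector<std::shared_ptr<ToyArrayData>> child_data;
  std::shared_ptr<ToyArrayData> dictionary;  // null unless option B is used
};

using DataPtr = std::shared_ptr<ToyArrayData>;

// Build the dictionary itself as a plain string array.
inline DataPtr MakeValues(std::vector<std::string> values) {
  auto data = std::make_shared<ToyArrayData>();
  data->values = std::move(values);
  return data;
}

// Option A: the dictionary is the single entry in child_data.
inline DataPtr MakeWithChild(std::vector<int32_t> indices, DataPtr dict) {
  auto data = std::make_shared<ToyArrayData>();
  data->indices = std::move(indices);
  data->child_data.push_back(std::move(dict));
  return data;
}

// Option B: a dedicated field that other array kinds leave null.
inline DataPtr MakeWithField(std::vector<int32_t> indices, DataPtr dict) {
  auto data = std::make_shared<ToyArrayData>();
  data->indices = std::move(indices);
  data->dictionary = std::move(dict);
  return data;
}

// Find the dictionary under either layout.
inline const ToyArrayData& DictionaryOf(const ToyArrayData& data) {
  if (data.dictionary) return *data.dictionary;
  assert(data.child_data.size() == 1);
  return *data.child_data[0];
}

// Look up the decoded value at logical position i (bounds-checked).
inline std::string ValueAt(const ToyArrayData& data, std::size_t i) {
  const ToyArrayData& dict = DictionaryOf(data);
  return dict.values.at(static_cast<std::size_t>(data.indices.at(i)));
}
```

In this toy version, option A reuses an existing member but overloads its meaning, while option B is explicit at the cost of a field that is empty for every non-dictionary array.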

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Antoine Pitrou
Hi Wes,
On 29/04/2019 at 20:10, Wes McKinney wrote:
>
> * Receiving a record batch schema without the dictionaries attached
> (e.g. in Arrow Flight), see also experimental patch [2]
Note that this was finally done in a separate PR, and only required changes in the IPC implementation.
> Here

[DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
hi all,
There have been many discussions in passing on various issues and JIRA tickets over the last months and years about how to manage dictionary-encoded columnar arrays in-memory in C++. Here's a list of some problems we have encountered:
* Dictionaries that may differ from one record batch t