I'm undertaking some refactoring to expose the decoding internals more
directly to the Arrow builder classes; see
https://issues.apache.org/jira/browse/PARQUET-1508.
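To make the direction concrete, here is a rough sketch of the kind of
interface the refactoring could expose. The class and method names below
are hypothetical, purely for illustration, and not the final PARQUET-1508
API:

    #include "arrow/builder.h"

    // Hypothetical shape of the exposed decoding internals: instead of
    // materializing values into an intermediate buffer, the decoder
    // appends them straight into a caller-supplied Arrow builder.
    class ByteArrayDecoder {
     public:
      virtual ~ByteArrayDecoder() = default;

      // Decode up to num_values byte-array values from the current data
      // page, appending each one to `builder`. Returns the number of
      // values actually decoded.
      virtual int DecodeArrowInto(int num_values,
                                  arrow::BinaryDictionaryBuilder* builder) = 0;
    };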
I'll try to have a patch up sometime today or over the weekend for you to
review (WIP: https://github.com/wesm/arrow/tree/parquet-decode-into-arrow-builder).
After that we should be able to alter the Arrow read path to read directly
into an arrow::BinaryDictionaryBuilder.
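To illustrate why the builder-based path is attractive, here is a minimal
sketch (not the actual patch) using the Arrow C++ API: the dictionary
builder hashes each appended value, so repeated values share a single
dictionary entry and Finish() produces an arrow::DictionaryArray directly.
The function name is made up for illustration.

    #include <memory>
    #include "arrow/api.h"

    // Accumulate decoded values into a dictionary-encoded Arrow array.
    arrow::Status BuildDictionaryColumn(std::shared_ptr<arrow::Array>* out) {
      arrow::BinaryDictionaryBuilder builder(arrow::default_memory_pool());
      // Values as they might arrive from the Parquet decoder:
      ARROW_RETURN_NOT_OK(builder.Append("foo"));
      ARROW_RETURN_NOT_OK(builder.Append("bar"));
      ARROW_RETURN_NOT_OK(builder.Append("foo"));  // reuses the "foo" entry
      ARROW_RETURN_NOT_OK(builder.AppendNull());
      // Yields a DictionaryArray with indices [0, 1, 0, null] over the
      // dictionary ["foo", "bar"].
      return builder.Finish(out);
    }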
On Thu, Jan 24, 2019 at 10:59 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Thanks Wes,
>
> Glad to hear this is in your plan.
>
> I probably should have done this earlier...but here are some JIRA tickets
> that seem to cover this:
>
> https://issues.apache.org/jira/browse/ARROW-3772
> https://issues.apache.org/jira/browse/ARROW-3325
> https://issues.apache.org/jira/browse/ARROW-3769
>
> On 1/24/19, 4:27 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
>
> hi Hatem,
>
> There are several issues open about this already (I'll have to dig
> them up), so this is something that we have desired for a long time
> but have not gotten around to implementing.
>
> Since many Parquet writers use dictionary encoding, it would make the
> most sense to have an option to return DictionaryArray (which can be
> converted to pandas.Categorical) from any column; internally we will
> perform the conversion from the encoded Parquet format as efficiently
> as possible.
>
> There are several cases to consider:
>
> * Dictionary encoded, but with a different dictionary in each row group
>   (this is actually the most likely scenario)
> * Dictionary encoded, with the same dictionary in all row groups
> * PLAIN encoded data that we pass through a DictionaryBuilder as it is
>   decoded, to yield a DictionaryArray
> * Dictionary encoded, but switching over to PLAIN encoding mid-stream
>
> For usability, having column metadata that automatically "opts in" to
> the DictionaryArray conversion sounds reasonable (so long as Arrow
> readers have a way to opt out, probably via a global flag to ignore
> such custom metadata fields).
>
> Part of the reason this work was not done in the past was that some
> of our hash table machinery was a bit immature. Antoine has recently
> improved things significantly, so it should be a lot easier now to do
> this work. This is quite a large project, though, and one that affects
> a _lot_ of users, so I would be willing to take an initial pass at the
> implementation.
>
> Along with completing the nested data read/write path, I would say
> this is the second-highest-priority project in parquet-cpp for Arrow
> users.
>
> - Wes
>
> On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal <hatem.he...@mathworks.co.uk>
> wrote:
> >
> > Hi everyone,
> >
> > I wanted to gauge interest and feasibility for adding support for
> > natively reading an arrow::DictionaryArray from a Parquet file.
> > Currently, an arrow::DictionaryArray that is written to Parquet is
> > read back as the native index type [0]. I came across a prior
> > discussion of this problem in the context of pandas [1], but I think
> > this would be useful for other Arrow clients (C++ or otherwise).
> >
> > The solution I had in mind would be to add Arrow type information as
> > column metadata. This metadata would then be used when reading back
> > the Parquet file to determine which Arrow type to create for the
> > column data.
> >
> > I'm willing to contribute this feature but first wanted to get some
> > feedback on whether it would be generally useful and whether the
> > high-level proposed solution makes sense.
> >
> > Thanks!
> >
> > Hatem
> >
> > [0] This test demonstrates this behavior:
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
> > [1] https://github.com/apache/arrow/issues/1688