Hi Hatem,

There are several issues open about this already (I'll have to dig them up), so this is something we have wanted for a long time but have not gotten around to implementing.

Since many Parquet writers use dictionary encoding, it would make most sense to have an option to return DictionaryArray (which can be converted to pandas.Categorical) from any column, and internally we will perform the conversion from the encoded Parquet format as efficiently as possible. There are many cases to consider:

* Dictionary encoded, but different dictionaries in each row group (this is actually the most likely scenario)
* Dictionary encoded, but the same dictionary in all row groups
* PLAIN encoded data that we pass through DictionaryBuilder as it is decoded to yield DictionaryArray
* Dictionary encoded, but switch over to PLAIN encoding mid-stream

Having column metadata to automatically "opt in" to the DictionaryArray conversion sounds reasonable for usability, so long as Arrow readers have a way to opt out (probably via a global flag to ignore such custom metadata fields).

Part of the reason this work was not done in the past is that some of our hash table machinery was a bit immature. Antoine has recently improved things significantly, so it should be a lot easier now to do this work.

This is quite a large project, though, and one that affects a _lot_ of users, so I would be willing to take an initial pass on the implementation. Along with completing the nested data read/write path, I would say this is the 2nd highest priority project in parquet-cpp for Arrow users.

- Wes

On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Hi everyone,
>
> I wanted to gauge interest and feasibility for adding support for natively
> reading an arrow::DictionaryArray from a parquet file. Currently, writing an
> arrow::DictionaryArray is read back as the native index type [0]. I came
> across a prior discussion for this problem in the context of pandas [1] but I
> think this would be useful for other arrow clients (C++ or otherwise).
>
> The solution I had in mind would be to add arrow type information as column
> metadata. This metadata would then be used when reading back the parquet
> file to determine which arrow type to create for the column data.
>
> I’m willing to contribute this feature but first wanted to get some feedback
> on whether this would be generally useful and if the high-level proposed
> solution would make sense.
>
> Thanks!
>
> Hatem
>
>
> [0] This test demonstrates this behavior
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
> [1] https://github.com/apache/arrow/issues/1688
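
To make the cases listed in the reply concrete, here is a rough pure-Python sketch of how a reader could fold differing per-row-group dictionaries and PLAIN-encoded chunks into a single dictionary plus remapped indices. It is only an illustration under stated assumptions: the function name `read_as_dictionary` and the `("dict", ...)` / `("plain", ...)` chunk tuples are hypothetical, and this is not the parquet-cpp or Arrow API.

```python
def read_as_dictionary(chunks):
    """Fold decoded chunks into one (dictionary, indices) pair.

    Hypothetical input format (an assumption for this sketch):
      ("dict", dictionary_values, indices)  - a dictionary-encoded chunk
      ("plain", values)                     - a PLAIN-encoded chunk
    A None index or value stands for a null.
    """
    unified = []    # unified dictionary, in first-seen order
    position = {}   # value -> index into `unified` (the "hash table")
    out = []        # indices into `unified` for every row

    def intern(value):
        # Look up the value, appending it to the dictionary if it is new.
        idx = position.get(value)
        if idx is None:
            idx = position[value] = len(unified)
            unified.append(value)
        return idx

    for chunk in chunks:
        kind = chunk[0]
        if kind == "dict":
            _, dictionary, indices = chunk
            # Hash each dictionary entry once, then remap indices cheaply.
            transpose = [intern(v) for v in dictionary]
            out.extend(None if i is None else transpose[i] for i in indices)
        elif kind == "plain":
            _, values = chunk
            # PLAIN data has no dictionary, so every value is hashed.
            out.extend(None if v is None else intern(v) for v in values)
        else:
            raise ValueError(f"unknown chunk kind: {kind!r}")
    return unified, out


dictionary, indices = read_as_dictionary([
    ("dict", ["a", "b"], [0, 1, 1, None]),  # row group 1
    ("dict", ["b", "c"], [0, 1]),           # row group 2, different dictionary
    ("plain", ["c", "a", "d"]),             # writer fell back to PLAIN
])
print(dictionary)  # ['a', 'b', 'c', 'd']
print(indices)     # [0, 1, 1, None, 1, 2, 2, 0, 3]
```

When every row group shares the same dictionary, the transpose map is the identity after the first chunk, which is the case where a real implementation could skip the remapping entirely.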