Re: Round-trip of categorical data with Arrow and Parquet

Wes McKinney Thu, 24 Jan 2019 08:27:52 -0800

hi Hatem,

There are several issues open about this already (I'll have to dig
them up), so this is something that we have desired for a long time,
but have not gotten around to implementing.

Since many Parquet writers use dictionary encoding, it would make the
most sense to have an option to return DictionaryArray (which can be
converted to pandas.Categorical) from any column, with the conversion
from the encoded Parquet format performed internally as efficiently as
possible.

There are many cases to consider:

* Dictionary encoded, but different dictionaries in each row group
(this is actually the most likely scenario)
* Dictionary encoded, but the same dictionary in all row groups
* PLAIN encoded data that we pass through DictionaryBuilder as it is
decoded to yield DictionaryArray
* Dictionary encoded, but switch over to PLAIN encoding mid-stream
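
To make the cases above concrete, here is a minimal pure-Python sketch
of how they could all funnel into one unified dictionary. The class
name borrows "DictionaryBuilder" from above, but everything in it
(the method names, the dict-based hash table, ignoring nulls) is an
assumption for illustration, not the parquet-cpp or Arrow API.

```python
from typing import Dict, Hashable, List, Sequence


class DictionaryBuilder:
    """Toy stand-in for a hash-table-backed builder (hypothetical API)."""

    def __init__(self) -> None:
        self.values: List[Hashable] = []      # unified dictionary, first-seen order
        self.indices: List[int] = []          # one dictionary index per row
        self._lookup: Dict[Hashable, int] = {}  # value -> position in self.values

    def _get_or_insert(self, value: Hashable) -> int:
        idx = self._lookup.get(value)
        if idx is None:
            idx = len(self.values)
            self._lookup[value] = idx
            self.values.append(value)
        return idx

    def append_plain(self, values: Sequence[Hashable]) -> None:
        # PLAIN encoded data: hash every value as it is decoded.
        for value in values:
            self.indices.append(self._get_or_insert(value))

    def append_encoded(self, dictionary: Sequence[Hashable],
                       indices: Sequence[int]) -> None:
        # Dictionary encoded data: hash each dictionary entry once, then
        # remap the chunk-local indices into the unified dictionary.
        transpose = [self._get_or_insert(value) for value in dictionary]
        self.indices.extend(transpose[i] for i in indices)


builder = DictionaryBuilder()
builder.append_encoded(["a", "b"], [0, 1, 1, 0])  # row group 1
builder.append_encoded(["b", "c"], [1, 0, 1])     # row group 2, different dictionary
builder.append_plain(["a", "d"])                  # writer fell back to PLAIN
```

When every row group shares the same dictionary, the remapping is the
identity and a real implementation could skip it entirely; the sketch
does not attempt that optimization.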

For usability, having column metadata to automatically "opt in" to the
DictionaryArray conversion sounds reasonable, so long as Arrow readers
have a way to opt out (probably via a global flag to ignore such
custom metadata fields).
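
A rough sketch of that opt-in / opt-out decision, in plain Python. The
metadata key, its value, and the flag name are all made up for
illustration; nothing here reflects an existing Arrow or Parquet
convention.

```python
from typing import Mapping

# Hypothetical metadata key and value, invented for this sketch.
ARROW_TYPE_KEY = "arrow_type"
DICTIONARY_TYPE = "dictionary"


def read_as_dictionary(column_metadata: Mapping[str, str],
                       ignore_custom_metadata: bool = False) -> bool:
    """Decide whether a column should come back as a DictionaryArray."""
    if ignore_custom_metadata:
        # Global opt-out: the reader ignores the writer's hint.
        return False
    # Opt-in: only columns the writer tagged are converted.
    return column_metadata.get(ARROW_TYPE_KEY) == DICTIONARY_TYPE
```

With this shape, files written without the hint behave exactly as they
do today, and a reader that sets the flag never sees a DictionaryArray
it did not ask for.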

Part of the reason this work was not done in the past was because some
of our hash table machinery was a bit immature. Antoine has recently
improved things significantly, so it should be a lot easier now to do
this work. This is quite a large project, though, and one that affects
a _lot_ of users, so I would be willing to take an initial pass on the
implementation.

Along with completing the nested data read/write path I would say this
is the 2nd highest priority project in parquet-cpp for Arrow users.

- Wes

On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal wrote:
>
> Hi everyone,
>
> I wanted to gauge interest and feasibility for adding support for natively
> reading an arrow::DictionaryArray from a parquet file.  Currently, an
> arrow::DictionaryArray written to parquet is read back as its native index
> type [0].  I came across a prior discussion of this problem in the context
> of pandas [1] but I think this would be useful for other arrow clients (C++
> or otherwise).
>
> The solution I had in mind would be to add arrow type information as column 
> metadata.  This metadata would then be used when reading back the parquet 
> file to determine which arrow type to create for the column data.
>
> I’m willing to contribute this feature but first wanted to get some feedback 
> on whether this would be generally useful and if the high-level proposed 
> solution would make sense.
>
> Thanks!
>
> Hatem
>
>
> [0] This test demonstrates this behavior
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
> [1] https://github.com/apache/arrow/issues/1688
