I'm undertaking some refactoring to expose the decoding internals more
directly to the Arrow builder classes; see
https://issues.apache.org/jira/browse/PARQUET-1508.
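To make the direction concrete, here is a rough sketch of the kind of
interface the refactoring could expose. The class and method names below
are hypothetical, purely for illustration, and not the final PARQUET-1508
API:

    #include "arrow/builder.h"

    // Hypothetical shape of the exposed decoding internals: instead of
    // materializing values into an intermediate buffer, the decoder
    // appends them straight into a caller-supplied Arrow builder.
    class ByteArrayDecoder {
     public:
      virtual ~ByteArrayDecoder() = default;

      // Decode up to num_values byte-array values from the current data
      // page, appending each one to `builder`. Returns the number of
      // values actually decoded.
      virtual int DecodeArrowInto(int num_values,
                                  arrow::BinaryDictionaryBuilder* builder) = 0;
    };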
I'll try to have a patch up sometime today or over the weekend for you to
review (WIP: https://github.com/wesm/arrow/tree/parquet-decode-into-arrow-builder).
After that we should be able to alter the Arrow read path to read directly
into an arrow::BinaryDictionaryBuilder.
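To illustrate why the builder-based path is attractive, here is a minimal
sketch (not the actual patch) using the Arrow C++ API: the dictionary
builder hashes each appended value, so repeated values share a single
dictionary entry and Finish() produces an arrow::DictionaryArray directly.
The function name is made up for illustration.

    #include <memory>
    #include "arrow/api.h"

    // Accumulate decoded values into a dictionary-encoded Arrow array.
    arrow::Status BuildDictionaryColumn(std::shared_ptr<arrow::Array>* out) {
      arrow::BinaryDictionaryBuilder builder(arrow::default_memory_pool());
      // Values as they might arrive from the Parquet decoder:
      ARROW_RETURN_NOT_OK(builder.Append("foo"));
      ARROW_RETURN_NOT_OK(builder.Append("bar"));
      ARROW_RETURN_NOT_OK(builder.Append("foo"));  // reuses the "foo" entry
      ARROW_RETURN_NOT_OK(builder.AppendNull());
      // Yields a DictionaryArray with indices [0, 1, 0, null] over the
      // dictionary ["foo", "bar"].
      return builder.Finish(out);
    }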
On Thu, Jan 24, 2019 at 10:59 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Thanks Wes,
>
> Glad to hear this is in your plan.
>
> I probably should have done this earlier...but here are some JIRA tickets
> that seem to cover this:
>
> https://issues.apache.org/jira/browse/ARROW-3772
> https://issues.apache.org/jira/browse/ARROW-3325
> https://issues.apache.org/jira/browse/ARROW-3769
>
> On 1/24/19, 4:27 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
>
> hi Hatem,
>
> There are several issues open about this already (I'll have to dig
> them up), so this is something that we have desired for a long time
> but have not gotten around to implementing.
>
> Since many Parquet writers use dictionary encoding, it would make the
> most sense to have an option to return DictionaryArray (which can be
> converted to pandas.Categorical) from any column; internally we will
> perform the conversion from the encoded Parquet format as efficiently
> as possible.
>
> There are several cases to consider:
>
> * Dictionary encoded, but with a different dictionary in each row group
>   (this is actually the most likely scenario)
> * Dictionary encoded, with the same dictionary in all row groups
> * PLAIN encoded data that we pass through a DictionaryBuilder as it is
>   decoded, to yield a DictionaryArray
> * Dictionary encoded, but switching over to PLAIN encoding mid-stream
>
> For usability, having column metadata that automatically "opts in" to
> the DictionaryArray conversion sounds reasonable (so long as Arrow
> readers have a way to opt out, probably via a global flag to ignore
> such custom metadata fields).
>
> Part of the reason this work was not done in the past was that some
> of our hash table machinery was a bit immature. Antoine has recently
> improved things significantly, so it should be a lot easier now to do
> this work. This is quite a large project, though, and one that affects
> a _lot_ of users, so I would be willing to take an initial pass at the
> implementation.
>
> Along with completing the nested data read/write path, I would say
> this is the second-highest-priority project in parquet-cpp for Arrow
> users.
>
> - Wes
>
> On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal <hatem.he...@mathworks.co.uk>
> wrote:
> >
> > Hi everyone,
> >
> > I wanted to gauge interest and feasibility for adding support for
> > natively reading an arrow::DictionaryArray from a Parquet file.
> > Currently, an arrow::DictionaryArray that is written to Parquet is
> > read back as the native index type [0]. I came across a prior
> > discussion of this problem in the context of pandas [1], but I think
> > this would be useful for other Arrow clients (C++ or otherwise).
> >
> > The solution I had in mind would be to add Arrow type information as
> > column metadata. This metadata would then be used when reading back
> > the Parquet file to determine which Arrow type to create for the
> > column data.
> >
> > I'm willing to contribute this feature but first wanted to get some
> > feedback on whether it would be generally useful and whether the
> > high-level proposed solution makes sense.
> >
> > Thanks!
> >
> > Hatem
> >
> > [0] This test demonstrates this behavior:
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
> > [1] https://github.com/apache/arrow/issues/1688