Thanks for sharing this very illustrative benchmark.  It's really nice to see
the huge benefit for languages that have a type for modelling categorical data.

I'm interested in whether we can make the parquet/arrow integration 
automatically handle the round trip for Arrow DictionaryArrays.  We've had 
requests for this from users of the MATLAB-Parquet integration.  We've 
suggested workarounds to those users, but as your benchmark shows, the 
workarounds require enough memory to hold the "dense" representation.  I think 
this could be solved by writing the Arrow data type into the file metadata.  
An added benefit of doing this at the Arrow level is that any language that 
uses the C++ parquet/arrow integration could round-trip DictionaryArrays.
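
For concreteness, here is a minimal pyarrow sketch of the gap as it stands 
today.  The read_dictionary opt-in below is my understanding of what the PR 
linked in your mail exposes in Python (the exact spelling may differ), and 
the file name is only illustrative.  With the proposed metadata, the second 
read could happen automatically:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A dictionary-encoded (categorical-like) column.
    arr = pa.array(["low", "medium", "high"] * 1000).dictionary_encode()
    table = pa.Table.from_arrays([arr], names=["level"])
    pq.write_table(table, "levels.parquet")

    # Read back naively: the column comes back as a dense string array,
    # so the dictionary encoding is lost on the round trip.
    print(pq.read_table("levels.parquet").schema)
    # level: string

    # Today the caller has to opt in per column; storing the Arrow type
    # in the file metadata would let the reader restore it automatically.
    print(pq.read_table("levels.parquet", read_dictionary=["level"]).schema)
    # level: dictionary<values=string, indices=int32, ordered=0>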

I'm not yet sure how all the pieces would fit together, but let me know if 
there is interest and I'd be happy to flesh this out as a PR.


On 8/2/19, 4:55 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

    I've been working (with Hatem Helal's assistance!) over the last few
    months to put the pieces in place to enable reading BYTE_ARRAY columns
    in Parquet files directly into Arrow DictionaryArray. As context, it's
    not uncommon for a Parquet file to occupy ~100x less space on disk
    (sometimes an even greater compression factor) than the fully decoded
    data in memory when there are a lot of repeated strings. Users are
    sometimes frustrated when they read a "small" Parquet file and run
    into memory problems.
    
    I made a benchmark to exhibit an example "worst case" scenario:
    
    https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7
    
    In this example, we have a table with a single column containing 10
    million values drawn from a dictionary of 1000 values that is about 50
    kilobytes in size. Written to Parquet, the file is a little over 1
    megabyte thanks to Parquet's layers of compression. But read naively
    into an Arrow BinaryArray, it takes up about 500MB of memory (10M
    values * 54 bytes per value). With the new decoding machinery, we can
    skip the dense decoding of the binary data and append the Parquet
    file's internal dictionary indices directly into an
    arrow::DictionaryBuilder, yielding a DictionaryArray at the end. The
    end result uses less than 10% as much memory (about 40MB compared
    with 500MB) and is almost 20x faster to decode.
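    
    In Python, the new path looks roughly like the following minimal
    sketch (the read_dictionary option below is from the PR linked
    underneath and could still change; the file name is only
    illustrative):
    
        import numpy as np
        import pyarrow as pa
        import pyarrow.parquet as pq
    
        # 10 million values drawn from a dictionary of 1000 distinct
        # strings of ~50 bytes each, mirroring the setup in the gist.
        dictionary = pa.array(
            ["value-%03d-%s" % (i, "x" * 40) for i in range(1000)])
        indices = pa.array(
            np.random.randint(0, 1000, size=10_000_000, dtype=np.int32))
        table = pa.Table.from_arrays(
            [pa.DictionaryArray.from_arrays(indices, dictionary)],
            names=["f0"])
        pq.write_table(table, "dict_example.parquet")
    
        # Dense read: every value is materialized as a separate binary
        # string, roughly 10M values * ~50 bytes each, i.e. ~500MB.
        dense = pq.read_table("dict_example.parquet")
    
        # Dictionary read: the file's internal dictionary indices are
        # appended directly into an arrow::DictionaryBuilder, skipping
        # the dense decoding and yielding a DictionaryArray.
        encoded = pq.read_table("dict_example.parquet",
                                read_dictionary=["f0"])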
    
    The PR making this available finally in Python is here:
    https://github.com/apache/arrow/pull/4999
    
    Complex, multi-layered projects like this can be a little inscrutable
    when discussed strictly at a code/technical level, but I hope this
    helps show that employing dictionary encoding can have a lot of user
    impact, both in memory use and in performance.
    
    - Wes
    
