Remek Zajac created ARROW-4688:
----------------------------------

             Summary: [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size
                 Key: ARROW-4688
                 URL: https://issues.apache.org/jira/browse/ARROW-4688
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Remek Zajac


We are working on Parquet files that involve nested lists. At most they are multi-dimensional lists of simple types (never structs), but I understand that, for Parquet, they are still nested columns and involve repetition levels.

Some of these columns hold lists of rather large byte arrays (that dominate the 
overall size of the row). When we bump the `row_group_size` to above 16MB we 
see: 

 
{code:java}
File "pyarrow/_parquet.pyx", line 700, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
{code}
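For reference, here is a minimal sketch of the write/read pattern that triggers this for us (illustrative only: the file name, column name and sizes are made up, and the data is synthetic rather than our real ingest):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# One nested column of type list<binary>, where each row holds ~20MB of byte
# arrays, so a single row group's column chunk is well above 16MB.
rows = [[b"\x00" * (1 << 20)] * 20 for _ in range(10)]
table = pa.Table.from_arrays(
    [pa.array(rows, type=pa.list_(pa.binary()))], names=["payload"])

# Keep all 10 rows (~200MB of binary data) in a single row group.
pq.write_table(table, "nested.parquet", row_group_size=10)

# On the affected version, reading the row group back raises the
# ArrowNotImplementedError shown above.
pq.ParquetFile("nested.parquet").read_row_group(0)
{code}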
 

I conclude it's 
[this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
 bit complaining:

 
{code:java}
template <typename ParquetType>
Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
  if (descr_->max_repetition_level() == 0) {
    // Flat, no action
    return Status::OK();
  }

  std::shared_ptr<Array> flat_array;

  // ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
  // not yet implemented
  if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
    if (inout_array->chunked_array()->num_chunks() > 1) {
      return Status::NotImplemented(
          "Nested data conversions not implemented for "
          "chunked array outputs");
{code}
 

This appears to happen in the call stack of ColumnReader::ColumnReaderImpl::NextBatch and to be provoked by [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604] constant:
{code:java}
template <>
void TypedRecordReader<ByteArrayType>::InitializeBuilder() {
  // Maximum of 16MB chunks
  constexpr int32_t kBinaryChunksize = 1 << 24;
  DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);
  builder_.reset(
      new ::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));
}
{code}
This appears to imply that the column chunk data, if larger than kBinaryChunksize (hardcoded to 16MB), is returned as a Datum::CHUNKED_ARRAY of more than one 16MB chunk, which ultimately leads to the Status::NotImplemented error.
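For what it is worth, our reading of the arithmetic, as a rough illustrative sketch (expected_chunks is a made-up helper, not part of Arrow, and the real split happens inside ChunkedBinaryBuilder):

{code:python}
import math

K_BINARY_CHUNKSIZE = 1 << 24  # the hardcoded 16MB limit quoted above

def expected_chunks(column_chunk_bytes):
    """Rough number of chunks the builder would emit for one column chunk."""
    return max(1, math.ceil(column_chunk_bytes / K_BINARY_CHUNKSIZE))

# Anything above 16MB of byte-array data per column chunk yields more than one
# chunk, which is exactly the case WrapIntoListArray rejects for nested columns.
print(expected_chunks(16 * 1024**2))   # 1  -> still readable
print(expected_chunks(200 * 1024**2))  # 13 -> NotImplemented for nested data
{code}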

We have no influence over the data we ingest, we have only some influence over how we flatten it, and we need to tune our row_group_size to something sensibly larger than 16MB.

We see no obvious workaround for this, so we need to ask: (1) does the above diagnosis appear to be correct, (2) do people see any sensible workarounds, and (3) is there an imminent intention to fix this in the Arrow community, and if not, how difficult would it be to fix (in case we can afford to help)?


