Rohan Garg created PARQUET-2141:
-----------------------------------

             Summary: Controlling memory utilization by ParquetReader
                 Key: PARQUET-2141
                 URL: https://issues.apache.org/jira/browse/PARQUET-2141
             Project: Parquet
          Issue Type: Improvement
            Reporter: Rohan Garg


In Apache Druid, Parquet is one of the most popular input formats for ingesting 
data into a Druid cluster 
(https://druid.apache.org/docs/latest/development/extensions-core/parquet.html).
 We rely on the parquet-mr library to read Parquet files and convert them, row 
by row, into Druid's native format for ingestion. A considerable number of our 
use cases ingest whole Parquet files (i.e. all columns in a single shot) into 
the system.

A challenge we face is that the Parquet reader loads an entire row group into 
memory as part of its normal operation. Row groups can be quite large (on the 
order of 1 GB), which sometimes puts enough memory pressure on our reader JVM 
to cause OOMs. In other cases it creates GC pressure on the JVM that reduces 
the throughput of the ingestion tasks.

To mitigate this problem, we are wondering whether it would be better to have 
an option to download the Parquet row group/file first and memory-map it for 
reading. The code that buffers the row group already works against the 
ByteBuffer interface 
(https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1763),
 so it seems it could accommodate a MappedByteBuffer implementation as well. 
That would take pressure off our reader JVM and greatly reduce the chances of 
OOMs.
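
To make the idea concrete, here is a minimal sketch of what we have in mind, 
assuming the file (or the relevant byte range) has already been downloaded to 
local disk; the class and method names are hypothetical and not part of 
parquet-mr today:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps a locally downloaded Parquet file so its bytes
// are served from the OS page cache instead of JVM heap buffers.
public class MappedParquetInput {
  public static MappedByteBuffer map(Path localCopy) throws IOException {
    try (FileChannel channel =
        FileChannel.open(localCopy, StandardOpenOption.READ)) {
      // Read-only mapping of the whole file; pages are faulted in on demand,
      // so a ~1 GB row group no longer has to live in on-heap ByteBuffers.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }
  }
}

Since MappedByteBuffer is itself a ByteBuffer, a buffer produced this way 
could, in principle, be handed to the same code paths that currently consume 
heap buffers when a row group is read.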

We're very open to other ideas, or to solutions that have already been tried, 
for this problem.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
