Rohan Garg created PARQUET-2141:
-----------------------------------

             Summary: Controlling memory utilization by ParquetReader
                 Key: PARQUET-2141
                 URL: https://issues.apache.org/jira/browse/PARQUET-2141
             Project: Parquet
          Issue Type: Improvement
            Reporter: Rohan Garg
In Apache Druid, Parquet is one of the popular input formats for ingesting data into a Druid cluster (https://druid.apache.org/docs/latest/development/extensions-core/parquet.html). We rely on the parquet-mr library to read the Parquet files and then convert them, row by row, into Druid's native format for ingestion. A considerable number of our use cases ingest whole Parquet files (i.e. all columns in a single shot) into the system.

A challenge we face is that the Parquet reader loads an entire row group into memory as part of its normal operation. Row groups can be quite large (1 GB, for example), and this sometimes puts pressure on our reader JVM and leads to OOMs. In other cases it creates GC pressure on the JVM, reducing the throughput of the ingestion tasks.

To mitigate this problem, we are considering whether it would be better to have an option to download the Parquet row group/file first and memory-map it for reading. The code that buffers the row group already works against the ByteBuffer interface (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1763), so it seems like a MappedByteBuffer-backed implementation could slot in there too. That would take pressure off our reader JVM and greatly reduce the chances of OOMs. A rough sketch of the direction we have in mind is below. We're very open to more ideas or solutions that have already been tried for this problem.
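To make the idea concrete, here is a minimal, illustrative sketch using plain java.nio. It assumes the remote Parquet file has already been downloaded to a local path; the path, row-group offset/length, and the mapRange helper are hypothetical placeholders for this ticket, not anything that exists in parquet-mr today.

{code:java}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MMapRowGroupSketch {

  /**
   * Maps a byte range of a locally downloaded Parquet file read-only.
   * The mapped pages live in the OS page cache rather than on the reader's
   * heap, so a large row group no longer has to fit inside -Xmx.
   * Note: FileChannel.map is limited to 2 GB per mapping, so a very large
   * range would need to be split across multiple mappings.
   */
  static MappedByteBuffer mapRange(Path localCopy, long offset, long length) throws IOException {
    try (FileChannel channel = FileChannel.open(localCopy, StandardOpenOption.READ)) {
      // The mapping remains valid after the channel is closed.
      return channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical local copy and row-group coordinates; in practice the
    // offset and compressed size would come from the footer metadata.
    Path localCopy = Paths.get("/tmp/ingest/part-00000.parquet");
    long rowGroupOffset = 4L;                 // placeholder offset
    long rowGroupLength = 128L * 1024 * 1024; // placeholder compressed size

    MappedByteBuffer rowGroup = mapRange(localCopy, rowGroupOffset, rowGroupLength);
    // A ByteBuffer-based chunk reader could slice column chunks out of this
    // mapping instead of allocating heap buffers for the whole row group.
    System.out.println("mapped " + rowGroup.remaining() + " bytes");
  }
}
{code}

Since the existing row-group buffering already goes through ByteBuffer, the open question for us is whether a mapping like this can be plumbed through ParquetFileReader's chunk reading cleanly, and that is exactly what we'd like input on.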