Hi,

I'm running into an issue where Spark (both 1.6.1 and 2.0.0) fails with a
GC overhead limit error when creating a DataFrame from a Parquet-backed
partitioned Hive table with a relatively large number of Parquet files
(~175 partitions, each containing many Parquet files).  If I then use
Hive directly to create a new table from the partitioned table with
CREATE TABLE AS, Hive completes that with no problem, and Spark then has
no trouble reading the resulting table.
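
For concreteness, the failing read itself is nothing exotic; roughly the
following (the table name is illustrative, and on 1.6.1 I'm going through
HiveContext rather than SparkSession):

import org.apache.spark.sql.SparkSession

// Illustrative table name; the real table is Parquet-backed with ~175
// partitions, each holding many small Parquet files.
val spark = SparkSession.builder()
  .appName("read-partitioned-parquet-table")
  .enableHiveSupport()
  .getOrCreate()

// Creating the DataFrame from the partitioned table is where the
// "GC overhead limit exceeded" error shows up.
val df = spark.table("mydb.partitioned_parquet_table")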

Part of the problem is that whenever we insert records into a Parquet
table, a new Parquet file is created; for a streaming job this results in
many small Parquet files. Since HDFS supports file appending, couldn't the
records be appended to the existing Parquet file as a new row group? If I
understand correctly, this would be pretty straightforward: append the new
data pages and then write a copy of the existing footer with the new row
groups included.  It wouldn't be as efficient as writing a whole new
Parquet file containing all the data, but it would be much better than
creating many small files (for many different reasons, including the crash
case above).  And I'm sure I can't be the only one struggling with
streaming output to Parquet.
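
To make that concrete, here's a rough Scala sketch of the append I have in
mind, written against a local file with RandomAccessFile. parseFooter and
serializeFooter are hypothetical stand-ins for the Thrift FileMetaData
handling that parquet-format actually provides, and the row-group metadata
is left opaque; this is only to illustrate the file-level mechanics, not a
real Parquet API:

import java.io.RandomAccessFile
import java.nio.{ByteBuffer, ByteOrder}

object ParquetAppendSketch {

  private val Magic = "PAR1".getBytes("ASCII")

  // Hypothetical stand-ins for the Thrift FileMetaData (de)serialization
  // that parquet-format actually provides; row-group metadata is opaque here.
  case class Footer(rowGroups: Seq[Array[Byte]])
  def parseFooter(bytes: Array[Byte]): Footer = ???
  def serializeFooter(footer: Footer): Array[Byte] = ???

  // Append pre-encoded pages for new row groups to an existing Parquet file.
  // newRowGroups is the (already offset-adjusted) metadata for those pages.
  def appendRowGroups(path: String,
                      newPages: Array[Byte],
                      newRowGroups: Seq[Array[Byte]]): Unit = {
    val file = new RandomAccessFile(path, "rw")
    try {
      // A Parquet file ends with: <footer> <4-byte LE footer length> <"PAR1">.
      file.seek(file.length - 8)
      val lenBuf = new Array[Byte](4)
      file.readFully(lenBuf)
      val footerLen = ByteBuffer.wrap(lenBuf).order(ByteOrder.LITTLE_ENDIAN).getInt

      // Read the existing footer so its row-group list can be carried forward.
      file.seek(file.length - 8 - footerLen)
      val oldFooterBytes = new Array[Byte](footerLen)
      file.readFully(oldFooterBytes)
      val oldFooter = parseFooter(oldFooterBytes)

      // Append-only from here on (which is all HDFS would allow anyway):
      // new data pages go after the old footer, which becomes dead bytes,
      // followed by a merged footer, its length, and the trailing magic.
      file.seek(file.length)
      file.write(newPages)
      val merged = serializeFooter(Footer(oldFooter.rowGroups ++ newRowGroups))
      file.write(merged)
      file.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
        .putInt(merged.length).array())
      file.write(Magic)
    } finally {
      file.close()
    }
  }
}

Since readers locate the footer via the last 8 bytes of the file, the old
footer just becomes dead bytes in the middle and the merged footer at the
end is the one that gets picked up.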

I know the typical solution to this is to periodically compact the small
files into larger ones, but it seems like Parquet ought to be appendable
as-is, which would obviate the need for that.
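
For what it's worth, the compaction workaround I'm referring to is simple
enough in Spark; something along these lines, with illustrative paths and a
hand-picked output file count, writing to a new location and swapping it in
afterwards:

import org.apache.spark.sql.SparkSession

object CompactParquetPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-parquet")
      .getOrCreate()

    // Read the many small files for one partition and rewrite them as a
    // handful of larger ones. Paths and the target file count (8) are
    // illustrative; the output goes to a new location and gets swapped in.
    spark.read.parquet("/warehouse/events/dt=2016-08-01")
      .coalesce(8)
      .write
      .mode("overwrite")
      .parquet("/warehouse/events_compacted/dt=2016-08-01")

    spark.stop()
  }
}

It works, but it rereads and rewrites data that has already been written
once, which is exactly the overhead that appending row groups in place
would avoid.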

Here's a partial trace of the error for reference:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.ObjectStreamClass.getClassDataLayout0(ObjectStreamClass.java:1251)
        at java.io.ObjectStreamClass.getClassDataLayout(ObjectStreamClass.java:1195)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1885)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)

Thanks,
Jeremy
