Hi, I'm running into an issue where Spark (both 1.6.1 and 2.0.0) fails with a GC overhead limit exceeded error when creating a DataFrame from a parquet-backed, partitioned Hive table that has a relatively large number of parquet files (~175 partitions, each containing many parquet files). If I then use Hive directly to create a new table from the partitioned table with CREATE TABLE AS, Hive completes it without any problem, and Spark then has no trouble reading the resulting table.
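For concreteness, the read that triggers the failure is nothing exotic; it looks roughly like the following (the database and table names here are just placeholders):

    // Spark 1.6.1 via HiveContext; on 2.0.0 the equivalent spark.table(...) fails the same way
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    // Simply creating the DataFrame is where the driver dies with the GC overhead error
    val df = sqlContext.table("mydb.partitioned_parquet_table")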
Part of the problem is that whenever we insert records into a parquet table, each insert creates a new parquet file, which results in many small parquet files for a streaming job. Since HDFS supports file appends, couldn't the new records be appended to an existing parquet file as a new row group? If I understand correctly, this would be fairly straightforward: append the new data pages, then write a copy of the existing footer with the new row groups included. It wouldn't be as optimal as writing a whole new parquet file containing all the data, but it would be much better than creating many small files (for many reasons, including the crash described above). And I'm sure I can't be the only one struggling with streaming output to parquet. I know the typical solution is to periodically compact the small files into larger files, but it seems like parquet ought to be appendable as-is, which would obviate the need for that.

Here's a partial stack trace of the error for reference:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.ObjectStreamClass.getClassDataLayout0(ObjectStreamClass.java:1251)
        at java.io.ObjectStreamClass.getClassDataLayout(ObjectStreamClass.java:1195)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1885)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)

Thanks,
Jeremy
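P.S. For what it's worth, the footer-rewriting step I'm describing is essentially what the compaction workaround already does at the row-group level: parquet-mr's ParquetFileWriter.appendFile copies the row groups of existing files into a new file without decoding the pages, then writes a fresh footer. Here's a rough sketch of what I mean (assuming a parquet-mr version that includes appendFile, and that all input files share the same schema; the method name below is just for illustration):

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}

    // Merge many small parquet files into one by copying row groups byte-for-byte.
    def mergeParquetFiles(conf: Configuration, inputs: Seq[Path], output: Path): Unit = {
      // Assumes every input shares the same schema; take it from the first footer.
      val schema = ParquetFileReader
        .readFooter(conf, inputs.head, ParquetMetadataConverter.NO_FILTER)
        .getFileMetaData
        .getSchema

      val writer = new ParquetFileWriter(conf, schema, output, ParquetFileWriter.Mode.CREATE)
      writer.start()
      // appendFile copies each input's row groups into the output without re-encoding them
      inputs.foreach(input => writer.appendFile(conf, input))
      // Write a new footer that lists all of the copied row groups
      writer.end(Map.empty[String, String].asJava)
    }

The only difference is that this writes a brand-new file; appending in place would amount to doing the same row-group bookkeeping against the original file's footer.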