Hi, I have the following code that saves the Parquet files from my hourly batch to HDFS. My intent is to coalesce the output down to 1500 smaller files. On the first run I get 1500 files in HDFS, but on subsequent runs the number of files keeps increasing even though I call coalesce.
It's not getting coalesced to 1500 files as I want. I have also linked an example that I am following at the end. Please let me know if there is a different, more efficient way of doing this.

    val job = Job.getInstance()
    var filePath = "path"
    val metricsPath: Path = new Path(filePath)

    // Check if the input file exists, and delete it if it does
    val fs: FileSystem = FileSystem.get(job.getConfiguration)
    if (fs.exists(metricsPath)) {
      fs.delete(metricsPath, true)
    }

    // Configure the ParquetOutputFormat to use Avro as the serialization format
    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

    // You need to pass the schema to AvroParquet when you are writing objects,
    // but not when you are reading them. The schema is saved in the Parquet
    // file for future readers to use.
    AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

    // Create a PairRDD with all keys set to null and wrap each Metrics in a
    // serializable object
    val metricsToBeSaved = metrics.map(metricRecord =>
      (null, new SerializableMetrics(
        new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

    metricsToBeSaved.coalesce(1500)

    // Save the RDD to a Parquet file in our temporary output directory
    metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void],
      classOf[Metrics], classOf[ParquetOutputFormat[Metrics]],
      job.getConfiguration)

https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-not-getting-coalesced-to-smaller-number-of-files-tp25509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
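The likely culprit is that the return value of coalesce is being discarded. Spark RDDs are immutable: `metricsToBeSaved.coalesce(1500)` builds a new RDD with 1500 partitions but never modifies `metricsToBeSaved` itself, so the subsequent `saveAsNewAPIHadoopFile` writes the original RDD with however many partitions the upstream job produced. A minimal, self-contained sketch of the behavior (the object name and sizes are illustrative, and it assumes spark-core is on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative demo, not the original job: shows that coalesce() returns
// a NEW RDD and leaves the RDD it is called on untouched.
object CoalesceDemo {
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 64)

    rdd.coalesce(8) // returned RDD is discarded: this line has no effect

    val coalesced = rdd.coalesce(8) // keep the returned RDD instead
    (rdd.partitions.length, coalesced.partitions.length) // (64, 8)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("coalesce-demo").setMaster("local[2]"))
    val (before, after) = partitionCounts(sc)
    println(s"original: $before partitions, coalesced: $after partitions")
    sc.stop()
  }
}
```

Applied to the snippet above, that would mean saving the RDD that coalesce returns, e.g. `val coalesced = metricsToBeSaved.coalesce(1500)` followed by `coalesced.saveAsNewAPIHadoopFile(...)`.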