RDD.coalesce(n) returns a new RDD rather than modifying the original
RDD. So what you need is:
metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...)
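The same applies to every RDD transformation: ignoring the return value is a no-op. As an analogy using plain Scala immutable collections (not Spark, just to illustrate the return-a-new-value pattern):

```scala
// Like RDD transformations, operations on immutable collections
// return a new value and leave the original untouched.
val xs = List(1, 2, 3, 4)
xs.take(2)                  // result discarded -- xs is unchanged
assert(xs == List(1, 2, 3, 4))

val ys = xs.take(2)         // keep the returned value instead
assert(ys == List(1, 2))
```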
Cheng
On 11/29/15 12:21 PM, SRK wrote:
Hi,
I have the following code that saves Parquet files from my hourly batch to
HDFS. My intent is to coalesce the output down to 1500 smaller files. The
first run gives me 1500 files in HDFS, but on subsequent runs the number of
files keeps increasing even though I call coalesce.
It's not getting coalesced to 1500 files as I want. I have also included a
link to the example I based this on at the end. Please let me know if there
is a different and more efficient way of doing this.
val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)

// Check if the output path exists and delete it if it does
val fs: FileSystem = FileSystem.get(job.getConfiguration)
if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

// You need to pass the schema to AvroParquet when you are writing objects
// but not when you are reading them. The schema is saved in the Parquet
// file for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object
val metricsToBeSaved = metrics.map(metricRecord =>
  (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

metricsToBeSaved.coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void],
  classOf[Metrics], classOf[ParquetOutputFormat[Metrics]],
  job.getConfiguration)
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
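As an aside, the exists/delete check before writing can be sketched against a local filesystem with java.nio instead of Hadoop's FileSystem API; this is only an analogy for the HDFS check above, with a hypothetical /tmp path:

```scala
import java.nio.file.{Files, Paths}

// Local-filesystem analogue of the fs.exists/fs.delete check above.
// The path is hypothetical; on HDFS you would use Hadoop's FileSystem API.
val out = Paths.get("/tmp/metrics-demo")
Files.createDirectories(out)              // simulate leftover output
if (Files.exists(out)) Files.delete(out)  // remove it before re-writing
assert(!Files.exists(out))
```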
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-not-getting-coalesced-to-smaller-number-of-files-tp25509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
---------------------------------------------------------------------