Hi, I have the following code that saves the Parquet files from my hourly batch to HDFS. My intent is to coalesce the output down to 1500 smaller files. On the first run I get 1500 files in HDFS, but on subsequent runs the number of files keeps increasing even though I call coalesce.
It's not getting coalesced to 1500 files as I want. I have also linked an example that I am following at the end. Please let me know if there is a different, more efficient way of doing this.

    val job = Job.getInstance()
    var filePath = "path"
    val metricsPath: Path = new Path(filePath)

    // Check if the input file exists, and delete it if it does
    val fs: FileSystem = FileSystem.get(job.getConfiguration)
    if (fs.exists(metricsPath)) {
      fs.delete(metricsPath, true)
    }

    // Configure the ParquetOutputFormat to use Avro as the serialization format
    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

    // You need to pass the schema to AvroParquet when you are writing objects,
    // but not when you are reading them. The schema is saved in the Parquet
    // file for future readers to use.
    AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

    // Create a PairRDD with all keys set to null and wrap each Metrics in a
    // serializable object
    val metricsToBeSaved = metrics.map(metricRecord =>
      (null, new SerializableMetrics(
        new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

    metricsToBeSaved.coalesce(1500)

    // Save the RDD to a Parquet file in our temporary output directory
    metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void],
      classOf[Metrics], classOf[ParquetOutputFormat[Metrics]],
      job.getConfiguration)

https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-not-getting-coalesced-to-smaller-number-of-files-tp25509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
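The likely culprit is that the return value of coalesce is being discarded. Spark RDDs are immutable: `metricsToBeSaved.coalesce(1500)` builds a new RDD with 1500 partitions but never modifies `metricsToBeSaved` itself, so the subsequent `saveAsNewAPIHadoopFile` writes the original RDD with however many partitions the upstream job produced. A minimal, self-contained sketch of the behavior (the object name and sizes are illustrative, and it assumes spark-core is on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative demo, not the original job: shows that coalesce() returns
// a NEW RDD and leaves the RDD it is called on untouched.
object CoalesceDemo {
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 64)

    rdd.coalesce(8) // returned RDD is discarded: this line has no effect

    val coalesced = rdd.coalesce(8) // keep the returned RDD instead
    (rdd.partitions.length, coalesced.partitions.length) // (64, 8)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("coalesce-demo").setMaster("local[2]"))
    val (before, after) = partitionCounts(sc)
    println(s"original: $before partitions, coalesced: $after partitions")
    sc.stop()
  }
}
```

Applied to the snippet above, that would mean saving the RDD that coalesce returns, e.g. `val coalesced = metricsToBeSaved.coalesce(1500)` followed by `coalesced.saveAsNewAPIHadoopFile(...)`.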