RDD.coalesce(n) returns a new RDD rather than modifying the original
RDD. So what you need is:
metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...)
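The same applies to every RDD transformation: ignoring the return value is a no-op. As an analogy using plain Scala immutable collections (not Spark, just to illustrate the return-a-new-value pattern):

```scala
// Like RDD transformations, operations on immutable collections
// return a new value and leave the original untouched.
val xs = List(1, 2, 3, 4)
xs.take(2)                  // result discarded -- xs is unchanged
assert(xs == List(1, 2, 3, 4))

val ys = xs.take(2)         // keep the returned value instead
assert(ys == List(1, 2))
```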
Cheng
On 11/29/15 12:21 PM, SRK wrote:
Hi,
I have the following code that saves Parquet files from my hourly batch to
HDFS. My intent is to coalesce the output down to 1500 smaller files. The
first run gives me 1500 files in HDFS, but on subsequent runs the number of
files keeps increasing even though I call coalesce.
It's not getting coalesced to 1500 files as I want. I have also included a
link to the example I based this on at the end. Please let me know if there
is a different and more efficient way of doing this.
val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)

// Check if the output path exists and delete it if it does
val fs: FileSystem = FileSystem.get(job.getConfiguration)
if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

// You need to pass the schema to AvroParquet when you are writing objects
// but not when you are reading them. The schema is saved in the Parquet
// file for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object
val metricsToBeSaved = metrics.map(metricRecord =>
  (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

metricsToBeSaved.coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void],
  classOf[Metrics], classOf[ParquetOutputFormat[Metrics]],
  job.getConfiguration)
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
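As an aside, the exists/delete check before writing can be sketched against a local filesystem with java.nio instead of Hadoop's FileSystem API; this is only an analogy for the HDFS check above, with a hypothetical /tmp path:

```scala
import java.nio.file.{Files, Paths}

// Local-filesystem analogue of the fs.exists/fs.delete check above.
// The path is hypothetical; on HDFS you would use Hadoop's FileSystem API.
val out = Paths.get("/tmp/metrics-demo")
Files.createDirectories(out)              // simulate leftover output
if (Files.exists(out)) Files.delete(out)  // remove it before re-writing
assert(!Files.exists(out))
```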
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-not-getting-coalesced-to-smaller-number-of-files-tp25509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
---------------------------------------------------------------------