Hi,
I just found out that reading Parquet files can produce a lot of empty
input partitions.
Sample code follows:
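import scala.collection.JavaConverters._
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{Job, JobID}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.rdd.NewHadoopRDD

// sc is the SparkContext; MyAvroType is the Avro-generated record class.
val hconf = sc.hadoopConfiguration
val job = new Job(hconf)
FileInputFormat.setInputPaths(job, new Path("path_to_data"))
ParquetInputFormat.setReadSupportClass(job,
  classOf[AvroReadSupport[MyAvroType]])

val rdd = new NewHadoopRDD[Void, MyAvroType](
  sc,
  classOf[ParquetInputFormat[MyAvroType]],
  classOf[Void],
  classOf[MyAvroType],
  job.getConfiguration
)

// Print the input splits computed by the input format.
val ctx = rdd.newJobContext(job.getConfiguration, new JobID())
val inputFormat = new ParquetInputFormat[MyAvroType]()
inputFormat.getSplits(ctx).asScala.foreach(println)

// Count the records in each partition.
val sizes = rdd.mapPartitions { iter =>
  List(iter.size).iterator
}.collect().toList
sizes.foreach(println)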
The splits are ok:
ParquetInputSplit{part: file:/folder/test_file start: 0 end: 33554432 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 33554432 end: 67108864 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 67108864 end: 100663296 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 100663296 end: 106022166 length: 5358870 hosts: [localhost]}
However, the partition sizes are:
0
4365522
0
0
Essentially, one partition holds all the records.
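One possible explanation, which I have not verified against this file: if
the reader hands each Parquet row group to the split whose byte range
contains the row group's midpoint, and the file holds a single large row
group, then only one split would get any records. The sketch below is plain
Scala arithmetic over the split boundaries printed above. The names
ByteRange, RowGroupAssignment and owningSplit are hypothetical, and the
single row group spanning the whole file is an assumption, not something
read from the file footer.

```scala
// Hypothetical model of a split as a half-open byte range [start, end).
final case class ByteRange(start: Long, end: Long) {
  def contains(offset: Long): Boolean = offset >= start && offset < end
}

object RowGroupAssignment {
  // Split boundaries copied from the getSplits output above.
  val splits: Seq[ByteRange] = Seq(
    ByteRange(0L, 33554432L),
    ByteRange(33554432L, 67108864L),
    ByteRange(67108864L, 100663296L),
    ByteRange(100663296L, 106022166L)
  )

  // Assumption: a single row group covering roughly the whole file.
  val rowGroupStart: Long = 0L
  val rowGroupSize: Long = 106022166L

  // Index of the split whose range contains the row group's midpoint,
  // or -1 if no split contains it.
  def owningSplit(start: Long, size: Long): Int = {
    val midpoint = start + size / 2
    splits.indexWhere(_.contains(midpoint))
  }

  def main(args: Array[String]): Unit = {
    println(owningSplit(rowGroupStart, rowGroupSize))
  }
}
```

Under these assumptions the midpoint is byte 53011083, which falls in the
second split (index 1). That would be consistent with the sizes above, where
only the second partition is non-empty, but it remains a guess until the row
group layout of the file is checked.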
Reading the same data with Spark SQL works fine.
I'm using Spark 1.6.1 and parquet-avro 1.7.0.
Thanks!
--
*JU Han*
Software Engineer @ Teads.tv
+33 0619608888