Hi all,

Spark is taking too long to start the first stage when reading many small files from HDFS.
I am reading a folder that contains RC files:

    val file = sc.hadoopFile(
      "hdfs://hostname:8020/test_data2gb/",
      classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
      classOf[LongWritable],
      classOf[BytesRefArrayWritable])

and then parsing each record:

    val parsedData = file.map((tuple: (LongWritable, BytesRefArrayWritable)) => RCFileUtil.getData(tuple._2))

620 files of about 3 MB each (roughly 2 GB total) take considerably longer to start the first stage than 200 files of about 40 MB each (roughly 8 GB total). Do you have any idea about the reason?

Thanks!

Best Regards,
Cem Cayiroglu
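
For reference, here is a minimal self-contained sketch of the read path above, assuming a standard SparkContext setup and the Hive RCFile classes on the classpath (RCFileUtil.getData is kept as in the post; it looks like a custom parsing helper rather than a stock API):

    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed context setup; the original post does not show it.
    val sc = new SparkContext(new SparkConf().setAppName("rcfile-small-files"))

    // Each file yields at least one input split, so 620 small files mean
    // at least 620 tasks in the first stage, versus ~200 for the larger files.
    val file = sc.hadoopFile(
      "hdfs://hostname:8020/test_data2gb/",
      classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
      classOf[LongWritable],
      classOf[BytesRefArrayWritable])

    // RCFileUtil.getData is the poster's parsing helper, used as-is here.
    val parsedData = file.map { case (_, value) => RCFileUtil.getData(value) }

    // Quick way to see how many partitions (tasks) the first stage will run.
    println(file.partitions.length)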