Hi all,

Spark is taking too long to start the first stage when reading many small files from HDFS.
I am reading a folder that contains RC files:

    val file = sc.hadoopFile(
      "hdfs://hostname:8020/test_data2gb/",
      classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
      classOf[LongWritable],
      classOf[BytesRefArrayWritable])

and then parsing each record:

    val parsedData = file.map((tuple: (LongWritable, BytesRefArrayWritable)) => RCFileUtil.getData(tuple._2))

620 files of about 3 MB each (roughly 2 GB total) take considerably longer to start the first stage than 200 files of about 40 MB each (roughly 8 GB total). Do you have any idea about the reason?

Thanks!

Best Regards,
Cem Cayiroglu
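
For reference, here is a minimal self-contained sketch of the read path above, assuming a standard SparkContext setup and the Hive RCFile classes on the classpath (RCFileUtil.getData is kept as in the post; it looks like a custom parsing helper rather than a stock API):

    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed context setup; the original post does not show it.
    val sc = new SparkContext(new SparkConf().setAppName("rcfile-small-files"))

    // Each file yields at least one input split, so 620 small files mean
    // at least 620 tasks in the first stage, versus ~200 for the larger files.
    val file = sc.hadoopFile(
      "hdfs://hostname:8020/test_data2gb/",
      classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
      classOf[LongWritable],
      classOf[BytesRefArrayWritable])

    // RCFileUtil.getData is the poster's parsing helper, used as-is here.
    val parsedData = file.map { case (_, value) => RCFileUtil.getData(value) }

    // Quick way to see how many partitions (tasks) the first stage will run.
    println(file.partitions.length)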