RE: Control default partition when load a RDD from HDFS

Diego García Valverde Wed, 17 Dec 2014 08:04:53 -0800

Why not is a good option to create a RDD per each 200Mb file and then apply the 
pre-calculations before merging them? I think the partitions per RDD must be 
transparent to the pre-calculations, and not to set them fixed to optimize the 
spark maps/reduces processes.


De: Shuai Zheng [mailto:szheng.c...@gmail.com]
Enviado el: miércoles, 17 de diciembre de 2014 16:01
Para: 'Sun, Rui'; user@spark.apache.org
Asunto: RE: Control default partition when load a RDD from HDFS

Nice, that is the answer I want.
Thanks!

From: Sun, Rui [mailto:rui....@intel.com]
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have 
implemented a customized FileInputFormat which overrides isSplitable() to 
return FALSE. If you do have such FileInputFormat, you can simply pass it as a 
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application load 1000 files, each file from 200M -  a few GB, and combine 
with other data to do calculation.
Some pre-calculation must be done on each file level, then after that, the 
result need to combine to do further calculation.
In Hadoop, it is simple because I can turn-off the file split for input format 
(to enforce each file will go to same mapper), then I will do the file level 
calculation in mapper and pass result to reducer. But in spark, how can I do it?
Basically I want to make sure after I load these files into RDD, it is 
partitioned by file (not split file and also no merge there), so I can call 
mapPartitions. Is it any way I can control the default partition when I load 
the RDD?
This might be the default behavior that spark do the partition (partitioned by 
file when first time load the RDD), but I can't find any document to support my 
guess, if not, can I enforce this kind of partition? Because the total file 
size is bigger, I don't want to re-partition in the code.

Regards,

Shuai

________________________________
Disclaimer: http://disclaimer.agbar.com

RE: Control default partition when load a RDD from HDFS

Reply via email to