I have an RDD created as follows:

```java
JavaPairRDD<String, String> inputDataFiles = sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
```
On this RDD I perform a `map` to process individual files, and call `foreach` on the result to trigger that `map`:

```java
JavaRDD<Object[]> output = inputDataFiles.map(new Function<Tuple2<String, String>, Object[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Object[] call(Tuple2<String, String> v1) throws Exception {
        System.out.println("in map!");
        // do something with v1.
        return new Object[0]; // placeholder result
    }
});

output.foreach(new VoidFunction<Object[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(Object[] t) throws Exception {
        // do nothing!
        System.out.println("in foreach!");
    }
});
```

This code works perfectly fine in a standalone setup on my local laptop, accessing both local files and remote HDFS files. On the cluster, the same code produces no results. My intuition is that the data has not reached the individual executors, and hence neither the `map` nor the `foreach` runs. That is only a guess, but I am not able to figure out why this would not work on the cluster. I don't even see the print statements in `map` and `foreach` getting printed in cluster mode.

I notice a particular line in the standalone output that I do NOT see in the cluster execution:

```
16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345
```

I had similar code using `textFile()` that worked earlier on the cluster for individual files; the issue is with `wholeTextFiles()` only. Please advise the best way to get this working, or alternative approaches (a minimal end-to-end sketch of what I am running is at the end of this post).

My setup is the Cloudera 5.7 distribution with the Spark service, and I used `yarn-client` as the master. The `foreach` action can be anything; it is just a dummy step to force the `map` to run. I also tried `System.out.println("Count is: " + output.count());`, which printed the correct answer of `10`, since there are 10 files in the folder, but the `map` still refuses to work.

Thanks.
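For reference, here is a minimal, self-contained sketch of the same read path, with a driver-side `take(1)` added so that any output prints on the driver console rather than inside the executors. The class name and the `take(1)` check are additions of mine for illustration, not part of the original job:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WholeTextFilesCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WholeTextFilesCheck");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Same call as in my job; one (path, content) pair per file.
        JavaPairRDD<String, String> inputDataFiles =
                sc.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

        // take() runs the job and brings a sample back to the driver,
        // so this println executes on the driver, not on an executor.
        List<Tuple2<String, String>> sample = inputDataFiles.take(1);
        for (Tuple2<String, String> t : sample) {
            System.out.println("File: " + t._1() + ", length=" + t._2().length());
        }

        sc.stop();
    }
}
```

Since `count()` already returns 10, a driver-side `take(1)` like this should at least show whether the file contents themselves come back from the executors, as opposed to just the file listing.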