I have an RDD created as follows:

    JavaPairRDD<String, String> inputDataFiles =
        sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

On this RDD I perform a map to process the individual files, and then invoke a
foreach to trigger that map, since the map by itself is lazy.

    JavaRDD<Object[]> output = inputDataFiles.map(
        new Function<Tuple2<String, String>, Object[]>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String, String> v1) throws Exception {
                System.out.println("in map!");
                // do something with v1 and build the result
                return new Object[0]; // placeholder for the real result
            }
        });

    output.foreach(new VoidFunction<Object[]>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(Object[] t) throws Exception {
            // do nothing!
            System.out.println("in foreach!");
        }
    });

This code works perfectly fine in a standalone setup on my local laptop,
accessing both local files and remote HDFS files.

On the cluster, the same code produces no results. My intuition is that the
data never reaches the individual executors, and hence neither the `map` nor
the `foreach` runs, but that is only a guess; I am not able to figure out why
this would not work on the cluster. I don't even see the print statements in
`map` and `foreach` getting printed in cluster-mode execution.
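One thing I am unsure about is whether `System.out.println` inside `map` and
`foreach` can ever show up on the driver console in this mode, since those
functions run on the executors and their stdout goes to the executor logs. A
small check I could run to force output onto the driver would be something
like the following (the sample size of 2 is arbitrary):

    // Pull a small sample back to the driver, so the printing happens
    // on the driver console instead of on the executors.
    java.util.List<Object[]> sample = output.take(2);
    System.out.println("Driver received " + sample.size() + " records");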

I notice a particular line in the standalone output that I do NOT see in the
cluster execution:

    16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345

I had similar code using textFile() that worked earlier for individual files
on the cluster; the issue is with wholeTextFiles() only.
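For comparison, the earlier textFile()-based version looked roughly like this
(the specific file name is just an example):

    JavaRDD<String> lines =
        sparkContext.textFile("hdfs://ip:8020/user/cdhuser/inputFolder/data1.txt");
    System.out.println("Line count: " + lines.count());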

Please advise on the best way to get this working, or on alternate
approaches.
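One variant I have come across but have not yet verified on this cluster is
passing an explicit minimum partition count to wholeTextFiles(), in case the
default partitioning is the problem (the value 10 below is arbitrary):

    // The second argument is a hint for the minimum number of partitions.
    JavaPairRDD<String, String> inputDataFiles =
        sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/", 10);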

My setup is the Cloudera 5.7 distribution with the Spark service. I used
`yarn-client` as the master.

The action can be anything; it's just a dummy step to invoke the map. I also
tried System.out.println("Count is: " + output.count());, for which I got the
correct answer of `10`, since there are 10 files in the folder, but the map
still refuses to do its work.
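If it helps with diagnosing, I can also count the actual map invocations with
an accumulator, since accumulator values come back to the driver even when
executor stdout is not visible. A sketch, using the Spark 1.x accumulator API
(org.apache.spark.Accumulator) that ships with CDH 5.7:

    final Accumulator<Integer> mapCalls = sparkContext.accumulator(0);

    JavaRDD<Object[]> probed = inputDataFiles.map(
        new Function<Tuple2<String, String>, Object[]>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String, String> v1) throws Exception {
                mapCalls.add(1); // recorded on executors, readable on the driver
                return new Object[0];
            }
        });

    probed.count(); // any action will do
    System.out.println("map was invoked " + mapCalls.value() + " times");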

Thanks.
