*TL;DR:* I want to run a pre-processing step on the data from each partition (such as parsing) and retain the parsed object on each node for future processing calls to avoid repeated parsing.
/More detail:/ I have a server and two nodes in my cluster, with the data partitioned in HDFS. I am trying to use Spark to process the data and send back results. The data is available as text, and I would like to parse this text first and then run further processing on the parsed form. To do this, I call a simple:

    JavaRDD<String> rdd = ...; // lines loaded from HDFS
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
        @Override
        public void call(Iterator<String> i) {
            ParsedData p = new ParsedData(i);
        }
    });

I would like to retain this ParsedData object on each node for future processing calls, so as to avoid parsing all over again. In my next call, I'd like to do something like this:

    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
        @Override
        public void call(Iterator<String> i) {
            // refer to the previously created ParsedData object
            p.process(); // accumulate some results
        }
    });

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Persistent-Local-Node-variables-tp8104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
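For what it's worth, the "parse once, reuse on later calls" pattern described above can be sketched without Spark at all: a static map keyed by partition id holds the parsed object, and later calls look it up instead of re-parsing. In Spark this works because a static field lives per executor JVM and survives between tasks on the same node (though it is lost if the executor restarts). Everything here is hypothetical: ParsedData is a stand-in for the poster's class, and PartitionCache, parseOnce, and the partition ids are illustrative names, not Spark API.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for the poster's ParsedData class:
    // consumes the partition's lines once and keeps the parsed records.
    class ParsedData {
        final List<String> records = new ArrayList<>();
        ParsedData(Iterator<String> lines) {
            while (lines.hasNext()) records.add(lines.next().trim());
        }
        int process() { return records.size(); } // placeholder "processing"
    }

    public class PartitionCache {
        // In Spark, a static field like this lives once per executor JVM,
        // so it can carry state between foreachPartition calls on the
        // same node (an assumption this pattern relies on).
        static final ConcurrentHashMap<Integer, ParsedData> CACHE =
                new ConcurrentHashMap<>();

        // Parse only if no ParsedData exists yet for this partition.
        static ParsedData parseOnce(int partitionId, Iterator<String> lines) {
            return CACHE.computeIfAbsent(partitionId, id -> new ParsedData(lines));
        }

        public static void main(String[] args) {
            List<String> partition0 = Arrays.asList("a ", " b", "c");
            // First call parses; the second reuses the cached object and
            // never touches its (empty) iterator.
            ParsedData first = parseOnce(0, partition0.iterator());
            ParsedData again = parseOnce(0, Collections.<String>emptyIterator());
            System.out.println(first == again);  // true
            System.out.println(again.process()); // 3
        }
    }

The more idiomatic Spark alternative is to map the text RDD to a JavaRDD of parsed objects and call persist() on it, letting Spark keep the parsed partitions in memory; the static-field sketch above is only one way to hold state that persist() does not cover.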