Hi,

is there a way to group the items in an RDD so that I can process each group
with parallelize/map?

Let's say I have data items with keys 0...999, e.g. loaded via
rdd = sc.newAPIHadoopFile(...).cache()

Now I would like them to be processed in chunks of, say, ten items each:
chunk1=[0..9], chunk2=[10..19], ..., chunk100=[990..999]

sc.parallelize([chunk1, ..., chunk100]).map(process_my_chunk)
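
To make the pseudo-code above concrete, here is a minimal sketch of what I
mean (process_my_chunk is just a placeholder, and for simplicity I pretend
the keys 0...999 are known on the driver):

# Build the chunks of keys on the driver and hand them to Spark.
key_chunks = [list(range(start, start + 10)) for start in range(0, 1000, 10)]

def process_my_chunk(keys):
    # Placeholder for the real per-chunk computation.
    return (keys[0], keys[-1], len(keys))

print(sc.parallelize(key_chunks).map(process_my_chunk).take(3))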

I thought I could use groupBy() or something like that, but the return type
is a PipelinedRDD, which is not iterable.
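
For reference, this is roughly the shape of what I am attempting with
groupBy (only a sketch; it assumes the keys are integers and uses a
placeholder process_chunk):

def process_chunk(chunk_id, items):
    # Placeholder: 'items' is a plain list of (key, value) records for this chunk.
    return (chunk_id, len(items))

# Bucket the records by key // 10, i.e. ten consecutive keys per chunk.
chunks = rdd.groupBy(lambda kv: kv[0] // 10)
results = chunks.map(lambda group: process_chunk(group[0], list(group[1])))
print(results.take(3))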
Does anybody have an idea?
Thanks in advance,
 Tassilo






