Hi, is there a way to group items in an RDD so that I can process the groups using parallelize/map?
Let's say I have data items with keys 1...1000, e.g. loaded as RDD = sc.newAPIHadoopFile(...).cache(). Now I would like them to be processed in chunks of ten, e.g. chunk1=[0..9], chunk2=[10..19], ..., chunk100=[990..999], so that I can do something like sc.parallelize([chunk1, ..., chunk100]).map(process_my_chunk).

I thought I could use groupBy() or something like that, but the return type is a PipelinedRDD, which is not iterable. Does anybody have an idea?

Thanks in advance,
Tassilo
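
P.S. Just to make the intent concrete, here is a minimal sketch of the kind of chunked processing I am after; the parallelized 0..999 range and the process_chunk function are only placeholders for my real Hadoop input and per-chunk logic:

    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-rdd-sketch")

    # Placeholder for the real per-chunk processing.
    def process_chunk(items):
        items = list(items)   # materialize the chunk's iterable
        return len(items)     # stand-in for the actual work

    # Toy stand-in for the Hadoop-backed RDD: plain keys 0..999.
    rdd = sc.parallelize(range(1000))

    # Assign every element to a chunk of ten via integer division on its key,
    # then run process_chunk once per chunk.
    chunks = rdd.groupBy(lambda k: k // 10)   # -> (chunk_id, iterable of items)
    results = chunks.map(lambda pair: (pair[0], process_chunk(pair[1])))

    print(results.take(3))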