Thanks a lot oubrik,
I see your point. My thinking was that Python's built-in sum() should
already work on iterators.
Anyway, I tried your approach:

    def mysum(iter):
        count = sum = 0
        for item in iter:
            count += 1
            sum += item
        return sum

    wordCountsGrouped = wordsGrouped.groupByKey().map(
        lambda (w, iterator): (w, mysum(iterator)))
    print wordCountsGrouped.collect()

But I get the error below. Any idea?
TypeError: unsupported operand type(s) for +=: 'int' and 'ResultIterable'
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
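
In case it helps narrow this down: I can trigger the same kind of TypeError in plain Python, without Spark, when each item handed to the summing function is itself a collection rather than a number. The sketch below is only a guess at what might be happening (I have not confirmed that my values are nested). The list-of-lists input is a made-up stand-in for whatever Spark actually passes in, and I renamed the variables so they do not shadow the built-ins sum and iter.

```python
def mysum(values):
    """Add up the items of any iterable of numbers."""
    total = 0
    for item in values:
        total += item
    return total


# Flat values: every item is a number, so the addition works.
flat = mysum(iter([1, 1, 1]))

# Nested values (hypothetical stand-in, NOT real Spark output):
# every item is itself a list, so `total += item` cannot work.
try:
    mysum(iter([[1, 1], [1]]))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(flat)
print(error_message)
```

Here flat comes out as 3, while the second call raises a TypeError naming 'int' and 'list', which looks a lot like the 'int' and 'ResultIterable' message above.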
Thanks,
Leonida
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Sum-elements-of-an-iterator-inside-an-RDD-tp23775p23778.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.