Re: Iterator over RDD in PySpark

2014-08-02 Thread Andrei
Excellent, thank you!

On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson wrote:
> Ah, that's unfortunate, that definitely should be added. Using a
> pyspark-internal method, you could try something like
>
> javaIterator = rdd._jrdd.toLocalIterator()
> it = rdd._collect_iterator_through_file(javaIte

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like

    javaIterator = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(javaIterator)

On Fri, Aug 1, 2014 at 3:04 PM, Andrei wrote:
> Thanks, Aaron, it should be fi

Re: Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python interface to it? I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python iterator automatically. E.g.:

>>> rdd = sc.para

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but requires that each individual partition fits in memory (rather than each individual line). Hopefully that's sufficient, though.

On Fri, Aug 1, 2014 at 1:38 AM, Andrei wrote:
> Is there a way to get iterator from RDD? Something like rdd.colle
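
The memory trade-off described above (one whole partition held at a time, rather than one line or the entire dataset) can be illustrated with a small pure-Python sketch. This is not Spark code and not how Spark implements toLocalIterator; the helper name local_iterator and the list of ranges standing in for partitions are hypothetical, chosen only to show the iteration pattern.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def local_iterator(partitions: Iterable[Iterable[T]]) -> Iterator[T]:
    """Yield elements one partition at a time.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    for partition in partitions:
        # Materialize exactly one partition, standing in for a single
        # fetch to the driver. Only this chunk is held in memory.
        chunk: List[T] = list(partition)
        for item in chunk:
            yield item


# Three ranges standing in for the partitions of an RDD (an assumption).
partitions = [range(0, 3), range(3, 6), range(6, 9)]

# Stop early: partitions after the one containing the fourth element
# are never materialized.
first_four = []
for x in local_iterator(partitions):
    first_four.append(x)
    if len(first_four) == 4:
        break
```

Under these assumptions, peak memory is bounded by the largest partition, which is why repartitioning into smaller partitions would help when a single partition is too large.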