Will be in 1.0.0
On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas wrote:
Ah, now I see what Aaron was referring to. So I'm guessing we will get this
in the next release or two. Thank you.
On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra wrote:
There is a repartition method in pyspark master:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
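For anyone following the thread without a Spark master build handy, the effect of repartition(n) can be sketched in plain Python: redistribute a dataset's records across n partitions, roughly evenly. This is only an illustration of the idea (the function name and the round-robin scheme here are made up for the sketch, not PySpark's actual implementation, which involves a shuffle):

```python
def repartition(records, n):
    # Round-robin the records into n buckets, so each bucket ends up
    # with a near-equal share. Spark does this with a shuffle; this is
    # just the conceptual effect.
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[i % n].append(record)
    return partitions

parts = repartition(list(range(10)), 3)
# 3 partitions, each holding 3-4 of the 10 records
```

The point, for the gzipped-file case above, is that a single-partition RDD becomes n partitions that downstream stages can work on in parallel.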
On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas wrote:
Update: I'm now using this ghetto function to partition the RDD I get back
when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
    # Mutable dict stands in for a counter, since Python 2 has no nonlocal.
    counter = {'a': 0}
    def count_up(x):
        counter['a'] += 1
        return counter['a']
    # Key each record by the running count, spread the keys across
    # partitions, then drop the synthetic keys. (The message was cut off
    # after keyBy; the partitionBy/values tail is a plausible completion.)
    return (rdd.keyBy(count_up)
               .partitionBy(numPartitions)
               .values())
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.
The approach I'm going with to partition my MappedRDD is to key it by a
random int, and then partition it.
So something like:
rdd = sc.textFile('s3
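The random-key idea can be sketched without Spark at all: tag each record with a random partition id, bucket by the tag, then drop it. The helper name and signature below are made up for illustration; it mirrors a keyBy(random int) -> partitionBy -> values pipeline, not any PySpark API:

```python
import random

def random_key_partition(records, num_partitions, seed=None):
    # Assign each record a random bucket, like keying an RDD by a
    # random int and then partitioning on that key.
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_partitions)]
    for record in records:
        buckets[rng.randrange(num_partitions)].append(record)
    return buckets

buckets = random_key_partition(['a', 'b', 'c', 'd'], 2, seed=0)
# every record lands in exactly one of the 2 buckets
```

Unlike the counter-based version, random keys give no guarantee of an even split, only a statistically balanced one.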
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is "repartition()",
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had