There is a repartition method in pyspark master: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas <[email protected] > wrote: > Update: I'm now using this ghetto function to partition the RDD I get back > when I call textFile() on a gzipped file: > > # Python 2.6 > def partitionRDD(rdd, numPartitions): > counter = {'a': 0} > def count_up(x): > counter['a'] += 1 > return counter['a'] > return (rdd.keyBy(count_up) > .partitionBy(numPartitions) > .map(lambda (counter, data): data)) > > If there's supposed to be a built-in Spark method to do this, I'd love to > learn more about it. > > Nick > > > On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas < > [email protected]> wrote: > >> Hmm, doing help(rdd) in PySpark doesn't show a method called >> repartition(). Trying rdd.repartition() or rdd.repartition(10) also >> fail. I'm on 0.9.0. >> >> The approach I'm going with to partition my MappedRDD is to key it by a >> random int, and then partition it. >> >> So something like: >> >> rdd = sc.textFile('s3n://gzipped_file_brah.gz') # rdd has 1 partition; >> minSplits is not actionable due to gzip >> keyed_rdd = rdd.keyBy(lambda x: randint(1,100)) # we key the RDD so we >> can partition it >> partitioned_rdd = keyed_rdd.partitionBy(10) # rdd has 10 partitions >> >> Are you saying I don't have to do this? >> >> Nick >> >> >> >> On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <[email protected]>wrote: >> >>> Hm, yeah, the docs are not clear on this one. The function you're >>> looking for to change the number of partitions on any ol' RDD is >>> "repartition()", which is available in master but for some reason doesn't >>> seem to show up in the latest docs. Sorry about that, I also didn't realize >>> partitionBy() had this behavior from reading the Python docs (though it is >>> consistent with the Scala API, just more type-safe there). >>> >>> >>> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas < >>> [email protected]> wrote: >>> >>>> Just an FYI, it's not obvious from the >>>> docs<http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>that >>>> the following code should fail: >>>> >>>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2) >>>> a._jrdd.splits().size() >>>> a.count() >>>> b = a.partitionBy(5) >>>> b._jrdd.splits().size() >>>> b.count() >>>> >>>> I figured out from the example that if I generated a key by doing this >>>> >>>> b = a.map(lambda x: (x, x)).partitionBy(5) >>>> >>>> then all would be well. >>>> >>>> In other words, partitionBy() only works on RDDs of tuples. Is that >>>> correct? >>>> >>>> Nick >>>> >>>> >>>> ------------------------------ >>>> View this message in context: PySpark RDD.partitionBy() requires an >>>> RDD of >>>> tuples<http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html> >>>> Sent from the Apache Spark User List mailing list >>>> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at Nabble.com. >>>> >>> >>> >> >
