Re: how to set random seed

2015-05-14 Thread Charles Hayden
): random.seed(my_seed) yield my_seed rdd.mapPartitions(f) From: ayan guha Sent: Thursday, May 14, 2015 2:29 AM To: Charles Hayden Cc: user Subject: Re: how to set random seed Sorry for late reply. Here is what I was thinking import random as r def main
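
A minimal sketch of the per-partition seeding idea discussed in this thread, not the code from the truncated message above. It assumes a variant approach: instead of calling random.seed() on the shared module-level generator, each partition gets its own random.Random instance seeded from a base seed plus the partition index. The names make_seeded_shuffler and shuffle_partition are hypothetical. The function has the (index, iterator) shape that rdd.mapPartitionsWithIndex expects; the check below runs it on plain lists so it needs no cluster. Results only repeat across runs if the partition contents and their order are also the same.

```python
import random


def make_seeded_shuffler(base_seed):
    """Build a function with the (index, iterator) shape used by mapPartitionsWithIndex."""
    def shuffle_partition(index, iterator):
        # One private generator per partition, so the seed is planted
        # exactly once per partition and no global state is touched.
        rng = random.Random(base_seed + index)
        items = list(iterator)
        rng.shuffle(items)
        return iter(items)
    return shuffle_partition


# Local stand-in for two partitions (hypothetical data, no Spark required).
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]
f = make_seeded_shuffler(42)
run1 = [list(f(i, iter(p))) for i, p in enumerate(partitions)]
run2 = [list(f(i, iter(p))) for i, p in enumerate(partitions)]
print(run1 == run2)  # same seed and same partitions give the same shuffle
```

With a real RDD this would be applied as rdd.mapPartitionsWithIndex(make_seeded_shuffler(42)).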

Re: how to set random seed

2015-05-13 Thread Charles Hayden
Can you elaborate? Broadcast will distribute the seed, which is only one number. But what construct do I use to "plant" the seed (call random.seed()) once on each worker? From: ayan guha Sent: Tuesday, May 12, 2015 11:17 PM To: Charles Hayden Cc: us

how to set random seed

2015-05-12 Thread Charles Hayden
In PySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking for a way to set this same random seed once on each worker. Is there any simple way to do it?

pyspark error with zip

2015-03-31 Thread Charles Hayden
The following program fails in the zip step. x = sc.parallelize([1, 2, 3, 1, 2, 3]) y = sc.parallelize([1, 2, 3]) z = x.distinct() print x.zip(y).collect() The error that is produced depends on whether multiple partitions have been specified or not. I understand that the two RDDs [must] ha

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
You could also consider using a count-min data structure such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence. From: Debasish Das Sent: Thursday, March 26, 2015 9:45 PM To:
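
A rough sketch of the "estimate a quantile, then filter" idea from the message above. It does not use the dsq library or a count-min structure; as a stand-in, it estimates the cutoff by sorting a random sample, which is an assumption made purely for illustration. The function name approx_threshold and its parameters are hypothetical.

```python
import random


def approx_threshold(values, top_fraction, sample_size=1000, seed=0):
    """Estimate the value above which roughly `top_fraction` of the data lies."""
    rng = random.Random(seed)
    # Use the whole data set when it is small, otherwise a fixed-size sample.
    if len(values) <= sample_size:
        sample = list(values)
    else:
        sample = rng.sample(values, sample_size)
    ordered = sorted(sample)
    # Index of the cutoff element, clamped to the last valid position.
    cut = min(int(len(ordered) * (1.0 - top_fraction)), len(ordered) - 1)
    return ordered[cut]


data = list(range(1, 101))  # hypothetical data
threshold = approx_threshold(data, 0.10)
# Filter the original sequence with the estimated cutoff.
top = [v for v in data if v >= threshold]
print(threshold, len(top))
```

On an RDD the final step would correspond to rdd.filter(lambda v: v >= threshold), with the threshold computed from a sample collected on the driver.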