):
    random.seed(my_seed)
    yield my_seed

rdd.mapPartitions(f)
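Since the snippet above is cut off, here is a hedged, self-contained sketch of the same idea: seed once per partition, using the (split index, iterator) signature that PySpark's mapPartitionsWithIndex passes to its function, so each worker gets a distinct but reproducible seed. The helper name and base seed are illustrative, and the partition lists below only simulate RDD partitions locally, without a cluster.

```python
import random

def seeded_shuffle(split_index, iterator, base_seed=42):
    # One RNG per partition, seeded deterministically from the partition
    # index, so successive runs shuffle each partition the same way.
    rng = random.Random(base_seed + split_index)
    items = list(iterator)
    rng.shuffle(items)
    return iter(items)

# On a real RDD this would be:
#   rdd.mapPartitionsWithIndex(seeded_shuffle)
# Without a cluster, simulate two partitions locally:
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]
run1 = [list(seeded_shuffle(i, iter(p))) for i, p in enumerate(partitions)]
run2 = [list(seeded_shuffle(i, iter(p))) for i, p in enumerate(partitions)]
```

run1 and run2 come out identical on every execution, which is exactly the reproducibility the test setup asks for.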
From: ayan guha
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed
Sorry for the late reply. Here is what I was thinking:
import random as r
def main
Can you elaborate? Broadcast will distribute the seed, which is only one
number. But what construct do I use to "plant" the seed (i.e., call
random.seed()) once on each worker?
From: ayan guha
Sent: Tuesday, May 12, 2015 11:17 PM
To: Charles Hayden
Cc: user
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will
produce the same shuffle.
I am looking for a way to set this same random seed once on each worker. Is
there a simple way to do it?
The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print x.zip(y).collect()
The error that is produced depends on whether multiple partitions have been
specified or not.
I understand that the two RDDs must have the same number of partitions and
the same number of elements in each partition.
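RDD.zip pairs the i-th element of each corresponding partition, so both sides need identical partition counts and per-partition lengths; above, x has six elements and y only three. One common workaround is to key both sides by position and join instead. A plain-Python sketch (no cluster; the enumerate calls merely stand in for what zipWithIndex would produce):

```python
x = [1, 2, 3, 1, 2, 3]
y = [1, 2, 3]

# x.zip(y) would fail here: the element counts differ (6 vs. 3).
# Keying each element by its position, as zipWithIndex would, and then
# joining keeps only the indices present on both sides:
keyed_x = dict(enumerate(x))   # index -> value, like x.zipWithIndex()
keyed_y = dict(enumerate(y))
joined = [(keyed_x[i], keyed_y[i])
          for i in sorted(keyed_x.keys() & keyed_y.keys())]
```

The join-by-index route tolerates mismatched lengths, at the cost of a shuffle that RDD.zip avoids.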
You could also consider using a count-min data structure such as the one in
https://github.com/laserson/dsq
to get approximate quantiles, then use whatever values you want to filter the
original sequence.
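The dsq library's own API isn't shown in the message above; as an illustrative alternative, approximate quantiles can also be obtained by reservoir-sampling the sequence and reading the quantile off the sorted sample. The function name, sample size, and seed below are all made up for this sketch:

```python
import random

def approx_quantile(stream, q, sample_size=1000, seed=0):
    # Keep a uniform reservoir sample of the stream; the empirical
    # quantile of the sample approximates that of the full data.
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()
    idx = min(int(q * len(reservoir)), len(reservoir) - 1)
    return reservoir[idx]

median = approx_quantile(range(10_000), 0.5)
```

As suggested above, the returned cutoff can then be used to filter the original sequence; accuracy improves with a larger sample_size.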
From: Debasish Das
Sent: Thursday, March 26, 2015 9:45 PM
To: