Actually, on closer look, PYTHONHASHSEED should be set (in the worker) when you
create a SparkContext:
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L166
And it should also be set from spark-submit or pyspark.
Can you check sys.version and os.environ.get("PYTHONHASHSEED")?
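A minimal sketch of that check, meant to be run in the driver's Python session (the pyspark shell or a notebook cell). The helper name hash_seed_report is only illustrative and is not part of PySpark.

```python
import os
import sys


def hash_seed_report():
    """Return the interpreter version and the PYTHONHASHSEED value seen by this process."""
    return {
        "version": sys.version,
        # None means the variable is not set in this process's environment
        "PYTHONHASHSEED": os.environ.get("PYTHONHASHSEED"),
    }


print(hash_seed_report())
```

Note that this only reports what the driver process sees; the workers have their own environment.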
Date: Sun, 29 Nov 2015 09:48:19 -0800
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
From: [email protected]
To: [email protected]; [email protected]
CC: [email protected]
Hi Felix and Ted
This is how I am starting spark
Should I file a bug?
Andy
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*
From: Felix Cheung <[email protected]>
Date: Saturday, November 28, 2015 at 12:11 AM
To: Ted Yu <[email protected]>
Cc: Andrew Davidson <[email protected]>, "user @spark"
<[email protected]>
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
Ah, it's there in spark-submit and pyspark. Seems like it should be added
for spark_ec2.
_____________________________
From: Ted Yu <[email protected]>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <[email protected]>
Cc: Andy Davidson <[email protected]>, user @spark
<[email protected]>
ec2/spark-ec2 calls ./ec2/spark_ec2.py
I don't see PYTHONHASHSEED defined in any of these scripts.
Andy reported this for ec2 cluster.
I think a JIRA should be opened.
On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung
<[email protected]> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: [email protected]
To: [email protected]
I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster
and configured spark-env to use python3. I get an exception:
'Randomness of hash of string should be disabled via PYTHONHASHSEED'.
Is there any reason rdd.py should not just set PYTHONHASHSEED?
Should I file a bug?
Kind regards
Andy
details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
Example does not work out of the box
subtract(other, numPartitions=None)
Return each value in self that is not contained in other.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
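For context, Python 3.3+ randomizes string hashes per interpreter process by default, so without a fixed seed two worker processes can disagree on the hash of the same key. A rough illustration using only the standard library (the helper string_hash_in_new_process is hypothetical, not PySpark code):

```python
import os
import subprocess
import sys


def string_hash_in_new_process(seed=None):
    """Start a fresh interpreter and return hash('a') as computed there."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean state
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('a'))"], env=env
    )
    return int(out.decode().strip())


# With a fixed seed, separate processes agree on the hash of the same string.
print(string_hash_in_new_process(123) == string_hash_in_new_process(123))
# Without a seed, the two values will usually differ between processes.
print(string_hash_in_new_process(), string_hash_in_new_process())
```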
The following script fixes the problem
printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" | sudo tee -a /root/spark/conf/spark-env.sh > /dev/null
sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
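Once the file has been pushed and the workers have picked it up, one way to confirm that the executors actually see the variable is to read it from inside tasks. This is only a sketch and assumes an existing SparkContext sc; worker_hash_seeds and num_tasks are illustrative names, and nothing guarantees that the tasks land on every worker.

```python
import os


def worker_hash_seeds(sc, num_tasks=8):
    """Collect the distinct PYTHONHASHSEED values seen by tasks (None means unset)."""
    seeds = (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: os.environ.get("PYTHONHASHSEED"))
        .collect()  # no shuffle here, so portable_hash() is not invoked
    )
    return set(seeds)


# Usage in a pyspark shell, where `sc` already exists:
# worker_hash_seeds(sc)
```

A result containing a single value matching the one in spark-env.sh would suggest the setting reached the executors that ran these tasks; a result containing None would suggest it did not.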