Ah, it's there in spark-submit and pyspark. Seems like it should be added for
spark_ec2.
_____________________________
From: Ted Yu <[email protected]>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <[email protected]>
Cc: Andy Davidson <[email protected]>, user @spark
<[email protected]>
ec2/spark-ec2 calls ./ec2/spark_ec2.py
I don't see PYTHONHASHSEED defined in any of these scripts.
Andy reported this for ec2 cluster.
I think a JIRA should be opened.
On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung
<[email protected]> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: [email protected]
To: [email protected]
I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster
and configured spark-env to use python3. I get an exception:
'Randomness of hash of string should be disabled via PYTHONHASHSEED'.
Is there any reason rdd.py should not just set PYTHONHASHSEED?
Should I file a bug?
Kind regards
Andy
details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
Example does not work out of the box
subtract(other, numPartitions=None)
Return each value in self that is not contained in other.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
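
For anyone wondering why the check above exists: since Python 3.3, string hashes are randomized per interpreter process unless PYTHONHASHSEED is set, and the variable is read at interpreter startup. The sketch below is not from rdd.py; it only illustrates the effect by asking freshly started interpreters for hash() of the same string. The helper name string_hash_in_fresh_interpreter is made up for this example.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a newly started Python process.

    seed=None leaves PYTHONHASHSEED unset in the child, so Python 3.3+
    picks a random hash seed; otherwise the given seed is passed through.
    (Hypothetical helper, for illustration only.)
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same seed in two separate processes: the hashes agree.
    print(string_hash_in_fresh_interpreter("a", seed=123))
    print(string_hash_in_fresh_interpreter("a", seed=123))
    # No seed: the two values will usually differ from run to run.
    print(string_hash_in_fresh_interpreter("a"))
    print(string_hash_in_fresh_interpreter("a"))
```

With a fixed seed every process agrees on the hash of a string key. Without one, each worker process may compute a different value for the same key, which is presumably why the guard refuses to run.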
The following script fixes the problem
sudo printf "
# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'
export PYTHONHASHSEED=123
" >> /root/spark/conf/spark-env.sh
sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
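
One caveat with the script above is that running it twice appends the export line twice. A minimal sketch of an idempotent version of the append step follows, assuming the file is edited locally before being copied to the slaves. The function name ensure_hash_seed and its default seed are hypothetical, and nothing here is part of spark-ec2.

```python
from pathlib import Path


def ensure_hash_seed(env_file, seed=123):
    """Append 'export PYTHONHASHSEED=<seed>' to env_file unless already present.

    Returns True if the file was modified, False if an export line for
    PYTHONHASHSEED was already there. (Hypothetical helper.)
    """
    path = Path(env_file)
    existing = path.read_text() if path.exists() else ""
    for line in existing.splitlines():
        if line.strip().startswith("export PYTHONHASHSEED="):
            return False  # already configured; leave the file untouched
    # Make sure the new line starts on its own line.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}export PYTHONHASHSEED={seed}\n")
    return True
```

For example, ensure_hash_seed("/root/spark/conf/spark-env.sh") would add the line on the first call and do nothing on later calls.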