ec2/spark-ec2 calls ./ec2/spark_ec2.py. I don't see PYTHONHASHSEED defined in either of these scripts.
Andy reported this for an EC2 cluster. I think a JIRA should be opened.

On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <[email protected]> wrote:

> May I ask how you are starting Spark?
> It looks like PYTHONHASHSEED is being set:
> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>
> ------------------------------
> Date: Thu, 26 Nov 2015 11:30:09 -0800
> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
> From: [email protected]
> To: [email protected]
>
> I am using spark-1.5.1-bin-hadoop2.6. I used
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and
> configured spark-env to use python3. I get an exception 'Randomness of
> hash of string should be disabled via PYTHONHASHSEED'. Is there any
> reason rdd.py should not just set PYTHONHASHSEED?
>
> Should I file a bug?
>
> Kind regards
>
> Andy
>
> details
>
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>
> Example does not work out of the box
>
> subtract(*other*, *numPartitions=None*)
> <http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract>
>
> Return each value in self that is not contained in other.
>
> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
> >>> y = sc.parallelize([("a", 3), ("c", None)])
> >>> sorted(x.subtract(y).collect())
> [('a', 1), ('b', 4), ('b', 5)]
>
> It raises
>
> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>     raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>
> *The following script fixes the problem*
>
> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>
> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>
> for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done

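For anyone following the thread, here is a minimal sketch of why the check in rdd.py matters. It is not Spark code. It assumes only a local Python 3.3+ interpreter, and the helper name string_hash_in_child is made up for illustration. Each call starts a fresh interpreter, roughly the way each worker runs its own Python process, and asks it for the hash of the same string key.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed=None):
    """Return hash('a') as computed by a freshly started Python interpreter.

    Hypothetical helper for illustration only. If seed is None,
    PYTHONHASHSEED is removed from the child's environment, so Python 3.3+
    picks a random hash seed for that process.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('a'))"], env=env
    )
    return int(out.decode().strip())


# With a fixed seed, every process computes the same hash for the same key.
fixed = {string_hash_in_child(seed=123) for _ in range(3)}

# Without a seed, separate processes will usually disagree.
unseeded = {string_hash_in_child() for _ in range(3)}

print("distinct hashes with PYTHONHASHSEED=123:", len(fixed))
print("distinct hashes without PYTHONHASHSEED:", len(unseeded))
```

If the workers disagree on the hash of a key, hash-partitioned operations such as subtract() can send equal keys to different partitions, which is presumably why portable_hash() refuses to run without the variable. Exporting the same PYTHONHASHSEED value in spark-env.sh on every node, as in Andy's script, gives all interpreters the same seed.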