rdd.py portable_hash()

Ted Yu Fri, 27 Nov 2015 11:51:06 -0800

ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.


Andy reported this for an ec2 cluster.

I think a JIRA should be opened.
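
For context on why the check exists at all: since Python 3.3, str hashes are salted per interpreter process unless PYTHONHASHSEED is set, so the driver and each worker could hash the same key differently. Below is a minimal sketch that shows this using only the standard library. It is not taken from the Spark source; the helper name string_hash_in_fresh_interpreter is made up for illustration, and the seed 123 is just the value used in Andy's script.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a brand-new Python process.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    env = dict(os.environ)
    # Start from a clean slate so the caller's own setting does not leak in.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


# Three separate interpreters with the same fixed seed, then three without one.
seeded = {string_hash_in_fresh_interpreter("a", seed=123) for _ in range(3)}
unseeded = {string_hash_in_fresh_interpreter("a") for _ in range(3)}

print("distinct hashes with PYTHONHASHSEED=123:", len(seeded))
print("distinct hashes without PYTHONHASHSEED:", len(unseeded))
```

With the seed fixed, every process agrees on the hash, so the first count is 1. Without it, the second count will normally be greater than 1 on Python 3.3+, which is the situation portable_hash() refuses to run in.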


On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <[email protected]>
wrote:

> May I ask how you are starting Spark?
> It looks like PYTHONHASHSEED is being set:
> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>
>
> ------------------------------
> Date: Thu, 26 Nov 2015 11:30:09 -0800
> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
> From: [email protected]
> To: [email protected]
>
> I am using spark-1.5.1-bin-hadoop2.6. I used
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and
> configured spark-env to use python3. I get an exception 'Randomness of
> hash of string should be disabled via PYTHONHASHSEED'. Is there any
> reason rdd.py should not just set PYTHONHASHSEED?
>
> Should I file a bug?
>
> Kind regards
>
> Andy
>
> details
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>
> The example does not work out of the box:
>
> subtract(other, numPartitions=None)
> <http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract>
>
> Return each value in self that is not contained in other.
>
> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
> >>> y = sc.parallelize([("a", 3), ("c", None)])
> >>> sorted(x.subtract(y).collect())
> [('a', 1), ('b', 4), ('b', 5)]
>
> It raises
>
>     if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>         raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>
>
>
> The following script fixes the problem:
>
> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>
> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>
> for i in `cat slaves` ; do sudo scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
>
>
>
>
