Hi Felix and Ted,
This is how I am starting Spark:
Should I file a bug?
Andy
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*
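
The exports above do not include PYTHONHASHSEED. As a rough sketch (the helper name hash_seed_status is made up for illustration and is not part of PySpark), something like the following could be run in the notebook to see whether the condition quoted further down in this thread would trip. It only inspects the process it runs in, so it says nothing about the executors.

```python
import os
import sys


def hash_seed_status(environ=None, version_info=None):
    """Mirror the check quoted from rdd.py: Python >= 3.3 needs PYTHONHASHSEED.

    Hypothetical helper for illustration; both arguments default to the
    current process so the function can also be exercised with fake values.
    """
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info
    if tuple(version_info[:2]) < (3, 3):
        return "not needed"  # the quoted check only applies from 3.3 on
    if "PYTHONHASHSEED" in environ:
        return "set"
    return "missing"


# Report the state of the interpreter running this cell.
print(hash_seed_status())
```

If this prints "missing" on the driver, adding an export of PYTHONHASHSEED next to the other exports is one thing to try; the workers would still need the variable separately.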
From: Felix Cheung <[email protected]>
Date: Saturday, November 28, 2015 at 12:11 AM
To: Ted Yu <[email protected]>
Cc: Andrew Davidson <[email protected]>, "user @spark"
<[email protected]>
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
>
> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2
>
>
>
> _____________________________
> From: Ted Yu <[email protected]>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <[email protected]>
> Cc: Andy Davidson <[email protected]>, user @spark
> <[email protected]>
>
>
>
> ec2/spark-ec2 calls ./ec2/spark_ec2.py
>
>
>
>
> I don't see PYTHONHASHSEED defined in any of these scripts.
>
>
>
>
> Andy reported this for ec2 cluster.
>
>
>
>
> I think a JIRA should be opened.
>
>
>
>
>
>
>
> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung
> <[email protected]> wrote:
>
>>
>>
>> May I ask how you are starting Spark?
>> It looks like PYTHONHASHSEED is being set:
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>>
>>
>>
>>
>>
>> Date: Thu, 26 Nov 2015 11:30:09 -0800
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
>> From: [email protected]
>> To: [email protected]
>>
>>
>> I am using spark-1.5.1-bin-hadoop2.6. I used
>> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster
>> and configured spark-env to use python3. I get an exception:
>> 'Randomness of hash of string should be disabled via PYTHONHASHSEED'.
>> Is there any reason rdd.py should not just set PYTHONHASHSEED?
>>
>>
>>
>>
>> Should I file a bug?
>>
>>
>>
>>
>> Kind regards
>>
>>
>>
>>
>> Andy
>>
>>
>>
>>
>> details
>>
>>
>>
>>
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>>
>>
>>
>>
>> Example does not work out of the box
>>
>>
>>
>>
>> subtract(other, numPartitions=None)
>> <http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract>
>>
>> Return each value in self that is not contained in other.
>>
>>
>>
>> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>> >>> y = sc.parallelize([("a", 3), ("c", None)])
>> >>> sorted(x.subtract(y).collect())
>> [('a', 1), ('b', 4), ('b', 5)]
>>
>>
>>
>> It raises
>>
>>
>>
>>
>>     if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>         raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
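
For context on why this check exists: since Python 3.3, string hashes are salted per process unless PYTHONHASHSEED is fixed, so two worker processes can disagree about the hash of the same key. The sketch below illustrates that outside of Spark by starting fresh interpreters with a given seed. The helper name string_hash_in_new_interpreter is made up for this illustration.

```python
import os
import subprocess
import sys


def string_hash_in_new_interpreter(seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed and return hash('a').

    Hypothetical helper; `seed` may be an integer or the string "random".
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('a'))"], env=env
    )
    return int(out.decode().strip())


# With a fixed seed, two separate processes compute the same hash.
fixed_a = string_hash_in_new_interpreter(123)
fixed_b = string_hash_in_new_interpreter(123)
print("fixed seed agrees:", fixed_a == fixed_b)

# With "random", each process picks its own salt, so the two values
# are expected to differ (not guaranteed, just overwhelmingly likely).
rand_a = string_hash_in_new_interpreter("random")
rand_b = string_hash_in_new_interpreter("random")
print("random seed agrees:", rand_a == rand_b)
```

The seed value itself should not matter for this purpose, as long as every Python process involved uses the same one.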
>>
>>
>>
>>
>>
>>
>>
>>
>> The following script fixes the problem
>>
>>
>>
>>
>> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>>
>>
>>
>>
>> sudo pssh -i -h /root/spark-ec2/slaves cp
>> /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date
>> "+%Y-%m-%d:%H:%M"`
>>
>>
>>
>>
>> for i in `cat slaves`; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
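
One caveat with the printf step quoted above is that running it twice appends the export twice. A minimal sketch of an idempotent variant follows, assuming the same export line and seed as in the quoted script. The function name ensure_hash_seed_export is hypothetical, and the demo writes to a throwaway file rather than the real spark-env.sh.

```python
import os
import tempfile

EXPORT_PREFIX = "export PYTHONHASHSEED="


def ensure_hash_seed_export(env_file, seed=123):
    """Append 'export PYTHONHASHSEED=<seed>' to env_file unless already present.

    Hypothetical helper. Returns True if the file was modified.
    """
    existing = ""
    if os.path.exists(env_file):
        with open(env_file) as f:
            existing = f.read()
    # Skip the append when any line already exports the variable.
    if any(line.strip().startswith(EXPORT_PREFIX)
           for line in existing.splitlines()):
        return False
    with open(env_file, "a") as f:
        if existing and not existing.endswith("\n"):
            f.write("\n")
        f.write("# fixed seed so python3 string hashes agree across processes\n")
        f.write("%s%d\n" % (EXPORT_PREFIX, seed))
    return True


# Demo on a temporary file: the first call appends, the second is a no-op.
demo_file = os.path.join(tempfile.mkdtemp(), "spark-env.sh")
first_call = ensure_hash_seed_export(demo_file)
second_call = ensure_hash_seed_export(demo_file)
print(first_call, second_call)
```

The modified file would still have to be copied to each slave, as in the scp loop quoted above.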