rdd.py portable_hash()

Andy Davidson Sun, 29 Nov 2015 09:49:30 -0800

Hi Felix and Ted

This is how I am starting Spark; the launch script follows below.


Should I file a bug?

Andy


export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

$SPARK_ROOT/bin/pyspark \
    --master $MASTER_URL \
    --total-executor-cores $numCores \
    --driver-memory 2G \
    --executor-memory 2G \
    $extraPkgs \
    $*
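
For anyone reproducing this: the guard in rdd.py quoted further down only looks at the environment of the Python process that evaluates it, so a quick check from the notebook shows whether the launch script above passed PYTHONHASHSEED through. This is a minimal sketch, not Spark code. The helper name guard_would_raise is made up for illustration, and it compares sys.version_info tuples rather than the version string that rdd.py compares.

```python
import os
import sys


def guard_would_raise(environ=None, version_info=None):
    """Mirror the condition quoted from rdd.py (hypothetical helper, not part of PySpark)."""
    if environ is None:
        environ = os.environ
    if version_info is None:
        version_info = sys.version_info
    # String hashing is randomized by default from Python 3.3 on, so the
    # guard only matters there, and only when the variable is absent.
    return tuple(version_info[:2]) >= (3, 3) and "PYTHONHASHSEED" not in environ


if __name__ == "__main__":
    print("guard would raise here:", guard_would_raise())
```

This only reports on the process where it runs. The workers on the slaves have their own environment, which is presumably why the fix quoted below edits spark-env.sh on every node.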


From:  Felix Cheung <[email protected]>
Date:  Saturday, November 28, 2015 at 12:11 AM
To:  Ted Yu <[email protected]>
Cc:  Andrew Davidson <[email protected]>, "user @spark"
<[email protected]>
Subject:  Re: possible bug spark/python/pyspark/rdd.py portable_hash()

> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2.
>
> _____________________________
> From: Ted Yu <[email protected]>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <[email protected]>
> Cc: Andy Davidson <[email protected]>, user @spark
> <[email protected]>
>
> ec2/spark-ec2 calls ./ec2/spark_ec2.py
>
> I don't see PYTHONHASHSEED defined in any of these scripts.
>
> Andy reported this for an EC2 cluster.
>
> I think a JIRA should be opened.
>
> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <[email protected]> wrote:
>
>> May I ask how you are starting Spark?
>> It looks like PYTHONHASHSEED is being set:
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>>
>> Date: Thu, 26 Nov 2015 11:30:09 -0800
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
>> From: [email protected]
>> To: [email protected]
>>
>> I am using spark-1.5.1-bin-hadoop2.6. I used
>> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and
>> configured spark-env to use python3. I get an exception: 'Randomness
>> of hash of string should be disabled via PYTHONHASHSEED'. Is there any
>> reason rdd.py should not just set PYTHONHASHSEED?
>>
>> Should I file a bug?
>>
>> Kind regards
>>
>> Andy
>>
>> Details
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>>
>> Example does not work out of the box
>>
>> subtract(other, numPartitions=None)
>> 
>> Return each value in self that is not contained in other.
>>
>> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>> >>> y = sc.parallelize([("a", 3), ("c", None)])
>> >>> sorted(x.subtract(y).collect())
>> [('a', 1), ('b', 4), ('b', 5)]
>>
>> It raises
>>
>>     if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>         raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>>
>> The following script fixes the problem:
>>
>> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>>
>> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>>
>> for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
>>
>
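
To illustrate why the guard insists on a fixed seed: with hash randomization enabled, separate Python 3 interpreters generally disagree on hash() of the same string, so a partitioner built on it could send the same key to different partitions depending on which process hashed it. The sketch below is an illustration only, not part of Spark. It starts fresh interpreters with a chosen PYTHONHASHSEED and compares the results; the helper name string_hash_in_child is hypothetical.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed, text="a"):
    """Return hash(text) as computed by a fresh interpreter started with the given seed."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


# Two separate processes with the same seed agree on the hash of a string.
first = string_hash_in_child(123)
second = string_hash_in_child(123)
print("same seed, same hash:", first == second)

# A different seed is shown for comparison; no particular result is asserted.
other = string_hash_in_child(456)
print("different seed, same hash:", first == other)
```

Exporting the same PYTHONHASHSEED value in spark-env.sh on every node, as in the script quoted above, puts all interpreters in the "same seed" case.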
