So this issue appears to be related to the other Python 2.7-related issue I reported in this thread <http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-td2184.html>.

Shall I open a bug in JIRA about this and include the wikistat repro?

Nick

On Sun, Mar 2, 2014 at 1:50 AM, nicholas.chammas <nicholas.cham...@gmail.com> wrote:

> Unexpected behavior. Here's the repro:
>
> 1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance type.
> 2. Upgrade the cluster to Python 2.7 using the instructions here
>    <https://spark-project.atlassian.net/browse/SPARK-922?focusedCommentId=15711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15711>.
> 3. pip2.7 install numpy
> 4. Run this script in the pyspark shell:
>
>        wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
>        wikistat = wikistat.map(lambda x: x.split(' ')).cache()
>        wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).sortByKey(False).take(5)
>
> 5. You will see a long error output that includes a complaint about
>    NumPy not being installed.
> 6. Now remove the sortByKey() from that last line and rerun it.
>
>        wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).take(5)
>
>    You should see your results without issue. So it's the sortByKey()
>    that's choking.
> 7. Quit the pyspark shell and pip uninstall numpy.
> 8. Rerun the three lines from step 4. Enjoy your sorted results
>    error-free.
>
> Can anyone else reproduce this issue? Is it a bug? I don't see it if I
> leave the cluster on the default Python 2.6.8.
>
> Installing numpy on the slave via pssh and pip2.7 (so that it's identical
> to the master) does not fix the issue. Dunno if installing Python packages
> everywhere is even necessary though.
>
> Nick
>
> ------------------------------
> View this message in context: Python 2.7 + numpy break sortByKey()
> <http://apache-spark-user-list.1001560.n3.nabble.com/Python-2-7-numpy-break-sortByKey-tp2214.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
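
For anyone trying to reproduce this, here is a minimal sketch for checking which Python interpreter is in use and whether it can import NumPy. The helper name probe_python_env is made up for illustration and is not part of Spark. The assumption is that the error in step 5 might come from the workers running a different interpreter than the driver, which the quoted report does not confirm.

```python
import sys


def probe_python_env(_=None):
    """Report the interpreter and NumPy status of the current process.

    Hypothetical helper (not a Spark API). The ignored argument lets it be
    passed directly to RDD.map().
    """
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None  # NumPy is not importable by this interpreter
    return (sys.executable, tuple(sys.version_info[:3]), numpy_version)


# Local check, e.g. on the driver:
print(probe_python_env())

# In the pyspark shell, where `sc` already exists, the same function can be
# run on the executors and compared with the driver's result:
#   sc.parallelize(range(4), 4).map(probe_python_env).distinct().collect()
```

If the executors report a different interpreter path or version than the driver, or None for NumPy, that would be worth including in the JIRA report alongside the wikistat repro.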