Unexpected behavior. Here's the repro:

1. Launch an EC2 cluster with spark-ec2: one slave, default instance type.
2. Upgrade the cluster to Python 2.7 using the instructions here:
   https://spark-project.atlassian.net/browse/SPARK-922?focusedCommentId=15711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15711
3. pip2.7 install numpy
4. Run this script in the pyspark shell:
   wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
   wikistat = wikistat.map(lambda x: x.split(' ')).cache()
   wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).sortByKey(False).take(5)

5. You will see a long error output that includes a complaint about NumPy not being installed.
6. Now remove the sortByKey() from that last line and rerun it:

   wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).take(5)

   You should see your results without issue, so it's the sortByKey() that's choking.
7. Quit the pyspark shell and pip uninstall numpy.
8. Rerun the three lines from step 4. Enjoy your sorted results, error-free.

Can anyone else reproduce this issue? Is it a bug? I don't see it if I leave the cluster on the default Python 2.6.8. Installing numpy on the slave via pssh and pip2.7 (so that it's identical to the master) does not fix the issue, and I'm not sure whether installing Python packages everywhere is even necessary.

Nick
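P.S. For anyone trying to reproduce this, here's a quick check that is not part of my original repro. It assumes the stock pyspark shell, so sc is already defined, and it just reports whether the driver and the workers agree on NumPy being importable. I'm only guessing that a driver/worker mismatch matters to the sortByKey() path, so treat it as a diagnostic sketch, not an explanation:

   def has_numpy(_):
       # The import is attempted inside the task, so this reports on the
       # worker's Python environment, not the driver's.
       try:
           import numpy
           return True
       except ImportError:
           return False

   print 'driver :', has_numpy(None)
   print 'workers:', sc.parallelize(range(4), 4).map(has_numpy).distinct().collect()

If the two lines disagree, that would at least confirm the Python 2.7 upgrade left the master and slave environments out of sync, even if it doesn't explain why only sortByKey() trips over it.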