I'm seeing some unexpected behavior with sortByKey() after upgrading to Python 2.7. Here's the repro:

   1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance type.
   2. Upgrade the cluster to Python 2.7 using the instructions here:
   https://spark-project.atlassian.net/browse/SPARK-922?focusedCommentId=15711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15711
   3. pip2.7 install numpy
   4. Run this script in the pyspark shell:

   wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
   wikistat = wikistat.map(lambda x: x.split(' ')).cache()
   wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).sortByKey(False).take(5)

   5. You will see a long error output that includes a complaint about
   NumPy not being installed.
   6. Now remove the sortByKey() from that last line and rerun it.

   wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).take(5)

   You should see your results without issue. So it's the sortByKey()
   that's choking. (There's a smaller standalone check sketched just after this list.)
   7. Quit the pyspark shell and pip uninstall numpy.
   8. Rerun the three lines from step 4. Enjoy your sorted results
   error-free.
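
If it helps anyone reproduce without pulling the S3 file, here's a minimal sketch of
a standalone check (just a sketch -- I've only actually run the wiki version above):

   # Tiny pair RDD with more than one partition, so sortByKey has to sample
   # the data to compute range-partition bounds; presumably that's the same
   # code path the wiki repro goes through.
   pairs = [('b', 2), ('a', 1), ('d', 4), ('c', 3)]
   sc.parallelize(pairs, 2).sortByKey(False).take(4)

If that also complains about NumPy under Python 2.7, it would at least rule out
anything specific to the gzipped file or the s3n:// input.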

Can anyone else reproduce this issue? Is it a bug? I don't see it if I
leave the cluster on the default Python 2.6.8.

Installing numpy on the slave via pssh and pip2.7 (so that it's identical
to the master) does not fix the issue. I don't know whether installing Python
packages on every node is even necessary, though.
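
To check what the executors actually see (as opposed to the master), something like
this from the same pyspark shell should report each worker's Python version and
whether numpy imports there; env_info is just a throwaway helper name:

   # Runs on the workers: report the Python version and the numpy version
   # (or None if numpy isn't importable there).
   def env_info(_records):
       import sys
       try:
           import numpy
           numpy_version = numpy.__version__
       except ImportError:
           numpy_version = None
       return [(sys.version.split()[0], numpy_version)]

   sc.parallelize(range(4), 2).mapPartitions(env_info).collect()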

Nick



