The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey() requires using many machines. It seems like Python may not have been upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder on each slave, find the most recent application log, and see which slave is logging the error message.
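A rough sketch of that log check, in case it helps. It assumes each application gets its own subdirectory under the work folder and that the error text contains the word "numpy"; both are guesses on my part, and the helper names are made up for illustration. It uses only the standard library and should run on either Python 2.7 or 3.

```python
import io
import os


def newest_app_dir(work_dir):
    """Return the most recently modified subdirectory of work_dir, or None."""
    if not os.path.isdir(work_dir):
        return None
    subdirs = [
        os.path.join(work_dir, name)
        for name in os.listdir(work_dir)
        if os.path.isdir(os.path.join(work_dir, name))
    ]
    return max(subdirs, key=os.path.getmtime) if subdirs else None


def files_mentioning(app_dir, needle):
    """Return sorted paths of files under app_dir containing needle (case-insensitive)."""
    needle = needle.lower()
    hits = []
    for root, _dirs, files in os.walk(app_dir):
        for name in files:
            path = os.path.join(root, name)
            # Decode leniently so stray bytes in a log do not abort the scan.
            with io.open(path, encoding="utf-8", errors="replace") as fh:
                if any(needle in line.lower() for line in fh):
                    hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Assumption: run this on each slave; the search string is a guess.
    app = newest_app_dir("/root/spark/work")
    if app is not None:
        for path in files_mentioning(app, "numpy"):
            print(path)
```

Run it on each slave (for example via pssh). Any slave that prints file paths is one whose latest application log mentions numpy, and is the one to inspect first.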

On Wed, Mar 5, 2014 at 9:02 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Devs? Is this an issue for you that deserves a ticket?
>
>
> On Sun, Mar 2, 2014 at 4:32 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>>
>> So this issue appears to be related to the other Python 2.7-related issue
>> I reported in this thread.
>>
>> Shall I open a bug in JIRA about this and include the wikistat repro?
>>
>> Nick
>>
>>
>> On Sun, Mar 2, 2014 at 1:50 AM, nicholas.chammas
>> <nicholas.cham...@gmail.com> wrote:
>>>
>>> Unexpected behavior. Here's the repro:
>>>
>>> 1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance type.
>>> 2. Upgrade the cluster to Python 2.7 using the instructions here.
>>> 3. pip2.7 install numpy
>>> 4. Run this script in the pyspark shell:
>>>
>>>    wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
>>>    wikistat = wikistat.map(lambda x: x.split(' ')).cache()
>>>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).sortByKey(False).take(5)
>>>
>>> 5. You will see a long error output that includes a complaint about NumPy
>>>    not being installed.
>>> 6. Now remove the sortByKey() from that last line and rerun it.
>>>
>>>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).take(5)
>>>
>>> 7. You should see your results without issue. So it's the sortByKey() that's
>>>    choking.
>>> 8. Quit the pyspark shell and pip uninstall numpy.
>>> 9. Rerun the three lines from step 4. Enjoy your sorted results error-free.
>>>
>>> Can anyone else reproduce this issue? Is it a bug? I don't see it if I
>>> leave the cluster on the default Python 2.6.8.
>>>
>>> Installing numpy on the slave via pssh and pip2.7 (so that it's identical
>>> to the master) does not fix the issue. Dunno if installing Python packages
>>> everywhere is even necessary though.
>>>
>>> Nick
>>>
>>>
>>> ________________________________
>>> View this message in context: Python 2.7 + numpy break sortByKey()
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>