Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-19 Thread Uri Laserson
gt;>> akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at >>> akka.actor.ActorCell.invoke(ActorCell.scala:456) at >>> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at >>> akka.dispatch.Mailbox.run(Mailbox.scala:219) at >>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) >>> at >>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at >>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>> at >>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at >>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >>> >>> >>> >>> The lambda passed to flatMap() returns a list of tuples; take() works >>> fine just on the flatMap(). >>> >>> Where would I start to troubleshoot this error? >>> >>> The error output includes mention of reset connections, so I naively >>> confirmed that the master node can reach its 1 slave. Dunno if those are >>> related things. >>> >>> If it matters any, I upgraded the cluster to Python 2.7 using the >>> instructions here <https://spark-project.atlassian.net/browse/SPARK-922>. >>> Also, I am running Spark 0.9.0, though I notice that in the error output is >>> mention of 0.8.1 files. >>> >>> Nick >>> >>> -- >>> View this message in context: java.net.SocketException on reduceByKey() >>> in >>> pyspark<http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-tp2184.html> >>> Sent from the Apache Spark User List mailing list >>> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at Nabble.com. >>> >> >> > -- Uri Laserson, PhD Data Scientist, Cloudera Twitter/GitHub: @laserson +1 617 910 0447 laser...@cloudera.com

Access original filename in a map function

2014-03-18 Thread Uri Laserson
union them together. But listing the files in HDFS is a bit annoying through Python, so I was wondering if the filename is somehow attached to a partition.

Thanks!

Uri
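
The start of this message is cut off in the archive, but the surviving text suggests a workaround along the lines of loading each file separately, tagging its records with the filename, and unioning the results. A minimal sketch of that idea follows, assuming `sc` is a live PySpark SparkContext and that the caller already has a list of paths. The function name `tag_lines_with_filename` and the helper `load_tagged` are hypothetical and are not taken from the thread; only `textFile`, `map`, and `SparkContext.union` are used from PySpark.

```python
def tag_lines_with_filename(sc, paths):
    """Return one RDD of (path, line) pairs built from several text files.

    Assumptions: `sc` is a pyspark SparkContext and `paths` is a list of
    file paths obtained elsewhere (listing HDFS is not handled here).
    """
    def load_tagged(path):
        # The helper binds `path` once per call. A bare lambda written
        # directly in a loop could end up seeing only the last loop value,
        # because the closure is serialized lazily.
        return sc.textFile(path).map(lambda line: (path, line))

    tagged = [load_tagged(p) for p in paths]
    # SparkContext.union merges a list of RDDs into a single RDD.
    return sc.union(tagged)
```

Calling `tag_lines_with_filename(sc, paths)` with a list of known paths would give an RDD whose records carry their source filename, so a later map function can read it from the first element of each pair. This does not answer whether the filename is attached to a partition, and the list of paths still has to be produced some other way. Later PySpark releases also provide `SparkContext.wholeTextFiles`, which yields (filename, content) pairs; whether it is available depends on the Spark version in use.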