Different classpath across stages?

2015-11-11 Thread John Meehan
I’ve been running into a strange class not found problem, but only when my job has more than one phase. I have an RDD[ProtobufClass] which behaves as expected in a single-stage job (e.g. serialize to JSON and export). But when I try to groupByKey, the first stage runs (essentially a keyBy), bu

PySpark on YARN "port out of range"

2015-06-19 Thread John Meehan
Has anyone encountered this “port out of range” error when launching PySpark jobs on YARN? It is sporadic (e.g. 2/3 jobs get this error). LOG: 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 39.0 (TID 211) on executor xxx.xxx.xxx.com : java.lan

Re: Joining data using Latitude, Longitude

2015-03-10 Thread John Meehan
There are some techniques you can use if you geohash the lat-lngs. They will naturally be sorted by proximity (with some edge cases, so watch out). If you go the join route, either by trimming the lat-lngs or geohashing them, you’re essentially grouping n
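The join route mentioned in this message can be sketched roughly as follows. This is a minimal illustration in plain Python, not the poster's code: `geohash_encode` is a hand-written version of the standard geohash encoding, and `join_by_cell`, the `(name, lat, lng)` tuple layout, and the default precision of 5 are all assumptions made for the example. In Spark the same idea would amount to keying both datasets by the geohash string and joining on that key.

```python
from collections import defaultdict

# Standard geohash base-32 alphabet (no a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lng, precision=5):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def join_by_cell(left, right, precision=5):
    """Pair up points from two lists that fall in the same geohash cell.

    Hypothetical helper: each input item is a (name, lat, lng) tuple.
    """
    cells = defaultdict(list)
    for name, lat, lng in right:
        cells[geohash_encode(lat, lng, precision)].append(name)
    pairs = []
    for name, lat, lng in left:
        for other in cells.get(geohash_encode(lat, lng, precision), []):
            pairs.append((name, other))
    return pairs
```

Two points that sit close together but on opposite sides of a cell boundary get different hashes and will not be paired by this sketch, which is one of the edge cases the message warns about. A common workaround is to also look up the neighbouring cells of each point.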

Re: Getting to proto buff classes in Spark Context

2015-02-28 Thread John Meehan
Maybe try including the jar with --driver-class-path
> On Feb 26, 2015, at 12:16 PM, Akshat Aranya wrote:
>
> My guess would be that you are packaging too many things in your job, which
> is causing problems with the classpath. When your jar goes in first, you get
> the correct version o

Re: Spark and Play

2014-11-11 Thread John Meehan
You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly, e.g. for yarn-client support or for use with spark-shell when debugging:
play.Project.playScalaSettings
libraryDependencies ~= { _ map {
  case m if m.organization == "com.typesafe.play" =>
    m.exclude("commons-logging", "com