I'm trying to run a forked version of MLlib in which I am experimenting with a boosted trees implementation. Here is what I've tried so far, but I can't get it to work properly:
*Directory layout:*

  src/spark-dev (spark github fork)
    pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
  src/forestry (test driver)
    pom.xml - depends on spark-core and spark-mllib with version 1.2

*spark-defaults.conf:*

  spark.master spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
  spark.verbose true
  # I've tried both true and false here
  spark.files.userClassPathFirst false
  spark.executor.memory 6G
  spark.jars spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar

*Build and run script:*

  MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
  PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
  FORESTRY_DIR=~/src/forestry-main
  SPARK_DIR=~/src/spark-dev

  # Build the forked Spark modules and install them into the local Maven repository
  cd $SPARK_DIR
  mvn -T8 -DskipTests -pl core,mllib,streaming install

  # Build the test driver against the freshly installed snapshot JARs
  cd $FORESTRY_DIR
  mvn -T8 -DskipTests package

  # Copy the forked Spark JARs, the driver JAR, and the config to the master
  rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
  rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
  rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
  rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
  rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf

  # spark-submit options must come before the application JAR;
  # anything after the JAR is passed to the application as arguments
  ssh $MASTER "spark/bin/spark-submit --class forestry.TreeTest --verbose $PRIMARY_JAR"

In spark-dev/mllib I've added a new class, GradientBoostingTree, which I reference from TreeTest in my test driver. The driver pulls some data from S3, converts it to LabeledPoint, and then calls GradientBoostingTree.train(...) the same way DecisionTree is called. This all works until we call

  examples.map { x => tree.predict(x.features) }

where tree is a DecisionTree that I've also modified in my fork. At this point, the workers blow up because they can't find a new method I've added to the tree.model.Node class.
My suspicion is that the workers have deserialized the DecisionTreeModel against a different version of MLlib that doesn't have my changes. Is my setup all wrong? I'm using an EC2 cluster because it is so easy to start up and manage. Do I need to fully distribute my new version of Spark to all the workers before starting the job? Is there an easy way to do that?
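
One way I could probably test the suspicion above is to ask the JVM where a class was actually loaded from, and whether that copy has the new method. Below is a minimal sketch in plain Java with no Spark dependency. The names ClassOrigin, originOf, and declaresMethod are hypothetical helpers made up for illustration; they are not part of Spark or MLlib.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports which JAR or directory a class
// was loaded from and whether it declares a method with a given name.
class ClassOrigin {

    /** Returns the location (JAR or directory URL) the class was loaded from,
     *  or a marker string when the JVM reports no code source. */
    static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes loaded by the bootstrap loader (e.g. java.lang.String)
            // typically have no code source.
            return "<no code source>";
        }
        return source.getLocation().toString();
    }

    /** True if the class itself declares a method with this name
     *  (any signature, any visibility; inherited methods are not counted). */
    static boolean declaresMethod(Class<?> cls, String methodName) {
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in classes; on the cluster the class of interest would be
        // the modified Node class instead.
        System.out.println(originOf(ClassOrigin.class));
        System.out.println(originOf(String.class));
        System.out.println(declaresMethod(String.class, "length"));
    }
}
```

Assuming the helper is packaged into the driver JAR, calling originOf on the modified Node class from inside the map closure (so that it runs on the workers) and collecting the distinct results should show whether the workers resolve the class from my snapshot JAR or from the Spark build already installed on the cluster. I have not verified this on the cluster itself.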