I'm trying to run a forked version of mllib where I'm experimenting with a
boosted trees implementation. Here is what I've tried, but I can't seem to
get it working properly:

*Directory layout:*

src/spark-dev  (Spark GitHub fork)
  pom.xml - I've arbitrarily changed the version to 1.2 in core and mllib
src/forestry-main  (test driver)
  pom.xml - depends on spark-core and spark-mllib with version 1.2

*spark-defaults.conf:*

spark.master                    spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
spark.verbose                   true
# I've tried both true and false here:
spark.files.userClassPathFirst  false
spark.executor.memory           6G
spark.jars                      spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar
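
One thing I'm unsure about: as far as I understand, bare filenames in
spark.jars are resolved relative to the driver's working directory when
spark-submit ships them out. A variant with absolute paths, assuming the
rsync'ed jars land in root's home directory on the master (that path is my
guess):

spark.jars                      /root/spark-mllib_2.10-1.2.0-SNAPSHOT.jar,/root/spark-core_2.10-1.2.0-SNAPSHOT.jar,/root/spark-streaming_2.10-1.2.0-SNAPSHOT.jar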

*Build and run script:*

MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,streaming install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER "spark/bin/spark-submit --class forestry.TreeTest --verbose $PRIMARY_JAR"

In spark-dev/mllib I've added a new class, GradientBoostingTree, which I
reference from TreeTest in my test driver. The driver pulls some data from
S3, converts it to LabeledPoint, and then calls GradientBoostingTree.train(...)
exactly the way DecisionTree does. This all works until we call
examples.map { x => tree.predict(x.features) }, where tree is a DecisionTree
that I've also modified in my fork. At that point the workers blow up
because they can't find a new method I've added to the tree.model.Node
class. My suspicion is that the workers are deserializing the
DecisionTreeModel against a different version of mllib, one that doesn't
have my changes.
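
One sanity check I can think of is to confirm, on the master, that the jar
I'm shipping really contains the new method (the fully qualified class name
below assumes the method lives on the stock mllib Node class):

jar tf spark-mllib_2.10-1.2.0-SNAPSHOT.jar | grep tree/model/Node
javap -classpath spark-mllib_2.10-1.2.0-SNAPSHOT.jar org.apache.spark.mllib.tree.model.Node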

Is my setup all wrong? I'm using an EC2 cluster because it is so easy to
start up and manage; maybe I need to fully distribute my new version of
Spark to all the workers before starting the job? Is there an easy way to
do that?
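
The best guess I have is something like the following, assuming a spark-ec2
cluster (the ~/spark-ec2/copy-dir helper and the spark/lib assembly location
are my assumptions about the stock EC2 layout, which I haven't verified):

# Build the full assembly from the fork, push it to the master, then fan it
# out to the workers and bounce the cluster.
cd $SPARK_DIR
mvn -T8 -DskipTests package
rsync --progress assembly/target/scala-2.10/spark-assembly-*.jar $MASTER:spark/lib/
ssh $MASTER "~/spark-ec2/copy-dir spark/lib \
  && spark/sbin/stop-all.sh && spark/sbin/start-all.sh"

Does that look right, or is there a cleaner way?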
