Re: how to implement my own datasource?

2015-06-25 Thread jimfcarroll
I'm not sure if this is what you're looking for, but we have several custom RDD implementations for internal data format/partitioning schemes. The Spark API is really simple and consists primarily of being able to implement 3 simple things: 1) You need a class that extends RDD that's lightweight…
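[The message above is truncated, but the contract it refers to is the small surface a custom RDD must implement: describe the partitioning (`getPartitions`) and produce an iterator over one partition's records (`compute`). The sketch below mirrors that shape with hypothetical stand-in types so it runs without a Spark installation; in real code you would extend `org.apache.spark.rdd.RDD[T]` instead.]

```scala
// Hypothetical stand-in for org.apache.spark.Partition, kept local so the
// sketch is self-contained.
trait Partition { def index: Int }

case class RangePartition(index: Int, start: Int, end: Int) extends Partition

// Minimal analogue of a custom RDD over an internal data format:
// getPartitions encodes the partitioning scheme, compute streams one
// partition's records. All names here are illustrative, not Spark's.
class RangeSource(total: Int, numPartitions: Int) {
  def getPartitions: Array[Partition] = {
    val step = math.max(1, total / numPartitions)
    (0 until numPartitions).map { i =>
      val start = i * step
      val end = if (i == numPartitions - 1) total else math.min(total, start + step)
      RangePartition(i, start, end): Partition
    }.toArray
  }

  def compute(split: Partition): Iterator[Int] = split match {
    case RangePartition(_, s, e) => (s until e).iterator
  }

  // roughly what a collect over all partitions would do
  def collectAll: Seq[Int] = getPartitions.toSeq.flatMap(p => compute(p))
}
```

In real Spark the same two overrides (`compute(split, context)` and `getPartitions`) are the core of the RDD subclass; everything else the class holds should stay lightweight because it is serialized to the workers.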

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Yana and Sean, thanks for the feedback. I can get it to work a number of ways; I'm just wondering if there's a preferred means. One last question: is there a reason the deployed Spark install doesn't contain the same version of several classes as the maven dependency? Is this intentional? Thanks…

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Ah. I've avoided using spark-submit primarily because our use of Spark is as part of an analytics library that's meant to be embedded in other applications with their own lifecycle management. One of those applications is a REST app running in Tomcat, which will make the use of spark-submit difficult…

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Hi Sean, I'm packaging Spark with my (standalone) driver app using maven. Any assemblies that are used on the Mesos workers through extending the classpath or providing the jars in the driver (via the SparkConf) aren't packaged with Spark (it seems obvious that would be a mistake). I need, for example…

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hi Sean, I'm running a Mesos cluster. My driver app is built using maven against the Spark 1.4.0 maven dependency. The Mesos slave machines have the Spark distribution installed from the distribution link. I have a hard time understanding how this isn't a standard app deployment, but maybe I'm missing…

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
These jars are simply incompatible. You can see this by looking at that class in both the maven repo for 1.4.0 here: http://central.maven.org/maven2/org/apache/spark/spark-core_2.10/1.4.0/spark-core_2.10-1.4.0.jar as well as in the spark-assembly jar inside the .tgz file you can get from the official…

Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hello all, I have a strange problem. I have a Mesos Spark cluster with Spark 1.4.0/Hadoop 2.4.0 installed and a client application that uses maven to include the same versions. However, I'm getting a serialVersionUID problem on: ERROR Remoting - org.apache.spark.storage.BlockManagerMessages$RegisterB…
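[The error above is Java serialization's version check: when the driver and the workers load different builds of the same `Serializable` class, their computed serialVersionUIDs differ and deserialization fails with an InvalidClassException ("local class incompatible"). The snippet below is a small self-contained illustration, using a hypothetical message class rather than Spark's internal one, of how the JVM exposes the UID it will put on the wire and how pinning it with `@SerialVersionUID` makes otherwise-different builds compatible.]

```scala
import java.io.ObjectStreamClass

// Hypothetical stand-in for a Spark internal message class. Without the
// annotation, the JVM derives the UID from the class's shape, so any
// recompile against a different build can change it; pinning it keeps
// serialized forms compatible across builds.
@SerialVersionUID(42L)
case class RegisterMsg(executorId: String) extends Serializable

object UidDemo {
  def main(args: Array[String]): Unit = {
    // ObjectStreamClass reports the UID the JVM uses during (de)serialization.
    val uid = ObjectStreamClass.lookup(classOf[RegisterMsg]).getSerialVersionUID
    println(uid) // 42
  }
}
```

For the mismatch in this thread the practical fix is not annotations but making the driver's packaged Spark classes and the cluster's installed Spark assembly come from the same build.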

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Okay. PR: https://github.com/apache/spark/pull/5669 Jira: https://issues.apache.org/jira/browse/SPARK-7100 Hope that helps. Let me know if you need anything else. Jim

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Sean and Joe, I have another question. GradientBoostedTrees.run iterates over the RDD, calling DecisionTree.run on each iteration with a new random sample from the input RDD. DecisionTree.run calls RandomForest.run, which also calls persist. One of these seems superfluous. Should I simply re…

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Joe, Do you want a PR per branch (one for master, one for 1.3)? Are you still maintaining 1.2? Do you need a Jira ticket per PR, or can I submit them all under the same ticket? Or should I just submit it to master and let you guys back-port it? Jim

GradientBoostTrees leaks a persisted RDD

2015-04-22 Thread jimfcarroll
Hi all, It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181 In 1.3.1 it's here: https://github.com/…
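[The bug pattern described above, a `persist` with no matching `unpersist`, leaks cached blocks for the lifetime of the context. A common fix shape is to pair the two in a try/finally. The sketch below uses a hypothetical stand-in class instead of a real RDD so it runs without Spark; the point is only the pairing discipline, not Spark's actual API.]

```scala
// Toy stand-in for an RDD's persist/unpersist bookkeeping.
class FakeCached(val name: String) {
  var persisted = false
  def persist(): this.type = { persisted = true; this }
  def unpersist(): Unit = { persisted = false }
}

object PersistPattern {
  // Pair every persist() with an unpersist() in a finally block, so the
  // cached data is released even when the body throws.
  def withPersisted[A](c: FakeCached)(body: FakeCached => A): A = {
    c.persist()
    try body(c)
    finally c.unpersist()
  }
}
```

Applied to the code linked above, the intermediate RDD that GradientBoostedTrees persists would be unpersisted once the iteration that needed it completes.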

Re: RDD.count

2015-03-28 Thread jimfcarroll
Hello all, I worked around this for now using the class (that I already had) that inherits from RDD and is the one all of our custom RDDs inherit from. I did the following: 1) Overload all of the transformations (that get used in our app) that don't change the RDD size, wrapping the results with a…

Re: RDD.count

2015-03-28 Thread jimfcarroll
Hi Sean, Thanks for the response. I can't imagine a case (though my imagination may be somewhat limited) where even map side effects could change the number of elements in the resulting RDD. I guess "count" wouldn't officially be an 'action' if it were implemented this way. At least it wouldn't…

RDD.count

2015-03-27 Thread jimfcarroll
Hi all, I was wondering why the RDD.count call recomputes the RDD in all cases. In most cases it can simply ask the next dependent RDD. I have several RDD implementations and was surprised to see a call like the following never call my RDD's count method but instead recompute/traverse the entire d…