RDD lineage and broadcast variables

2014-12-12 Thread Ron Ayoub
I'm still wrapping my head around the fact that the data backing an RDD is immutable, since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations where an RDD may need to change (for instance cluster assignments, etc.) based on a
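
For reference, the usual way to handle per-iteration updates without mutating an RDD is to derive a new RDD each pass. A rough sketch, assuming a simple serializable Point bean and a nearestCentroid helper (both placeholder names, not from the thread); centroids should be a serializable List such as ArrayList:

    import java.io.Serializable;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class Reassign {
        public static class Point implements Serializable {
            double[] features;
            int cluster;
            Point(double[] f, int c) { features = f; cluster = c; }
        }

        // placeholder: index of the closest centroid by squared Euclidean distance
        static int nearestCentroid(double[] f, List<double[]> centroids) {
            int best = 0; double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                double d = 0;
                for (int j = 0; j < f.length; j++) d += (f[j] - centroids.get(i)[j]) * (f[j] - centroids.get(i)[j]);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            return best;
        }

        // one clustering pass: build a new RDD with fresh assignments rather than
        // changing the objects inside the old one
        public static JavaRDD<Point> reassign(JavaRDD<Point> points, List<double[]> centroids) {
            return points.map(p -> new Point(p.features, nearestCentroid(p.features, centroids)));
        }
    }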

RE: Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
ze the functions in that DAG to create minimal materialization on its way to final output. Regards Mayur On 06-Dec-2014 6:12 pm, "Ron Ayoub" wrote: This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach?

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
; inherently take up resource, unless you mark them to be persisted. > You're paying the cost of copying objects to create one RDD from next, > but that's mostly it. > > On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub wrote: > > With that said, and the nature of iterative
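
The practical upshot of the reply: intermediates are rebuilt from lineage unless persisted, so when the same RDD feeds more than one action per iteration, marking it persisted keeps the computed partitions around. A minimal sketch, assuming an existing JavaRDD named assignments (placeholder name):

    import org.apache.spark.storage.StorageLevel;

    assignments.persist(StorageLevel.MEMORY_ONLY());   // keep computed partitions in memory
    long n = assignments.count();                       // first action materializes the cache
    // ...further actions in the same iteration reuse the cached partitions...
    assignments.unpersist();                            // release the memory once the iteration is done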

Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item for each iteration of the clustering algorithm. The cluster
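
As the replies in this thread point out, mutating the objects inside foreach can appear to work in local mode, but there is no guarantee that later operations see those edits; the supported pattern is to map to new values. A rough sketch with placeholder names (Item, nearestCluster are not from the thread):

    // risky: in-place mutation inside foreach may seem to work locally but is not guaranteed
    items.foreach(item -> item.setCluster(nearestCluster(item)));

    // supported: produce a new RDD carrying the updated assignments
    JavaRDD<Item> updated = items.map(item -> new Item(item.getId(), nearestCluster(item)));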

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
with the Java > objects inside an RDD when you get a reference to them in a method > like this. This is definitely a bad idea, as there is certainly no > guarantee that any other operations will see any, some or all of these > edits. > > On Fri, Dec 5, 2014 at 2:40 PM, Ron A

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting d
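
For what it's worth, union is lazy and simply concatenates the two lineages; nothing is computed or copied until an action runs. A minimal runnable sketch (data values are made up):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class UnionSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[8]", "union-sketch");
            JavaRDD<String> internal = sc.parallelize(Arrays.asList("c1,d1,0.30", "c1,d2,0.55"));
            JavaRDD<String> leaves = sc.parallelize(Arrays.asList("c2,d1,0.70"));
            JavaRDD<String> all = internal.union(leaves); // lazy: partitions of both RDDs are appended
            System.out.println(all.count());              // 3 -- the action triggers evaluation
            sc.stop();
        }
    }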

how do you turn off info logging when running in local mode

2014-12-04 Thread Ron Ayoub
I have not yet gotten to the point of running standalone. In my case I'm still working on the initial product and I'm running directly in Eclipse, and I've compiled using the Spark maven project since the downloadable Spark binaries require Hadoop. With that said, I'm running fine and I have thin
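
One approach that works when running locally inside Eclipse: either put a log4j.properties with a WARN root level on the classpath (conf/log4j.properties.template is a reasonable starting point), or set the levels programmatically before creating the context:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // silence Spark's and Akka's INFO chatter; do this before the SparkContext is created
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    Logger.getLogger("akka").setLevel(Level.WARN);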

RE: collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
I didn't realize I do get a nice stack trace if not running in debug mode. Basically, I believe Document has to be serializable. But since the question has already been asked, are there other requirements for objects within an RDD that I should be aware of? Serializable is very understandable. Ho

collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
The following code is failing on the collect. If I don't do the collect and go with a JavaRDD it works fine, except I really would like to collect. At first I was getting an error regarding JDI threads and an index being 0. Then it just started locking up. I'm running the Spark context locally o
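
For reference, the requirement that usually bites here is java.io.Serializable: collect() serializes every element and ships it back to the driver, so the element class (Document in this thread) must be serializable; equals/hashCode/clone only matter for key-based operations, not for collect. A minimal sketch with a placeholder bean:

    import java.io.Serializable;
    import java.util.List;

    // the element class must be serializable for collect() to ship it to the driver
    public class Document implements Serializable {
        long id;
        String text;
    }

    // elsewhere, given an existing JavaRDD<Document> docs:
    List<Document> onDriver = docs.collect();   // fails at runtime if Document is not serializable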

Example of Fold

2014-10-31 Thread Ron Ayoub
I want to fold an RDD into a smaller RDD with max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th. So I believe fold is the appropriate method for this. My question is, is there a good fold example out there? Add
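
fold can do this, but since the grouping is by three properties and the aggregation is a per-group maximum, reduceByKey is the more direct fit (fold/reduce collapse the whole RDD to a single value). A rough sketch, assuming a JavaRDD<Bean> named beans with getA/getB/getC/getScore accessors (all placeholder names):

    import scala.Tuple2;
    import org.apache.spark.api.java.JavaPairRDD;

    // key on the three grouping properties, then keep the bean with the larger fourth property
    JavaPairRDD<String, Bean> maxPerGroup = beans
        .mapToPair(b -> new Tuple2<>(b.getA() + "|" + b.getB() + "|" + b.getC(), b))
        .reduceByKey((x, y) -> x.getScore() >= y.getScore() ? x : y);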

RE: winutils

2014-10-29 Thread Ron Ayoub
he Hadoop jars? On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub wrote: Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterd

winutils

2014-10-29 Thread Ron Ayoub
Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterday when creating the Spark context in Java as part of the getting-started app.
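
The error behind this thread is typically the "could not locate winutils.exe" IOException on Windows; Spark needs the Hadoop client libraries it is compiled against, not a running Hadoop cluster. A common workaround (paths here are examples, not from the thread) is to place winutils.exe under some folder's bin directory and point hadoop.home.dir at that folder before creating the context:

    import org.apache.spark.api.java.JavaSparkContext;

    // C:\hadoop\bin must contain winutils.exe; set the property before the context is created
    System.setProperty("hadoop.home.dir", "C:\\hadoop");
    JavaSparkContext sc = new JavaSparkContext("local[*]", "getting-started");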

RE: Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality). The best path here is to take this as a learning opportunity and sit down and learn

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet so, as you might imagine, I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quickly. I need to hydrate an RDD from JDBC/Oracle and so I wanted to use the JdbcRDD. But that i

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code is indicating the constructor is not defined. The only examples of JdbcRDD usage I can find are Scala examples. Does this work in Java? Are there any examples? Thanks. JdbcRDD rdd = new JdbcRDD(sp, () -> ods.getConnection(), sql, 1, 1783059, 1
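
The Scala constructor is hard to call from Java because it expects Scala functions and a ClassTag; later Spark releases added a Java-friendly JdbcRDD.create factory (worth checking that it exists in the version in use). A rough sketch with placeholder connection details, table and column names; note the query needs two ? bind parameters that Spark fills with the partition bounds:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.rdd.JdbcRDD;

    JavaRDD<Long> ids = JdbcRDD.create(
        sc,                                                      // the JavaSparkContext
        new JdbcRDD.ConnectionFactory() {
            public Connection getConnection() throws Exception {
                return DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/svc", "user", "pw");
            }
        },
        "SELECT id FROM documents WHERE id >= ? AND id <= ?",   // the two ? bounds are required
        1L, 1783059L, 8,                                         // lower bound, upper bound, partitions
        rs -> rs.getLong(1));                                    // map each row of the ResultSet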

RE: Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
You can access cached data in Spark through the JDBC server: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub wrote: We have a table containing 25 features per item id along

Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
We have a table containing 25 features per item id along with feature weights. A correlation matrix can be constructed for every feature pair based on co-occurrence. If a user inputs a feature, they can find the features correlated with it via a self-join that requires a single full-table scan.
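
If the goal is to serve that self-join interactively, one option the reply points at is Spark SQL with the table cached in memory, so only the first query pays the scan of the source. A rough sketch (the item_features table name and columns are placeholders, and the exact SQL entry points differ across 1.x releases):

    import org.apache.spark.sql.SQLContext;

    SQLContext sqlCtx = new SQLContext(sc.sc());   // sc is the JavaSparkContext
    // assumes item_features(item_id, feature, weight) has already been registered as a table
    sqlCtx.sql("CACHE TABLE item_features");
    // co-occurrence counts per feature pair; later queries hit the in-memory columnar cache
    sqlCtx.sql(
        "SELECT a.feature AS f1, b.feature AS f2, COUNT(*) AS cooccurrences " +
        "FROM item_features a JOIN item_features b " +
        "  ON a.item_id = b.item_id AND a.feature <> b.feature " +
        "GROUP BY a.feature, b.feature").collect();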