Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
I haven't experimented with it; that's a use-case I could think of in theory. ^^ However, from what I've seen, BFGS converges really fast, so I only need 20~30 iterations in general. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: htt

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
Have you experimented with it? For logistic regression at least, given enough iterations and a tight enough tolerance, BFGS both ways should converge to the same solution On Tue, Apr 8, 2014 at 4:19 PM, DB Tsai wrote: > I think mini batch is still useful for L-BFGS. > > One of the use-cases

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
I think mini batch is still useful for L-BFGS. One of the use-cases could be initializing the weights by training on a smaller subsample of the data using mini-batch L-BFGS. Then we could use the weights trained with the mini batch to start another training process with the full data. Sincerely, DB
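The warm-start idea above can be sketched in plain Scala (no Spark; the toy 1-D least-squares objective, step size, and tolerance are illustrative assumptions, not MLlib's API): train on a subsample first, then reuse those weights as the starting point for the full-data run.

```scala
// Toy gradient descent on 0.5 * mean((w*x - y)^2); returns (weights, iterations).
def gradientDescent(data: Seq[(Double, Double)], w0: Double,
                    step: Double = 0.1, tol: Double = 1e-8): (Double, Int) = {
  var w = w0
  var iters = 0
  var g = Double.MaxValue
  while (math.abs(g) > tol && iters < 10000) {
    // gradient of the mean squared error with respect to w
    g = data.map { case (x, y) => (w * x - y) * x }.sum / data.size
    w -= step * g
    iters += 1
  }
  (w, iters)
}

val full   = (1 to 1000).map(i => (i / 1000.0, 2.0 * i / 1000.0)) // y = 2x
val sample = full.grouped(10).map(_.head).toSeq                   // 10% "mini batch"

val (wWarm, _)          = gradientDescent(sample, w0 = 0.0) // cheap pre-training
val (w1, itersCold)     = gradientDescent(full,   w0 = 0.0) // cold start
val (w2, itersWarm)     = gradientDescent(full,   w0 = wWarm) // warm start
```

The warm-started run converges in far fewer iterations because it starts near the optimum, which is the point of the two-phase scheme described above.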

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys, We're going to hold a series of meetups about machine learning with Spark in San Francisco. The first one will be on April 24. Xiangrui Meng from Databricks will talk about Spark, Spark/Python, feature engineering, and MLlib. See http://www.meetup.com/sfmachinelearning/events/174560212

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
Yup, that's what I expected... the L-BFGS solver is in the master and the gradient computation per RDD is done on each of the workers... This miniBatchFraction is also a heuristic which I don't think makes sense for LogisticRegressionWithBFGS... does it? On Tue, Apr 8, 2014 at 3:44 PM, DB Tsai wrote: >

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
Hi Debasish, The L-BFGS solver will be in the master like the GD solver, and the part that is parallelized is computing the gradient of each input row and summing them up. I prefer to make the optimizer pluggable instead of adding a new LogisticRegressionWithLBFGS, since 98% of the code will be the sam
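A minimal sketch of the pluggable-optimizer idea in plain Scala (all names here are illustrative, not MLlib's actual API): the model takes any Optimizer, so swapping gradient descent for L-BFGS would not duplicate the surrounding code.

```scala
// Hypothetical interface: any optimizer that maps (data, initial weights) to weights.
trait Optimizer {
  def optimize(data: Seq[(Double, Array[Double])],
               initialWeights: Array[Double]): Array[Double]
}

// One possible plug-in: batch gradient descent on the logistic loss.
class GradientDescent(stepSize: Double, numIterations: Int) extends Optimizer {
  def optimize(data: Seq[(Double, Array[Double])],
               initialWeights: Array[Double]): Array[Double] = {
    val w = initialWeights.clone
    for (_ <- 1 to numIterations) {
      val grad = Array.fill(w.length)(0.0)
      for ((label, x) <- data) {
        val margin = (w zip x).map { case (wi, xi) => wi * xi }.sum
        // derivative of the logistic loss with respect to the margin
        val mult = 1.0 / (1.0 + math.exp(-margin)) - label
        for (i <- w.indices) grad(i) += mult * x(i)
      }
      for (i <- w.indices) w(i) -= stepSize * grad(i) / data.size
    }
    w
  }
}

// The model only depends on the Optimizer trait, not on a concrete solver.
class LogisticRegression(optimizer: Optimizer) {
  def train(data: Seq[(Double, Array[Double])]): Array[Double] =
    optimizer.optimize(data, new Array[Double](data.head._2.length))
}

val data    = Seq((0.0, Array(-1.0)), (1.0, Array(1.0)))
val weights = new LogisticRegression(new GradientDescent(1.0, 100)).train(data)
```

An L-BFGS implementation of the same trait could be dropped in without touching `LogisticRegression`, which is the "98% shared code" point made in the thread.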

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Reynold Xin
Nick and Koert summarized it pretty well. Just to clarify and give some concrete examples. If you want to start with a specific vertex, and follow some path, it is probably easier and faster to use some key values store or even MySQL or a graph database. If you want to count the average length of

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Nick Pentreath
Likely neither will give real-time for full-graph traversal, no. And once in memory, GraphX would definitely be faster for "breadth-first" traversal. But for "vertex-centric" traversals (starting from a vertex and traversing edges from there, such as "friends of friends" queries etc) then Titan is

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Koert Kuipers
it all depends on what kind of traversal. if it's point traversal then something random-access based would be great. if it's more scan-like traversal then spark will fit On Tue, Apr 8, 2014 at 4:56 PM, Evan Chan wrote: > I doubt Titan would be able to give you traversal of billions of nodes i

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Evan Chan
I doubt Titan would be able to give you traversal of billions of nodes in real-time either. In-memory traversal is typically much faster than Cassandra-based tree traversal, even including in-memory caching. On Tue, Apr 8, 2014 at 1:23 PM, Nick Pentreath wrote: > GraphX, like Spark, will not t

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
Anurag, There is another method called newAPIHadoopRDD that takes in a Configuration object rather than a path. Give that a shot? https://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext On Tue, Apr 8, 2014 at 1:47 PM, Anurag wrote: > andrew - yes, i am using th

Re: reading custom input format in Spark

2014-04-08 Thread Anurag
andrew/nick, thx for the input, got it to work: sc.hadoopConfiguration.set("record.delimiter.regex", "^[A-Za-z]{3},\\s\\d{2}\\s[A-Za-z]{3}.*") :-) -anurag On Tue, Apr 8, 2014 at 1:47 PM, Anurag wrote: > andrew - yes, i am using the PatternInputFormat from the blog post you > referenced. > I
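A quick sanity check of that delimiter regex in plain Scala (the sample lines below are made up): it matches record headers that begin with a three-letter day name, a comma, a two-digit day, and a three-letter month, which is how the custom InputFormat splits records.

```scala
// The record-delimiter regex from the thread, compiled as a Scala Regex.
val delim = "^[A-Za-z]{3},\\s\\d{2}\\s[A-Za-z]{3}.*".r

// A line starts a new record iff it matches the delimiter pattern.
def isRecordStart(line: String): Boolean = delim.findFirstIn(line).isDefined
```

Checking the pattern locally like this before wiring it into `sc.hadoopConfiguration` is a cheap way to catch escaping mistakes in the regex string.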

Re: reading custom input format in Spark

2014-04-08 Thread Anurag
andrew - yes, i am using the PatternInputFormat from the blog post you referenced. I know how to set the pattern in the configuration while writing an MR job; how do i do that from the spark shell? -anurag On Tue, Apr 8, 2014 at 1:41 PM, Andrew Ash wrote: > Are you using the PatternInputFormat from

Re: reading custom input format in Spark

2014-04-08 Thread Nick Pentreath
Seems like you need to initialise a regex pattern for that InputFormat. How is this done? Perhaps via a config option? In which case you need to first create a Hadoop configuration, set the appropriate config option for the regex, and pass that into newAPIHadoopFile. On Tue, Apr 8, 2014 at 10:36

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
Are you using the PatternInputFormat from this blog post? https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/ If so you need to set the pattern in the configuration before attempting to read data with that InputFormat: String regex = "^[A-Za-

reading custom input format in Spark

2014-04-08 Thread Anurag
Hi, I am able to read a custom input format in spark. scala> val inputRead = sc.newAPIHadoopFile("hdfs://127.0.0.1/user/cloudera/date_dataset/",classOf[io.reader.PatternInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text]) However, doing an inputRead.count()

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Nick Pentreath
GraphX, like Spark, will not typically be "real-time" (where by "real-time" here I assume you mean of the order of a few 10s-100s ms, up to a few seconds). Spark can in some cases approach the upper boundary of this definition (a second or two, possibly less) when data is cached in memory and the

Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread love2dishtech
Hi, Is GraphX, on top of Apache Spark, able to process large-scale distributed graph traversal and compute in real time? What is the query execution engine that distributes the query on top of GraphX and Apache Spark? My typical use case is a large scale distributed graph traversal in real time

Re: Contributing to Spark

2014-04-08 Thread Michael Ernest
Ha ha! nice try, sheepherder! ;-) On Tue, Apr 8, 2014 at 12:37 PM, Matei Zaharia wrote: > Shh, maybe I really wanted people to fix that one issue. > > On Apr 8, 2014, at 9:34 AM, Aaron Davidson wrote: > > > Matei's link seems to point to a specific starter project as part of the > > starter lis

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
By the way, these changes are needed in mllib.regression as well. Right now my use-cases need BFGS support in logistic regression and MLOR, so can we focus on cleaning up the classification package first? On Tue, Apr 8, 2014 at 9:42 AM, Debasish Das wrote: > Hi DB, > > Are we going to clean u

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
Hi DB, Are we going to clean up the function: class LogisticRegressionWithSGD private ( var stepSize: Double, var numIterations: Int, var regParam: Double, var miniBatchFraction: Double) extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable { val gradi

Re: Contributing to Spark

2014-04-08 Thread Matei Zaharia
Shh, maybe I really wanted people to fix that one issue. On Apr 8, 2014, at 9:34 AM, Aaron Davidson wrote: > Matei's link seems to point to a specific starter project as part of the > starter list, but here is the list itself: > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20

Re: Contributing to Spark

2014-04-08 Thread Aaron Davidson
Matei's link seems to point to a specific starter project as part of the starter list, but here is the list itself: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) On Mon, Apr 7, 20