I haven't experimented with it; that's a use-case I could think of in
theory. ^^ However, from what I've seen, L-BFGS converges really fast, so
I only need 20~30 iterations in general.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Have you experimented with it? For logistic regression at least, given
enough iterations and the tolerance you are using, L-BFGS both with and
without mini batch should converge to the same solution.
On Tue, Apr 8, 2014 at 4:19 PM, DB Tsai wrote:
I think mini batch is still useful for L-BFGS.
One of the use-cases is initializing the weights by training on a
smaller subsample of the data using mini batch with L-BFGS.
Then we could use the weights trained with mini batch to start another
training process with the full data.
Sincerely,
DB
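A rough sketch of that warm-start idea, assuming the current MLlib API
(the 10% sampling fraction, the iteration count, and the seed are just
illustrative; `run(data, initialWeights)` comes from
GeneralizedLinearAlgorithm):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// data: RDD[LabeledPoint], already loaded.
// 1. Train quickly on a small subsample to get rough initial weights.
val subsample = data.sample(false, 0.1, 42)
val roughModel = LogisticRegressionWithSGD.train(subsample, 20)

// 2. Warm-start training on the full data from those weights.
val lr = new LogisticRegressionWithSGD()
val finalModel = lr.run(data, roughModel.weights)
```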
Hi guys,
We're going to hold a series of meetups about machine learning with
Spark in San Francisco.
The first one will be on April 24. Xiangrui Meng from Databricks will
talk about Spark, Spark/Python, feature engineering, and MLlib.
See http://www.meetup.com/sfmachinelearning/events/174560212
Yup, that's what I expected... the L-BFGS solver is in the master, and the
gradient computation per RDD is done on each of the workers...
This miniBatchFraction is also a heuristic which I don't think makes sense
for LogisticRegressionWithBFGS... does it?
On Tue, Apr 8, 2014 at 3:44 PM, DB Tsai wrote:
Hi Debasish,
The L-BFGS solver will be in the master like the GD solver, and the part
that is parallelized is computing the gradient of each input row and
summing them up.
I prefer to make the optimizer pluggable instead of adding a new
LogisticRegressionWithLBFGS, since 98% of the code will be the same.
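A minimal sketch of what such a pluggable optimizer could look like (the
trait and method signature here are illustrative, not the final API):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical interface: both GD and L-BFGS would implement it, and
// GeneralizedLinearAlgorithm would simply delegate to whichever
// instance it holds, so swapping solvers needs no new algorithm class.
trait Optimizer extends Serializable {
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector
}
```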
Nick and Koert summarized it pretty well. Just to clarify and give some
concrete examples.
If you want to start with a specific vertex and follow some path, it is
probably easier and faster to use a key-value store, or even MySQL, or a
graph database.
If you want to count the average length of
Likely neither will give real-time for full-graph traversal, no. And once
in memory, GraphX would definitely be faster for "breadth-first" traversal.
But for "vertex-centric" traversals (starting from a vertex and traversing
edges from there, such as "friends of friends" queries etc) then Titan is
It all depends on what kind of traversal. If it's point traversal, then
something random-access based would be great.
If it's a more scan-like traversal, then Spark will fit.
On Tue, Apr 8, 2014 at 4:56 PM, Evan Chan wrote:
I doubt Titan would be able to give you traversal of billions of nodes in
real-time either. In-memory traversal is typically much faster than
Cassandra-based tree traversal, even including in-memory caching.
On Tue, Apr 8, 2014 at 1:23 PM, Nick Pentreath wrote:
Anurag,
There is another method called newAPIHadoopRDD that takes in a
Configuration object rather than a path. Give that a shot?
https://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
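Roughly like this, assuming the config key from the PatternInputFormat
blog post and the standard Hadoop 2 input-path key (the path and pattern
are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

val conf = new Configuration()
// With newAPIHadoopRDD the input path goes into the Configuration
// rather than being passed as an argument:
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///path/to/input")
conf.set("record.delimiter.regex", "<your pattern>")

val records = sc.newAPIHadoopRDD(conf,
  classOf[io.reader.PatternInputFormat],
  classOf[LongWritable],
  classOf[Text])
```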
On Tue, Apr 8, 2014 at 1:47 PM, Anurag wrote:
andrew/nick,
thx for the input, got it to work:
sc.hadoopConfiguration.set("record.delimiter.regex",
"^[A-Za-z]{3},\\s\\d{2}\\s[A-Za-z]{3}.*")
:-)
-anurag
On Tue, Apr 8, 2014 at 1:47 PM, Anurag wrote:
andrew - yes, i am using the PatternInputFormat from the blog post you
referenced.
I know how to set the pattern in the configuration when writing an MR job;
how do I do that from the Spark shell?
-anurag
On Tue, Apr 8, 2014 at 1:41 PM, Andrew Ash wrote:
Seems like you need to initialise a regex pattern for that input format.
How is this done? Perhaps via a config option?
In that case you need to first create a Hadoop Configuration, set the
appropriate config option for the regex, and pass that into
newAPIHadoopFile.
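In the shell that would look roughly like this (the config key is the one
PatternInputFormat reads, per the blog post; the path and pattern are
placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

// Copy the shell's Hadoop config and add the regex the InputFormat needs.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("record.delimiter.regex", "<your pattern>")

val records = sc.newAPIHadoopFile(
  "hdfs:///path/to/input",
  classOf[io.reader.PatternInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
```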
On Tue, Apr 8, 2014 at 10:36
Are you using the PatternInputFormat from this blog post?
https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/
If so, you need to set the pattern in the configuration before attempting
to read data with that InputFormat:
String regex = "^[A-Za-
Hi,
I am able to read a custom input format in Spark:
scala> val inputRead = sc.newAPIHadoopFile("hdfs://127.0.0.1/user/cloudera/date_dataset/", classOf[io.reader.PatternInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text])
However, doing a
inputRead.count()
GraphX, like Spark, will not typically be "real-time" (where by "real-time"
here I assume you mean of the order of a few 10s-100s ms, up to a few
seconds).
Spark can in some cases approach the upper boundary of this definition (a
second or two, possibly less) when data is cached in memory and the
Hi,
Is GraphX, on top of Apache Spark, able to process large-scale
distributed graph traversal and computation in real time? What is the
query execution engine that distributes the query on top of GraphX and
Apache Spark?
My typical use case is a large-scale distributed graph traversal in real
time.
Ha ha! nice try, sheepherder! ;-)
On Tue, Apr 8, 2014 at 12:37 PM, Matei Zaharia wrote:
> Shh, maybe I really wanted people to fix that one issue.
By the way, these changes are needed in mllib.regression as well.
Right now my use cases need BFGS support in logistic regression and MLOR,
so can we focus on cleaning up the classification package first?
On Tue, Apr 8, 2014 at 9:42 AM, Debasish Das wrote:
Hi DB,
Are we going to clean up the function:
class LogisticRegressionWithSGD private (
    var stepSize: Double,
    var numIterations: Int,
    var regParam: Double,
    var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable {
  val gradi
Shh, maybe I really wanted people to fix that one issue.
On Apr 8, 2014, at 9:34 AM, Aaron Davidson wrote:
Matei's link seems to point to a specific starter project as part of the
starter list, but here is the list itself:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)