Re: Welcoming some new committers

2018-03-05 Thread Seth Hendrickson
Thanks all! :D

On Mon, Mar 5, 2018 at 9:01 AM, Bryan Cutler  wrote:

> Thanks everyone, this is very exciting!  I'm looking forward to working
> with you all and helping out more in the future.  Also, congrats to the
> other committers as well!!
>


Re: Regularized Logistic regression

2016-10-13 Thread Seth Hendrickson
Spark MLlib provides a cross-validation toolkit for selecting
hyperparameters. I think you'll find the documentation quite helpful:

http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation

There is actually a Python example for logistic regression there. If you
still have questions after reading it, then please post back again.
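
For reference, here is a minimal sketch of using CrossValidator to pick
regParam (the grid values, the number of folds, and the default
area-under-ROC metric are assumptions for illustration; adjust them for
your data):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=500, elasticNetParam=0.5)

# Candidate regularization strengths to search over.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.001, 0.01, 0.1, 1.0])
        .build())

# Default metric is area under ROC; swap in another evaluator if you prefer.
evaluator = BinaryClassificationEvaluator()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)

# data_train_df is assumed to be a DataFrame with "features" and "label".
cv_model = cv.fit(data_train_df)

# avgMetrics lines up with the param grid, so you can see how each
# regParam value performed instead of guessing by trial and error.
for params, metric in zip(grid, cv_model.avgMetrics):
    print("regParam = %s, metric = %s" % (params[lr.regParam], metric))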

Hope that helps.

On Thu, Oct 13, 2016 at 12:58 PM, aditya1702  wrote:

> Ok so I tried setting the regParam and tried lowering it. How do I evaluate
> which regParam is best? Do I have to do it by trial and error? I am
> currently calculating the log_loss for the model. Is that a good way to find
> the best regParam value? Here is my code:
>
> from math import exp,log
> #from pyspark.sql.functions import log
> epsilon = 1e-16
> def sigmoid_log_loss(w,x):
>   ans=float(1/(1+exp(-(w.dot(x.features)))))
>   # clip predictions away from exactly 0 or 1 so log() stays finite
>   if ans==0:
>     ans=ans+epsilon
>   if ans==1:
>     ans=ans-epsilon
>   log_loss=-((x.label)*log(ans)+(1-x.label)*log(1-ans))
>   return ((ans,x.label),log_loss)
>
> ---
> reg=0.02
> from pyspark.ml.classification import LogisticRegression
> lr=LogisticRegression(regParam=reg,maxIter=500,standardization=True,
> elasticNetParam=0.5)
> model=lr.fit(data_train_df)
>
> w=model.coefficients
> intercept=model.intercept
> data_predicted_df=data_val_df.map(lambda x:(sigmoid_log_loss(w,x)))
> log_loss=data_predicted_df.map(lambda x:x[1]).mean()
> print log_loss
>
>
>
>


Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Seth Hendrickson
I think the proposal laid out in SPARK-18813 is well done, and I do think
it is going to improve the process going forward. I also really like the
idea of getting the community to vote on JIRAs to give some of them
priority - provided that we listen to those votes, of course. The biggest
problem I see is that we do have several active contributors and those who
want to help implement these changes, but PRs are reviewed rather
sporadically and I imagine it is very difficult for contributors to
understand why some get reviewed and some do not. The most important thing
we can do, given that MLlib currently has very limited committer review
bandwidth, is to make clear which issues, if worked on, will definitely get
reviewed. That is a hard thing to do in open source, no doubt, but even if
we have to limit the scope of such issues to a very small subset, I think
it's a gain for everyone.

On a related note, I would love to hear some discussion on the higher level
goal of Spark MLlib (if this derails the original discussion, please let me
know and we can discuss in another thread). The roadmap does contain
specific items that help to convey some of this (ML parity with MLlib,
model persistence, etc...), but I'm interested in what the "mission" of
Spark MLlib is. We often see PRs for brand new algorithms which are
sometimes rejected and sometimes not. Do we aim to keep implementing more
and more algorithms? Or is our focus really, now that we have a reasonable
library of algorithms, to simply make the existing ones faster/better/more
robust? Should we aim to provide interfaces that developers can easily
extend with their own custom code (e.g. custom optimization libraries), or
do we want to restrict things to out-of-the-box algorithms? Should we focus
on more flexible, general abstractions like
distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
wrote:

> Hi all,
>
> This is a general call for thoughts about the process for the MLlib
> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>
> Summary:
> * This process is about committers indicating intention to shepherd and
> review.
> * The goal is to improve visibility and communication.
> * This is fairly orthogonal to the SIP discussion since this proposal is
> more about setting release targets than about proposing future plans.
>
> Thanks!
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Re: MLlib mission and goals

2017-01-31 Thread Seth Hendrickson
I agree with what Sean said about not supporting arbitrarily many
algorithms. I think the goal of MLlib should be to support only core
algorithms for machine learning. Ideally Spark ML provides a relatively
small set of algorithms that are heavily optimized, and also provides a
framework that makes it easy for users to extend and build their own
packages and algos when they need to. Spark ML is already quite good for
this. We have of course been doing a lot of work migrating to this new API,
and now that we are approaching full parity, it would be good to shift the
focus to performance as others have noted. Supporting a few algorithms that
perform very well is significantly better than supporting many algorithms
with moderate performance, IMO.
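
As a toy illustration of that extension point, here is a rough sketch of a
custom Transformer that can be dropped into a Pipeline (the class, the
column names, and the log1p feature it adds are all made up for the
example, not an existing MLlib component):

from pyspark.ml import Pipeline, Transformer
from pyspark.sql.functions import col, log1p

class Log1pTransformer(Transformer):
    """Appends log1p(inputCol) as a new column; usable as a Pipeline stage."""
    def __init__(self, inputCol, outputCol):
        super(Log1pTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # Called by transform(); Pipelines invoke it on each DataFrame.
        return dataset.withColumn(self.outputCol, log1p(col(self.inputCol)))

# e.g. chain it ahead of an estimator; "amount" is a hypothetical column.
pipeline = Pipeline(stages=[Log1pTransformer("amount", "log_amount")])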

I also think a more complete, optimized distributed linear algebra library
would be a great asset, but it may be a more long term goal. A performance
framework for regression testing would be great, but keeping it up to date
is difficult.

Thanks for kicking this thread off Joseph!

On Tue, Jan 24, 2017 at 7:30 PM, Joseph Bradley 
wrote:

> *Re: performance measurement framework*
> We (Databricks) used to use spark-perf, but that was mainly for the
> RDD-based API.  We've now switched to spark-sql-perf, which does include
> some ML benchmarks despite the project name.  I'll see about updating the
> project README to document how to run MLlib tests.
>
>
> On Tue, Jan 24, 2017 at 6:02 PM, bradc  wrote:
>
>> I believe one of the higher-level goals of Spark MLlib should be to
>> improve the efficiency of the ML algorithms that already exist. Currently
>> ML has reasonable coverage of the important core algorithms. The work to
>> get to feature parity for the DataFrame-based API, and model persistence,
>> is also important.
>>
>> Apache Spark needs to use higher-level BLAS3 and LAPACK routines instead
>> of BLAS1 & BLAS2. For a long time we've used the concept of compute
>> intensity (compute_intensity = FP_operations/Word) to help look at the
>> performance of the underlying compute kernels (see the papers referenced
>> below). It has been shown in many implementations that better performance,
>> scalability, and a huge reduction in memory pressure can be achieved by
>> using higher-level BLAS3 or LAPACK routines, in both single-node and
>> distributed computations.
>>
>> I performed a survey of some of Apache Spark's ML algorithms.
>> Unfortunately most of the ML algorithms are implemented with BLAS1 or BLAS2
>> routines which have very low compute intensity. BLAS2 and BLAS1 routines
>> require a lot more memory bandwidth and will not achieve peak performance
>> on x86, GPUs, or any other processor.
>>
>> Apache Spark 2.1.0 ML routines & BLAS Routines
>>
>> ALS (Alternating Least Squares) matrix factorization
>>
>>- BLAS2: _SPR, _TPSV
>>- BLAS1: _AXPY, _DOT, _SCAL, _NRM2
>>
>> Logistic regression classification
>>
>>- BLAS2: _GEMV
>>- BLAS1: _DOT, _SCAL
>>
>> Generalized linear regression
>>
>>- BLAS1: _DOT
>>
>> Gradient-boosted tree regression
>>
>>- BLAS1: _DOT
>>
>> GraphX SVD++
>>
>>- BLAS1: _AXPY, _DOT,_SCAL
>>
>> Neural Net Multi-layer Perceptron
>>
>>- BLAS3: _GEMM
>>- BLAS2: _GEMV
>>
>> Only the Neural Net Multi-layer Perceptron uses BLAS3 matrix multiply
>> (DGEMM). BTW, the underscores are replaced by S, D, Z, or C for 32-bit
>> real, 64-bit real, 64-bit complex, and 32-bit complex operations,
>> respectively.
>>
>> Refactoring the algorithms to use BLAS3 routines or higher level LAPACK
>> routines will require coding changes to use sub-block algorithms but the
>> performance benefits can be great.
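>>
>> As a rough, self-contained illustration of the compute-intensity point
>> (plain NumPy rather than Spark, with arbitrary sizes), both snippets below
>> compute the same Gram matrix X^T X; the first uses one rank-1 update per
>> row (BLAS1/BLAS2 style), the second a single blocked matrix-matrix
>> multiply (BLAS3/DGEMM style):
>>
>> import numpy as np
>>
>> n, d = 20000, 200
>> X = np.random.rand(n, d)
>>
>> # BLAS1/BLAS2 style: one outer-product (rank-1) update per row.
>> # Low compute intensity; dominated by memory traffic.
>> gram_rank1 = np.zeros((d, d))
>> for row in X:
>>     gram_rank1 += np.outer(row, row)
>>
>> # BLAS3 style: a single matrix-matrix multiply over the whole block.
>> # Much higher compute intensity per word of data moved.
>> gram_gemm = X.T.dot(X)
>>
>> assert np.allclose(gram_rank1, gram_gemm)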
>>
>> More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml
>> Background:
>>
>> Brad Carlile. Parallelism, compute intensity, and data vectorization.
>> SuperComputing'93, November 1993.
>> 
>>
>> John McCalpin. Memory Bandwidth and Machine Balance in Current High
>> Performance Computers. 1995.
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Re: New to dev community | Contribution to Mlib

2017-09-20 Thread Seth Hendrickson
I'm not exactly clear on what you're proposing, but this sounds like
something that would live as a Spark package - a framework for anomaly
detection built on Spark. If there is some specific algorithm you have in
mind, it would be good to propose it on JIRA and discuss why you think it
needs to be included in Spark and not live as a Spark package.

In general, there will probably be resistance to including new algorithms
in Spark ML, especially until the ML package has reached full parity with
MLlib. Still, if you can provide more details, that will help us understand
what is best here.

On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone  wrote:

>
> Hello,
>
> I am new to the dev community of Spark, and to open source in general, but
> have used Spark extensively.
> I want to build out a complete anomaly detection component in Spark MLlib.
> To that end, I want to know if someone could guide me so I can start the
> development and contribute to Spark MLlib.
>
> Sorry for sounding naive if I do, but any help is appreciated.
>
> Cheers!
> -venna
>
>