Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-20 Thread slcclimber
hat you can add code in it. Thanks, Ashutosh From: slcclimber [via Apache Spark Developers List] <[hidden email]> Sent: Thurs

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-20 Thread Joseph Bradley
add code in it. Thanks, Ashutosh From: slcclimber [via Apache Spark Developers List] <ml-node+s1001551n9441...@n3.nabble.com> Sent: Thursday, November 20, 2014 7:49 AM To: Ashutosh Trivedi (MT2013030)

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-19 Thread Ashutosh
Algorithm for Outlier Detection You could also use rdd.zipWithIndex() to create indexes. Anant

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-19 Thread slcclimber
You could also use rdd.zipWithIndex() to create indexes. Anant
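
A minimal sketch of the zipWithIndex suggestion above, assuming Spark core is on the classpath and a local master is acceptable for a quick check. The names ZipWithIndexSketch and indexRows are hypothetical and do not come from the thread. RDD.zipWithIndex() returns pairs of (element, Long index), so no shared counter across nodes is needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch; ZipWithIndexSketch and indexRows are hypothetical names.
object ZipWithIndexSketch {

  /** Pairs each row with a Long index using RDD.zipWithIndex, on a local SparkContext. */
  def indexRows(rows: Seq[String]): Seq[(Long, String)] = {
    val conf = new SparkConf().setAppName("zipWithIndex-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(rows, 2)
      // zipWithIndex yields (element, index); indices follow partition order,
      // then element order within each partition.
      rdd.zipWithIndex().map { case (row, idx) => (idx, row) }.collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    indexRows(Seq("a,x", "a,y", "b,z")).foreach(println)
}
```

Note that zipWithIndex runs an extra Spark job when the RDD has more than one partition, because it first needs the element count of each partition.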

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-18 Thread Ashutosh
List] Sent: Monday, November 17, 2014 10:45 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Ashutosh, The counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-16 Thread slcclimber
Ashutosh, The counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would be to use something along these lines: val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size) val rddWithIndex = rdd.z

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
} From: Meethu Mathew-2 [via Apache Spark Developers List] Sent: Friday, November 14, 2014 11:42 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Hi, I have a doubt regarding the input to your algorithm.

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Mayur, Libsvm format sounds good to me. I could work on writing the tests if that helps you? Anant On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hid

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
opers List] Sent: Tuesday, November 11, 2014 11:46 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Mayur, Libsvm format sounds good to me. I could work on writing the tests if that helps you? Anant On Nov 11, 2014 11:06 AM, "Ashut

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
M To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Mayur, Libsvm format sounds good to me. I could work on writing the tests if that helps you? Anant On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden e

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-11 Thread Ashutosh
Sent: Saturday, November 8, 2014 12:52 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection > We should take a vector instead, giving the user flexibility to decide data source/type What do you mean

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-11 Thread slcclimber
rom:* Mayur Rustagi [via Apache Spark Developers List] <[hidden email]> *Sent:* Saturday, November 8, 2014 12:52 PM *To:* Ashutosh Trivedi (MT2013030) *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detectio

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-11 Thread Ashutosh
Sent: Saturday, November 8, 2014 12:52 PM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection > We should take a vector instead, giving the user flexibility to decide data source/type What do you mean by vector datatype exactly? Mayur

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-07 Thread Mayur Rustagi
wrote: Okay. I'll try it and post it soon with a test case. After that I think we can go ahead with the PR. -- *From:* slcclimber [via Apache Spark Developers List]

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-04 Thread slcclimber
*From:* slcclimber [via Apache Spark Developers List] <[hidden email]> *Sent:* Friday, October 31, 2014 10:09 AM *To:* Ashutosh Trivedi (MT2013030) *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-04 Thread Ashutosh
Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection You should create a jira ticket to go with it as well. Thanks On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote: Okay. I'll try it a

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread slcclimber
Ashutosh, A vector would be a good idea; vectors are used very frequently. Test data is usually stored in the spark/data/mllib folder. On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <ml-node+s1001551n9034...@n3.nabble.com> wrote: Hi Anant, sorry for my late reply. Than

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
go ahead with the PR. From: slcclimber [via Apache Spark Developers List] <[hidden email]> Sent: Friday, October 31, 2014 10:03 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection Ashu

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
uting Algorithm for Outlier Detection Ashutosh, A vector would be a good idea; vectors are used very frequently. Test data is usually stored in the spark/data/mllib folder. On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote: Hi Anant, so

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
Hi Anant, sorry for my late reply. Thank you for taking the time to review it. I have a few comments on the first issue. You are correct on the string (CSV) part. But we cannot take input of the type you mentioned. We calculate frequency in our function; otherwise the user has to do all this computation. I r

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-28 Thread slcclimber
Ashu, There is one main issue and a few stylistic/grammatical things I noticed. 1> You take an RDD of type String, which you expect to be comma separated. This limits usability, since the user will have to convert their RDD to that format only for you to split it on string. It would make more sens

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-28 Thread Ashutosh
Hi Anant, Thank you for reviewing and helping us out. Please find the following link where you can see the initial code. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala The input file for the code should be in csv format. We have provided a data

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-24 Thread Ashutosh
Hi, We are ready with the initial code. Where can I submit it for review? I want to get it reviewed before testing it at scale. Also, I see that most of the algorithms take data as RDD[LabeledPoint]. How should we take input for this, since there are no labels? Can anybody help me out with thes

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-21 Thread Ashutosh
Hi Xiangrui, Thanks for the reply. AVF is not so difficult to implement in parallel. It just calculates the frequency of each attribute and calculates the overall 'score' of the data point. Low-score points are considered outliers. One advantage of it is that it does not calculate distance, so in that
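
The scoring described above can be sketched in plain Scala collections. This is a hedged illustration, not the code from the linked repository: it assumes the score of a data point is the mean frequency of its attribute values, counted per column, and the names AvfSketch and avfScores are hypothetical.

```scala
// Illustrative sketch only; AvfSketch and avfScores are hypothetical names,
// not taken from the code linked in this thread.
object AvfSketch {

  /** Returns one score per row: the mean frequency of the row's attribute values. */
  def avfScores(rows: Seq[Vector[String]]): Seq[Double] = {
    // Count occurrences of each (column index, value) pair across the data set.
    val freq: Map[(Int, String), Int] =
      rows
        .flatMap(row => row.zipWithIndex.map { case (value, col) => (col, value) })
        .groupBy(identity)
        .map { case (key, hits) => key -> hits.size }

    // Average the frequencies of a row's values; assumes every row is non-empty.
    rows.map { row =>
      val total = row.zipWithIndex.map { case (value, col) => freq((col, value)) }.sum
      total.toDouble / row.size
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Vector("a", "x"), Vector("a", "x"), Vector("a", "y"), Vector("b", "z"))
    // Print rows from lowest score (most outlier-like) to highest.
    data.zip(avfScores(data)).sortBy(_._2).foreach(println)
  }
}
```

In the sample data the row ("b", "z") gets the lowest score, because both of its values occur only once. On an RDD the same two passes (count value frequencies, then score each row) would presumably map onto a reduceByKey followed by a map, but that mapping is an assumption here.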

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-21 Thread Xiangrui Meng
Hi Ashutosh, The process you described is correct, with details documented in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . There is no outlier detection algorithm in MLlib. Before you start coding, please open a JIRA and let's discuss which algorithms are appropriate