Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread slcclimber
Ashutosh, a vector would be a good idea; vectors are used very frequently. Test data is usually stored in the spark/data/mllib folder. On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <ml-node+s1001551n9034...@n3.nabble.com> wrote: > Hi Anant, > sorry for my late reply. Than

Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Sean, re my point earlier do you know a more efficient way to compute top k for each user, other than to broadcast the item factors?  (I guess one can use the new asymmetric lsh paper perhaps to assist) — Sent from Mailbox On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen wrote: > MAP is effectiv

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
Already done. Here is the link: https://issues.apache.org/jira/browse/SPARK-4038 From: slcclimber [via Apache Spark Developers List] Sent: Friday, October 31, 2014 10:09 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Algorithm for Outlier

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
Okay. I'll try it and post it soon with a test case. After that I think we can go ahead with the PR. From: slcclimber [via Apache Spark Developers List] Sent: Friday, October 31, 2014 10:03 AM To: Ashutosh Trivedi (MT2013030) Subject: Re: [MLlib] Contributing Alg

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread Ashutosh
Hi Anant, sorry for my late reply. Thank you for taking the time to review it. I have a few comments on the first issue. You are correct on the string (csv) part. But we cannot take input of the type you mentioned. We calculate the frequency in our function; otherwise the user has to do all this computation. I r

Re: matrix factorization cross validation

2014-10-30 Thread Sean Owen
MAP is effectively an average over all k from 1 to min(# recommendations, # items rated). Getting the first recommendations right is more important than the last. On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das wrote: > Does it make sense to have a user specific K or K is considered same over > all use
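
To make the description of MAP above concrete, here is a minimal pure-Python sketch of average precision for one user and its mean over users. The function names are hypothetical, the predicted list is assumed to contain no duplicates, and the normalizer min(# recommendations, # items rated) is an assumption chosen to mirror the wording above; libraries such as MLlib's RankingMetrics may normalize differently.

```python
def average_precision(predicted, relevant):
    """Average precision for one user (illustrative sketch, not MLlib's code).

    predicted: ranked list of unique item ids, best first.
    relevant:  collection of item ids the user actually rated.
    """
    relevant = set(relevant)
    if not predicted or not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            # Precision at this cut-off, counted only at positions that hit.
            precision_sum += hits / rank
    # Assumed normalizer: min(# recommendations, # items rated).
    return precision_sum / min(len(predicted), len(relevant))


def mean_average_precision(per_user):
    """per_user: iterable of (predicted, relevant) pairs, one per user."""
    scores = [average_precision(p, r) for p, r in per_user]
    return sum(scores) / len(scores) if scores else 0.0
```

With relevant items {1, 3}, the ranking [1, 2, 3] scores higher than [2, 1, 3], which is the sense in which early hits matter more than late ones.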

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
Does it make sense to have a user-specific K, or is K considered the same over all users? Intuitively, the users who watch more movies should get a higher K than the others... On Thu, Oct 30, 2014 at 2:15 PM, Sean Owen wrote: > The pretty standard metric for recommenders is mean average precision,

Re: matrix factorization cross validation

2014-10-30 Thread Sean Owen
The pretty standard metric for recommenders is mean average precision, and RankingMetrics will already do that as-is. I don't know that a confusion matrix for this binary classification does much. On Thu, Oct 30, 2014 at 9:41 PM, Debasish Das wrote: > I am working on it...I will open up a JIRA o

Registering custom metrics

2014-10-30 Thread Gerard Maas
Hi, I've been exploring the metrics exposed by Spark and I'm wondering whether there's a way to register job-specific metrics that could be exposed through the existing metrics system. Would there be an example somewhere? BTW, documentation about how the metrics work could be improved. I found

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
I am working on it... I will open up a JIRA once I see some results. The idea is to come up with a train/test split based on users: basically, for each user, we come up with 80% train data and 20% test data... Now we pick up a K (each user should have a different K based on the movies he watched so som
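
The per-user 80%/20% split described above could be sketched as follows. This is a plain-Python illustration on an in-memory list, not the Spark implementation under discussion; the function name, the (user, item, rating) tuple layout, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_per_user(ratings, train_fraction=0.8, seed=42):
    """Split (user, item, rating) tuples so each user keeps roughly
    train_fraction of their own ratings in the training set."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for user in sorted(by_user):  # assumes user ids are mutually comparable
        rows = by_user[user]
        rng.shuffle(rows)
        cut = int(len(rows) * train_fraction)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test
```

Splitting within each user, rather than over the whole rating set, keeps every user with enough ratings present on both sides, which a ranking metric computed per user requires.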

Re: How to run tests properly?

2014-10-30 Thread Patrick Wendell
Some of our tests actually require spinning up a small multi-process Spark cluster. These use the normal deployment code path for Spark, which relies on the Spark "assembly jar" being present. That jar is generated when you run "mvn package" via a special sub project called assembly in our b

Re: best IDE for scala + spark development?

2014-10-30 Thread nm3mon
Multiline support (much shinier than :paste), smart completion and things that an IDE makes easy or better (without any hassle). In particular, fast switching between REPL and editor while staying in the same screen makes me even more productive. Nabeel > On Oct 30, 2014, at 9:39 AM, Stephen

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
I thought topK will save us...for each user we have 1xrank...now our movie factor is an RDD...we pick topK movie factors based on vector norm...with K = 50, we will have 50 vectors * num_executors in an RDD...with the user 1xrank we do a distributed dot product using RowMatrix APIs... Maybe we can'

Re: best IDE for scala + spark development?

2014-10-30 Thread Stephen Boesch
Hi Nabeel, in what ways is the IJ version of the Scala REPL enhanced? Thx! 2014-10-30 3:41 GMT-07:00 : > IntelliJ idea scala plugin comes with an enhanced REPL. It's a pretty > decent option too. > > Nabeel > > > On Oct 28, 2014, at 5:34 AM, Cheng Lian wrote: > > > > My two cents for Mac Vim/Emac

Re: How to run tests properly?

2014-10-30 Thread Sean Owen
You are right that this is a bit weird compared to the Maven lifecycle semantics. Maven wants assembly to come after tests but here tests want to launch the final assembly as part of some tests. Yes you would not normally have to do this in 2 stages. On Oct 30, 2014 12:28 PM, "Niklas Wilcke" <1wil.

Re: How to run tests properly?

2014-10-30 Thread Niklas Wilcke
Can you please briefly explain why packaging is necessary? I thought packaging would only build the jar and place it in the target folder. How does that affect the tests? If tests depend on the assembly, a "mvn install" would be more sensible to me. Probably I misunderstand the maven build life-cycl

Re: best IDE for scala + spark development?

2014-10-30 Thread nm3mon
The IntelliJ IDEA Scala plugin comes with an enhanced REPL. It's a pretty decent option too. Nabeel > On Oct 28, 2014, at 5:34 AM, Cheng Lian wrote: > > My two cents for Mac Vim/Emacs users. Fixed a Scala ctags Mac compatibility > bug months ago, and you may want to use the most recent version he

Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Looking at https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82 For each user in test set, you generate an Array of top K predicted item ids (Int or String probably), and an Array of ground tru
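
As a rough illustration of the input shape described above, the sketch below builds one (top-K predicted item ids, ground-truth item ids) pair per test user from dense factor vectors. It is plain Python over in-memory dicts, not Spark code; the function names and the dict-based layout of user_factors, item_factors and test_items are assumptions made for the example. Scoring every item for every user is the brute-force approach and is only reasonable when the item factors fit in memory.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_items(user_vec, item_factors, k):
    """Ids of the k items with the highest predicted score, best first."""
    return heapq.nlargest(
        k, item_factors, key=lambda item: dot(user_vec, item_factors[item])
    )


def prediction_and_labels(user_factors, item_factors, test_items, k):
    """One (predicted top-k ids, held-out ids) pair per user in the test set."""
    return [
        (top_k_items(user_factors[user], item_factors, k), list(held_out))
        for user, held_out in test_items.items()
    ]
```

A list of such pairs is the kind of per-user input a ranking metric like mean average precision consumes.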