Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi Chester, thank you for your reply. But I tried this approach and it failed. It seems that using sbt in IntelliJ is more difficult than expected. And according to some references, "# sbt/sbt gen-idea" is not necessary (after Spark-1.0.0?); you can simply import the spark project and Intel

Re: Apache infra github sync down

2014-11-18 Thread Reynold Xin
This basically stops us from merging patches. I'm wondering if it is possible for ASF to give some Spark committers write permission to the github repo. In that case, if the sync tool is down, we can manually push periodically. On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell wrote: > Hey All, > >

Apache infra github sync down

2014-11-18 Thread Patrick Wendell
Hey All, The Apache-->github mirroring is not working right now and hasn't been working for more than 24 hours. This means that pull requests will not appear as closed even though they have been merged. It also causes diffs to display incorrectly in some cases. If you'd like to follow progress by A

Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chester @work
For sbt, you can simply run sbt/sbt gen-idea to generate the IntelliJ IDEA project module for you. You can then just open the generated project, which includes all the needed dependencies. Sent from my iPhone > On Nov 18, 2014, at 8:26 PM, Chen He wrote: > > Thank you Yiming. It is helpful.

Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chen He
Thank you Yiming. It is helpful. Regards! Chen On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang wrote: > Hi, > > > > I noticed it is hard to find a thorough introduction to using IntelliJ to > debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for > beginners. So I spent sever

Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi, I noticed it is hard to find a thorough introduction to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for beginners. So I spent several days to figure it out and hope that it would be helpful for beginners like me and that professionals can help me improv

Re: Quantile regression in tree models

2014-11-18 Thread Manish Amde
Hi Alex, Here is the ticket for refining tree predictions. Let's discuss this further on the JIRA. https://issues.apache.org/jira/browse/SPARK-4240 There is no ticket yet for quantile regression. It would be great if you could create one and note down the corresponding loss function and gradient c

Re: Implementing TinkerPop on top of GraphX

2014-11-18 Thread Kyle Ellrott
The new Tinkerpop3 API was different enough from V2 that it was worth starting a new implementation rather than trying to completely refactor my old code. I've started a new project: https://github.com/kellrott/spark-gremlin which compiles and runs the first set of unit tests (which it completely

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
If all users are equally important, then the average score should be representative. You shouldn't worry about missing one or two. For stratified sampling, Wikipedia has a paragraph about its disadvantages: http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages It depends on the size of th

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
For the mllib PR, I will add this logic: "If a user is missing in training and appears in test, we can simply ignore it." I was struggling since users appear in test that the model was not trained on... For our internal tests we want to cross validate on every product / user as all of them are eq

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should not be many strata. My question is why we need to split on each user's ratings. If a user is missing in trai
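
To make the distinction in the message above concrete, here is a minimal plain-Python sketch of the two sampling styles. It does not call Spark: the helper names `bernoulli_sample_by_key` and `exact_sample_by_key` and the toy `ratings` data are hypothetical, and the `ceil(fraction * count)` rule for the exact variant is an assumption about how an exact per-stratum sample is usually sized.

```python
import math
import random
from collections import defaultdict


def bernoulli_sample_by_key(pairs, fractions, seed=0):
    """Keep each (key, value) pair independently with probability fractions[key].

    Per-key counts are only correct in expectation, so a small stratum
    can end up with no sampled pairs at all.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


def exact_sample_by_key(pairs, fractions, seed=0):
    """Keep exactly ceil(fractions[key] * count(key)) pairs for every key."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for k, v in pairs:
        by_key[k].append(v)
    sample = []
    for k, values in by_key.items():
        n = math.ceil(fractions[k] * len(values))
        # random.sample draws n distinct elements without replacement.
        sample.extend((k, v) for v in rng.sample(values, n))
    return sample


# Hypothetical data: user "u1" has 10 ratings, user "u2" has only 3.
ratings = [("u1", i) for i in range(10)] + [("u2", i) for i in range(3)]
fractions = {"u1": 0.5, "u2": 0.5}

approx = bernoulli_sample_by_key(ratings, fractions)
exact = exact_sample_by_key(ratings, fractions)
```

With one shared fraction, the Bernoulli variant never looks at the key when deciding, which is why it behaves like a plain uniform sample. The exact variant needs per-key counts, which hints at why having very many strata is costly.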

Re: Quantile regression in tree models

2014-11-18 Thread Alessandro Baretta
Manish, My use case for (asymmetric) absolute error is quite trivially quantile regression. In other words, I want to use Spark to learn conditional cumulative distribution functions. See R's GBM quantile regression option. If you either find or create a Jira ticket, I would be happy to give it a
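
As a minimal sketch of why an asymmetric absolute error yields quantile regression, the plain-Python example below uses the standard pinball (quantile) loss. The function names and the toy data are hypothetical, and this is not MLlib or GBM code.

```python
def pinball_loss(y, prediction, tau):
    """Asymmetric absolute error for quantile level tau in (0, 1).

    Under-prediction (y > prediction) is weighted by tau,
    over-prediction by (1 - tau).
    """
    diff = y - prediction
    return tau * diff if diff >= 0 else (tau - 1) * diff


def total_loss(data, prediction, tau):
    """Sum of pinball losses of one constant prediction over the data."""
    return sum(pinball_loss(y, prediction, tau) for y in data)


def best_constant(data, tau):
    """Pick the data point that minimizes the total pinball loss."""
    return min(data, key=lambda c: total_loss(data, c, tau))


# tau = 0.5 gives half the absolute error, so the minimizer is the median.
median_like = best_constant([1, 2, 3, 4, 100], tau=0.5)

# tau = 0.75 pushes the minimizer up toward the 75th percentile.
upper_quartile_like = best_constant([1, 2, 3, 4, 5], tau=0.75)
```

A tree leaf that minimizes this loss would predict a conditional quantile rather than a conditional mean, which is the behaviour the message asks for.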

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
Sean, I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key... Xiangrui, What's the expected behavior of sampleByKey? In the dataset sampled using sampleByKey the keys should match the input dataset keys, right? If it is a bug,

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-18 Thread Ashutosh
Hi Anant, I have removed the counter and all possible side effects. Now I think we can go ahead with the testing. I have created another folder for testing. I will add you as a collaborator on github. _Ashutosh From: slcclimber [via Apache Spark Developers L

Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely produces disjoint data sets and is efficient. The problem is you can't do it by key. I am not sure why your subtract does not work. I suspect it is because the values do not partition the same way, or they don't evaluate equali
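
The idea of producing train/CV/test sets in one pass can be sketched in plain Python as below. This illustrates the technique only and does not call Spark: the helper `random_split` and the 60/20/20 weights are hypothetical. Each element gets a single random draw and lands in exactly one bucket, which is what makes the splits disjoint, and the key is never consulted, which is why such a split cannot be done by key.

```python
import random
from itertools import accumulate


def random_split(items, weights, seed=0):
    """Assign every item to exactly one split, chosen with the given weights."""
    rng = random.Random(seed)
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    # Normalized upper bounds; the last one is exactly 1.0.
    bounds = [c / total for c in cumulative]
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0.0, 1.0)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(item)
                break
    return splits


train, cv, test = random_split(range(1000), [0.6, 0.2, 0.2])
```

Split sizes are only approximately proportional to the weights, since each assignment is an independent random draw.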