Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely
produces disjoint data sets and is efficient. The problem is you can't
do it by key.
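
For example (a rough sketch; `ratings` here stands for whatever RDD you start
from, and randomSplit normalizes the weights internally):

  // one pass, disjoint splits, reproducible via the seed
  val Array(train, cv, test) = ratings.randomSplit(Array(0.8, 0.1, 0.1), seed = 1L)

The split is per-element rather than per-key, which is the limitation I mean.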

I am not sure why your subtract does not work. I suspect it is because
the values do not partition the same way, or they don't evaluate
equality in the expected way, but I don't see any reason why. Tuples
work as expected here.

On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das  wrote:
> Hi,
>
> I have a rdd whose key is a userId and value is (movieId, rating)...
>
> I want to sample 80% of the (movieId,rating) that each userId has seen for
> train, rest is for test...
>
> val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>
> val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>
> val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>
> val keyedTest = keyedRatings.subtract(keyedTraining)
>
> blocks = sc.maxParallelism
>
> println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>
> println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>
> println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>
> My expectation was that the printlns would produce the same number of keys for
> keyedRatings, keyedTraining and keyedTest but this is not the case...
>
> On MovieLens for example I am noticing the following:
>
> Rating keys 3706
>
> Training keys 3676
>
> Test keys 3470
>
> I also tried sampleByKey as follows:
>
> val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>
> val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
>
> val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>
> val keyedTest = keyedRatings.subtract(keyedTraining)
>
> Still I get the results as:
>
> Rating keys 3706
>
> Training keys 3682
>
> Test keys 3459
>
> Any idea what's wrong here...
>
> Are my assumptions about behavior of sample/sampleByKey on a key-value RDD
> correct ? If this is a bug I can dig deeper...
>
> Thanks.
>
> Deb




Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-18 Thread Ashutosh
Hi Anant,


I have removed the counter and all possible side effects. Now I think we can go 
ahead with the testing. I have created another folder for testing. I will add 
you as a collaborator on GitHub.


_Ashutosh


From: slcclimber [via Apache Spark Developers List] 

Sent: Monday, November 17, 2014 10:45 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes are 
used, especially over massive datasets.
A better approach would be to use something along these lines:

val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
val rddWithIndex = rdd.zip(index)
This zips the two RDDs in a parallelizable fashion.
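
(If you are on a recent enough Spark, I believe there is also a built-in that
avoids the separate parallelize-and-zip step -- roughly:

  val rddWithIndex = rdd.zipWithIndex()   // RDD[(T, Long)]; indices derived from per-partition counts

but please double-check that it exists in the Spark version you target.)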








Re: Using sampleByKey

2014-11-18 Thread Debasish Das
Sean,

I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
the problem that randomSplit can't sample by key...

Xiangrui,

What's the expected behavior of sampleByKey ? In the dataset sampled using
sampleByKey the keys should match the input dataset keys right ? If it is a
bug, I can open up a JIRA and look into it...

Thanks.
Deb

On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen  wrote:

> I use randomSplit to make a train/CV/test set in one go. It definitely
> produces disjoint data sets and is efficient. The problem is you can't
> do it by key.
>
> I am not sure why your subtract does not work. I suspect it is because
> the values do not partition the same way, or they don't evaluate
> equality in the expected way, but I don't see any reason why. Tuples
> work as expected here.
>
> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das 
> wrote:
> > Hi,
> >
> > I have a rdd whose key is a userId and value is (movieId, rating)...
> >
> > I want to sample 80% of the (movieId,rating) that each userId has seen
> for
> > train, rest is for test...
> >
> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
> x.rating))}
> >
> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > blocks = sc.maxParallelism
> >
> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
> >
> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
> >
> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
> >
> > My expectation was that the println will produce exact number of keys for
> > keyedRatings, keyedTraining and keyedTest but this is not the case...
> >
> > On MovieLens for example I am noticing the following:
> >
> > Rating keys 3706
> >
> > Training keys 3676
> >
> > Test keys 3470
> >
> > I also tried sampleByKey as follows:
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
> x.rating))}
> >
> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
> >
> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > Still I get the results as:
> >
> > Rating keys 3706
> >
> > Training keys 3682
> >
> > Test keys 3459
> >
> > Any idea what's wrong here...
> >
> > Are my assumptions about behavior of sample/sampleByKey on a key-value
> RDD
> > correct ? If this is a bug I can dig deeper...
> >
> > Thanks.
> >
> > Deb
>


Re: Quantile regression in tree models

2014-11-18 Thread Alessandro Baretta
Manish,

My use case for (asymmetric) absolute error is quite trivially quantile
regression. In other words, I want to use Spark to learn conditional
cumulative distribution functions. See R's GBM quantile regression option.
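
Concretely, for a target quantile tau in (0, 1) the loss I have in mind is the
usual pinball loss, in LaTeX notation:

  L_\tau(y, \hat{y}) =
    \begin{cases}
      \tau \, (y - \hat{y})       & \text{if } y \ge \hat{y} \\
      (1 - \tau) \, (\hat{y} - y) & \text{if } y < \hat{y}
    \end{cases}

so the negative gradient (the pseudo-residual used to re-label instances) is
simply tau when y > yhat and tau - 1 otherwise; tau = 0.5 recovers absolute
error up to a constant factor.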

If you either find or create a Jira ticket, I would be happy to give it a
shot. Is there a design doc explaining how the gradient boosting algorithm
is laid out in MLLib? I tried reading the code, but without a "Rosetta
stone" it's impossible to make sense of it.

Alex

On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde  wrote:

> Hi Alessandro,
>
> I think absolute error as splitting criterion might be feasible with the
> current architecture -- I think the sufficient statistics we collect
> currently might be able to support this. Could you let us know scenarios
> where absolute error has significantly outperformed squared error for
> regression trees? Also, what's your use case that makes squared error
> undesirable.
>
> For gradient boosting, you are correct. The weak hypothesis weights refer
> to tree predictions in each of the branches. We plan to explain this in
> the 1.2 documentation and may be add some more clarifications to the
> Javadoc.
>
> I will try to search for JIRAs or create new ones and update this thread.
>
> -Manish
>
>
> On Monday, November 17, 2014, Alessandro Baretta 
> wrote:
>
>> Manish,
>>
>> Thanks for pointing me to the relevant docs. It is unfortunate that
>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>
>>> Now, here's what the comments say in the current master branch:
>> /**
>>  * :: Experimental ::
>>  * A class that implements Stochastic Gradient Boosting
>>  * for regression and binary classification problems.
>>  *
>>  * The implementation is based upon:
>>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>>  *
>>  * Notes:
>>  *  - This currently can be run with several loss functions.  However,
>> only SquaredError is
>>  *fully supported.  Specifically, the loss function should be used to
>> compute the gradient
>>  *(to re-label training instances on each iteration) and to weight
>> weak hypotheses.
>>  *Currently, gradients are computed correctly for the available loss
>> functions,
>>  *but weak hypothesis weights are not computed correctly for LogLoss
>> or AbsoluteError.
>>  *Running with those losses will likely behave reasonably, but lacks
>> the same guarantees.
>> ...
>> */
>>
>> By the looks of it, the GradientBoosting API would support an absolute
>> error type loss function to perform quantile regression, except for "weak
>> hypothesis weights". Does this refer to the weights of the leaves of the
>> trees?
>>
>> Alex
>>
>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde  wrote:
>>
>>> Hi Alessandro,
>>>
>>> MLlib v1.1 supports variance for regression and gini impurity and
>>> entropy for classification.
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> If the information gain calculation can be performed by distributed
>>> aggregation then it might be possible to plug it into the existing
>>> implementation. We want to perform such calculations (for e.g. median) for
>>> the gradient boosting models (coming up in the 1.2 release) using absolute
>>> error and deviance as loss functions but I don't think anyone is planning
>>> to work on it yet. :-)
>>>
>>> -Manish
>>>
>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
 I see that, as of v. 1.1, MLLib supports regression and classification
 tree
 models. I assume this means that it uses a squared-error loss function
 for
 the first and logistic cost function for the second. I don't see support
 for quantile regression via an absolute error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend to implement it?

 Alex

>>>
>>>
>>


Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as
`sample`. The operation you want is perhaps `sampleByKeyExact` here.
However, when you use stratified sampling, there should not be many
strata. My question is why we need to split on each user's ratings. If
a user is missing in training and appears in test, we can simply
ignore it. -Xiangrui
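
(To be concrete, the change to your snippet would be something along the
lines of

  val keyedTraining = keyedRatings.sampleByKeyExact(false, fractions, 1L)

with the same fractions map and seed; note that the exact variant makes
additional passes over the data to hit the per-key fractions, so it is more
expensive than sampleByKey.)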

On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das  wrote:
> Sean,
>
> I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
> the problem that randomSplit can't sample by key...
>
> Xiangrui,
>
> What's the expected behavior of sampleByKey ? In the dataset sampled using
> sampleByKey the keys should match the input dataset keys right ? If it is a
> bug, I can open up a JIRA and look into it...
>
> Thanks.
> Deb
>
> On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen  wrote:
>
>> I use randomSplit to make a train/CV/test set in one go. It definitely
>> produces disjoint data sets and is efficient. The problem is you can't
>> do it by key.
>>
>> I am not sure why your subtract does not work. I suspect it is because
>> the values do not partition the same way, or they don't evaluate
>> equality in the expected way, but I don't see any reason why. Tuples
>> work as expected here.
>>
>> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > I have a rdd whose key is a userId and value is (movieId, rating)...
>> >
>> > I want to sample 80% of the (movieId,rating) that each userId has seen
>> for
>> > train, rest is for test...
>> >
>> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>> >
>> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
>> x.rating))}
>> >
>> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>> >
>> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >
>> > blocks = sc.maxParallelism
>> >
>> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>> >
>> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>> >
>> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>> >
>> > My expectation was that the println will produce exact number of keys for
>> > keyedRatings, keyedTraining and keyedTest but this is not the case...
>> >
>> > On MovieLens for example I am noticing the following:
>> >
>> > Rating keys 3706
>> >
>> > Training keys 3676
>> >
>> > Test keys 3470
>> >
>> > I also tried sampleByKey as follows:
>> >
>> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
>> x.rating))}
>> >
>> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
>> >
>> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>> >
>> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >
>> > Still I get the results as:
>> >
>> > Rating keys 3706
>> >
>> > Training keys 3682
>> >
>> > Test keys 3459
>> >
>> > Any idea what's wrong here...
>> >
>> > Are my assumptions about behavior of sample/sampleByKey on a key-value
>> RDD
>> > correct ? If this is a bug I can dig deeper...
>> >
>> > Thanks.
>> >
>> > Deb
>>




Re: Using sampleByKey

2014-11-18 Thread Debasish Das
For mllib PR, I will add this logic: "If a user is missing in training and
appears in test, we can simply ignore it."

I was struggling because users appeared in the test set that the model was
not trained on...

For our internal tests we want to cross validate on every product / user as
all of them are equally important and so I have to come up with a sampling
strategy for every user/product...

In general for stratified sampling what's the bound on strata ? Like number
of classes in a labeled dataset ~ 100 ?

On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng  wrote:

> `sampleByKey` with the same fraction per stratum acts the same as
> `sample`. The operation you want is perhaps `sampleByKeyExact` here.
> However, when you use stratified sampling, there should not be many
> strata. My question is why we need to split on each user's ratings. If
> a user is missing in training and appears in test, we can simply
> ignore it. -Xiangrui
>
> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das 
> wrote:
> > Sean,
> >
> > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
> > the problem that randomSplit can't sample by key...
> >
> > Xiangrui,
> >
> > What's the expected behavior of sampleByKey ? In the dataset sampled
> using
> > sampleByKey the keys should match the input dataset keys right ? If it
> is a
> > bug, I can open up a JIRA and look into it...
> >
> > Thanks.
> > Deb
> >
> > On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen  wrote:
> >
> >> I use randomSplit to make a train/CV/test set in one go. It definitely
> >> produces disjoint data sets and is efficient. The problem is you can't
> >> do it by key.
> >>
> >> I am not sure why your subtract does not work. I suspect it is because
> >> the values do not partition the same way, or they don't evaluate
> >> equality in the expected way, but I don't see any reason why. Tuples
> >> work as expected here.
> >>
> >> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das  >
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a rdd whose key is a userId and value is (movieId, rating)...
> >> >
> >> > I want to sample 80% of the (movieId,rating) that each userId has seen
> >> for
> >> > train, rest is for test...
> >> >
> >> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
> >> >
> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
> >> x.rating))}
> >> >
> >> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
> >> >
> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >> >
> >> > blocks = sc.maxParallelism
> >> >
> >> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
> >> >
> >> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
> >> >
> >> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
> >> >
> >> > My expectation was that the println will produce exact number of keys
> for
> >> > keyedRatings, keyedTraining and keyedTest but this is not the case...
> >> >
> >> > On MovieLens for example I am noticing the following:
> >> >
> >> > Rating keys 3706
> >> >
> >> > Training keys 3676
> >> >
> >> > Test keys 3470
> >> >
> >> > I also tried sampleByKey as follows:
> >> >
> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
> >> x.rating))}
> >> >
> >> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
> >> >
> >> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
> >> >
> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >> >
> >> > Still I get the results as:
> >> >
> >> > Rating keys 3706
> >> >
> >> > Training keys 3682
> >> >
> >> > Test keys 3459
> >> >
> >> > Any idea what's wrong here...
> >> >
> >> > Are my assumptions about behavior of sample/sampleByKey on a key-value
> >> RDD
> >> > correct ? If this is a bug I can dig deeper...
> >> >
> >> > Thanks.
> >> >
> >> > Deb
> >>
>


Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
If all users are equally important, then the average score should be
representative. You shouldn't worry about missing one or two. For
stratified sampling, Wikipedia has a paragraph about its disadvantages:

http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages

It depends on the size of the population. For example, in the US
Census survey sampling design, there are many (>> 100) strata:

https://www.census.gov/acs/www/Downloads/survey_methodology/Chapter_4_RevisedDec2010.pdf

If you indeed want to do the split per user, you should use groupByKey
and apply reservoir sampling for ratings from each user.
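
A rough sketch of that route (not reservoir sampling proper -- it just
shuffles each user's ratings with a per-key seed and keeps the first ~80%,
which is fine unless some key has a huge number of values):

  val tagged = keyedRatings.groupByKey().flatMap { case (user, iter) =>
    val rnd = new scala.util.Random(user.hashCode)   // deterministic per key
    val items = rnd.shuffle(iter.toIndexedSeq)
    val nTrain = math.ceil(0.8 * items.size).toInt
    items.zipWithIndex.map { case (v, i) => ((user, v), i < nTrain) }
  }.cache()
  val keyedTraining = tagged.filter(_._2).map(_._1)
  val keyedTest     = tagged.filter(x => !x._2).map(_._1)

Here keyedRatings is your keyed pair RDD, whichever side (user or product) you
are splitting on.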

-Xiangrui

On Tue, Nov 18, 2014 at 11:12 AM, Debasish Das  wrote:
> For mllib PR, I will add this logic: "If a user is missing in training and
> appears in test, we can simply ignore it."
>
> I was struggling because users appeared in the test set that the model was
> not trained on...
>
> For our internal tests we want to cross validate on every product / user as
> all of them are equally important and so I have to come up with a sampling
> strategy for every user/product...
>
> In general for stratified sampling what's the bound on strata ? Like number
> of classes in a labeled dataset ~ 100 ?
>
> On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng  wrote:
>>
>> `sampleByKey` with the same fraction per stratum acts the same as
>> `sample`. The operation you want is perhaps `sampleByKeyExact` here.
>> However, when you use stratified sampling, there should not be many
>> strata. My question is why we need to split on each user's ratings. If
>> a user is missing in training and appears in test, we can simply
>> ignore it. -Xiangrui
>>
>> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das 
>> wrote:
>> > Sean,
>> >
>> > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
>> > the problem that randomSplit can't sample by key...
>> >
>> > Xiangrui,
>> >
>> > What's the expected behavior of sampleByKey ? In the dataset sampled
>> > using
>> > sampleByKey the keys should match the input dataset keys right ? If it
>> > is a
>> > bug, I can open up a JIRA and look into it...
>> >
>> > Thanks.
>> > Deb
>> >
>> > On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen  wrote:
>> >
>> >> I use randomSplit to make a train/CV/test set in one go. It definitely
>> >> produces disjoint data sets and is efficient. The problem is you can't
>> >> do it by key.
>> >>
>> >> I am not sure why your subtract does not work. I suspect it is because
>> >> the values do not partition the same way, or they don't evaluate
>> >> equality in the expected way, but I don't see any reason why. Tuples
>> >> work as expected here.
>> >>
>> >> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das
>> >> 
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I have a rdd whose key is a userId and value is (movieId, rating)...
>> >> >
>> >> > I want to sample 80% of the (movieId,rating) that each userId has
>> >> > seen
>> >> for
>> >> > train, rest is for test...
>> >> >
>> >> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
>> >> x.rating))}
>> >> >
>> >> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > blocks = sc.maxParallelism
>> >> >
>> >> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>> >> >
>> >> > My expectation was that the println will produce exact number of keys
>> >> > for
>> >> > keyedRatings, keyedTraining and keyedTest but this is not the case...
>> >> >
>> >> > On MovieLens for example I am noticing the following:
>> >> >
>> >> > Rating keys 3706
>> >> >
>> >> > Training keys 3676
>> >> >
>> >> > Test keys 3470
>> >> >
>> >> > I also tried sampleByKey as follows:
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user,
>> >> x.rating))}
>> >> >
>> >> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
>> >> >
>> >> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > Still I get the results as:
>> >> >
>> >> > Rating keys 3706
>> >> >
>> >> > Training keys 3682
>> >> >
>> >> > Test keys 3459
>> >> >
>> >> > Any idea what's wrong here...
>> >> >
>> >> > Are my assumptions about behavior of sample/sampleByKey on a
>> >> > key-value
>> >> RDD
>> >> > correct ? If this is a bug I can dig deeper...
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Deb
>> >>
>
>




Re: Implementing TinkerPop on top of GraphX

2014-11-18 Thread Kyle Ellrott
The new TinkerPop3 API was different enough from V2 that it was worth
starting a new implementation rather than trying to completely refactor my
old code.
I've started a new project: https://github.com/kellrott/spark-gremlin which
compiles and runs the first set of unit tests (which it completely fails).
Most of the classes are structured in the same way they are in the Giraph
implementation. There isn't much actual GraphX code in the project yet,
just a framework to start working in.
Hopefully this will keep the conversation going.

Kyle

On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta 
wrote:

> I think if we are going to use GraphX as the query engine in Tinkerpop3,
> then the Tinkerpop3 community is the right platform to further the
> discussion.
>
> The reason I asked the question about improving APIs in GraphX is that not
> only Gremlin but any graph DSL can exploit the GraphX APIs. Cypher has some
> good subgraph matching query interfaces which I believe can be distributed
> using GraphX apis.
>
> An edge ID is an internal attribute of the edge generated automatically,
> mostly hidden from the user. That's why adding it as an edge property might
> not be a good idea. There are several little differences like this. E.g. in
> Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
> executed in Giraph directly. The side-effect operators are mapped to
> Map-Reduce functions. In the implementation we are talking about, all of
> these operations can be done within GraphX. I will be interested to
> co-develop the query engine.
>
> @Reynold, I agree. And as I said earlier, the apis should be designed in
> such a way that it can be used in any Graph DSL.
>
> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott 
> wrote:
>
>> Who here would be interested in helping to work on an implementation of
>> the TinkerPop3 Gremlin API for Spark? Is this something that should continue
>> in the Spark discussion group, or should it migrate to the Gremlin message
>> group?
>>
>> Reynold is right that there will be inherent mismatches in the APIs, and
>> there will need to be some discussions with the GraphX group about the best
>> way to go. One example would be edge ids. GraphX has vertex ids, but no
>> explicit edges ids, while Gremlin has both. Edge ids could be put into the
>> attr field, but then that means the user would have to explicitly subclass
>> their edge attribute to the edge attribute interface. Is that worth doing,
>> versus adding an id to everyone's edges?
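>>
>> A third option might be a wrapper rather than a subclass. A rough, untested
>> sketch (GremlinEdgeAttr and withEdgeIds are just made-up names):
>>
>> import scala.reflect.ClassTag
>> import org.apache.spark.graphx._
>>
>> case class GremlinEdgeAttr[ED](id: Long, attr: ED)
>>
>> def withEdgeIds[VD: ClassTag, ED: ClassTag](
>>     g: Graph[VD, ED]): Graph[VD, GremlinEdgeAttr[ED]] = {
>>   // zipWithUniqueId hands every edge a distinct Long without a counting pass
>>   val edges = g.edges.zipWithUniqueId().map { case (e, id) =>
>>     Edge(e.srcId, e.dstId, GremlinEdgeAttr(id, e.attr))
>>   }
>>   Graph(g.vertices, edges)
>> }
>>
>> The user's ED type stays untouched, at the cost of one extra object per edge
>> and ids that are not stable across graph rebuilds.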
>>
>> Kyle
>>
>>
>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin  wrote:
>>
>>> Some form of graph querying support would be great to have. This can be
>>> a great community project hosted outside of Spark initially, both due to
>>> the maturity of the component itself as well as the maturity of query
>>> language standards (there isn't really a dominant standard for graph ql).
>>>
>>> One thing is that GraphX API will need to evolve and probably need to
>>> provide more primitives in order to support the new ql implementation.
>>> There might also be inherent mismatches in the way the external API is
>>> defined vs what GraphX can support. We should discuss those on a
>>> case-by-case basis.
>>>
>>>
>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott 
>>> wrote:
>>>
 I think it's best to look to an existing standard rather than try to make
 your own. Of course small additions would need to be added to make it
 valuable for the Spark community, like a method similar to Gremlin's
 'table' function, that produces an RDD instead.
 But there may be a lot of extra code and data structures that would
 need to be added to make it work, and those may not be directly applicable
 to all GraphX users. I think it would be best run as a separate
 module/project that builds directly on top of GraphX.

 Kyle



 On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
 brennon.y...@capitalone.com> wrote:

> My personal 2c is that, since GraphX is just beginning to provide a
> full featured graph API, I think it would be better to align with the
> TinkerPop group rather than roll our own. In my mind the benefits outweigh
> the detriments as follows:
>
> Benefits:
> * GraphX gains the ability to become another core tenant within the
> TinkerPop community allowing a more diverse group of users into the Spark
> ecosystem.
> * TinkerPop can continue to maintain and own a solid / feature-rich
> graph API that has already been accepted by a wide audience, relieving the
> pressure of “one off” API additions from the GraphX team.
> * GraphX can demonstrate its ability to be a key player in the GraphDB
> space sitting inline with other major distributions (Neo4j, Titan, etc.).
> * Allows for the abstract graph traversal logic (query API) to be
> owned and maintained by a group already proven on the topic.
>
> Drawbacks:
> * GraphX doesn’t own the API

Re: Quantile regression in tree models

2014-11-18 Thread Manish Amde
Hi Alex,

Here is the ticket for refining tree predictions. Let's discuss this
further on the JIRA.
https://issues.apache.org/jira/browse/SPARK-4240

There is no ticket yet for quantile regression. It will be great if you
could create one and note down the corresponding loss function and gradient
calculations. There is a design doc that Joseph Bradley wrote for
supporting boosting algorithms with generic weak learners but it doesn't
include implementation details. I can definitely help you understand the
existing code if you decide to work on it. However, let's discuss the
relevance of the algorithm to MLlib on the JIRA. It seems like a nice
addition though I am not sure about the implementation complexity. It will
be great to see what others think.

-Manish

On Tue, Nov 18, 2014 at 10:07 AM, Alessandro Baretta 
wrote:

> Manish,
>
> My use case for (asymmetric) absolute error is quite trivially quantile
> regression. In other words, I want to use Spark to learn conditional
> cumulative distribution functions. See R's GBM quantile regression option.
>
> If you either find or create a Jira ticket, I would be happy to give it a
> shot. Is there a design doc explaining how the gradient boosting algorithm
> is laid out in MLLib? I tried reading the code, but without a "Rosetta
> stone" it's impossible to make sense of it.
>
> Alex
>
> On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde  wrote:
>
>> Hi Alessandro,
>>
>> I think absolute error as splitting criterion might be feasible with the
>> current architecture -- I think the sufficient statistics we collect
>> currently might be able to support this. Could you let us know scenarios
>> where absolute error has significantly outperformed squared error for
>> regression trees? Also, what's your use case that makes squared error
>> undesirable.
>>
>> For gradient boosting, you are correct. The weak hypothesis weights refer
>> to tree predictions in each of the branches. We plan to explain this in
>> the 1.2 documentation and may be add some more clarifications to the
>> Javadoc.
>>
>> I will try to search for JIRAs or create new ones and update this thread.
>>
>> -Manish
>>
>>
>> On Monday, November 17, 2014, Alessandro Baretta 
>> wrote:
>>
>>> Manish,
>>>
>>> Thanks for pointing me to the relevant docs. It is unfortunate that
>>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>>
>>> Now, here's what the comments say in the current master branch:
>>> /**
>>>  * :: Experimental ::
>>>  * A class that implements Stochastic Gradient Boosting
>>>  * for regression and binary classification problems.
>>>  *
>>>  * The implementation is based upon:
>>>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>>>  *
>>>  * Notes:
>>>  *  - This currently can be run with several loss functions.  However,
>>> only SquaredError is
>>>  *fully supported.  Specifically, the loss function should be used
>>> to compute the gradient
>>>  *(to re-label training instances on each iteration) and to weight
>>> weak hypotheses.
>>>  *Currently, gradients are computed correctly for the available loss
>>> functions,
>>>  *but weak hypothesis weights are not computed correctly for LogLoss
>>> or AbsoluteError.
>>>  *Running with those losses will likely behave reasonably, but lacks
>>> the same guarantees.
>>> ...
>>> */
>>>
>>> By the looks of it, the GradientBoosting API would support an absolute
>>> error type loss function to perform quantile regression, except for "weak
>>> hypothesis weights". Does this refer to the weights of the leaves of the
>>> trees?
>>>
>>> Alex
>>>
>>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde 
>>> wrote:
>>>
 Hi Alessandro,

 MLlib v1.1 supports variance for regression and gini impurity and
 entropy for classification.
 http://spark.apache.org/docs/latest/mllib-decision-tree.html

 If the information gain calculation can be performed by distributed
 aggregation then it might be possible to plug it into the existing
 implementation. We want to perform such calculations (for e.g. median) for
 the gradient boosting models (coming up in the 1.2 release) using absolute
 error and deviance as loss functions but I don't think anyone is planning
 to work on it yet. :-)

 -Manish

 On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
 alexbare...@gmail.com> wrote:

> I see that, as of v. 1.1, MLLib supports regression and classification
> tree
> models. I assume this means that it uses a squared-error loss function
> for
> the first and logistic cost function for the second. I don't see
> support
> for quantile regression via an absolute error cost function. Or am I
> missing something?
>
> If, as it seems, this is missing, how do you recommend to implement it?
>
> Alex
>


>>>
>


Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi,

 

I noticed it is hard to find a thorough introduction to using IntelliJ to
debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for
beginners. So I spent several days figuring it out and hope that it will
be helpful for beginners like me and that professionals can help me improve
it. (The intro with figures can be found at:
http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)

 

(1) Install the Scala plugin

 

(2) Download, unzip and open spark-1.1.0 in IntelliJ 

a) mvn: File -> Open. 

Select the Spark source folder (e.g., /root/spark-1.1.0). Maybe it will
take a long time to download and compile a lot of things

b) sbt: File -> Import Project. 

Select "Import project from external model", then choose SBT project,
click Next. Input the Spark source path (e.g., /root/spark-1.1.0) for "SBT
project", and select Use auto-import.

 

(3) First compile and run the Spark examples in the console to ensure everything
is OK

# mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

# ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
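
(To actually run an example once the assembly is built, something like

# ./bin/run-example LogQuery

should work; I have not re-checked the exact run-example usage for every Spark
version, so adjust the command if it complains.)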

 

(4) Add the compiled spark-hadoop library (spark-assembly-1.1.0-hadoop2.2.0)
to "Libraries" (File -> Project Structure. -> Libraries -> green +). And
choose modules that use it (right-click the library and click "Add to
Modules"). It seems only spark-examples need it.

 

(5) In the "Dependencies" page of the modules using this library, ensure
that the "Scope" of this library is "Compile" (File -> Project Structure. ->
Modules)

(6) For sbt, it seems that we have to label the scope of all other hadoop
dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due to poor
Internet connection?) And this has to be done every time opening IntelliJ
(due to a bug?)

 

(7) Configure debug environment (using LogQuery as an example). Run -> Edit
Configurations.

Main class: org.apache.spark.examples.LogQuery

VM options: -Dspark.master=local

Working directory: /root/spark-1.1.0

Use classpath of module: spark-examples_2.10

Before launch: External tool: mvn

Program: /root/Programs/apache-maven-3.2.1/bin/mvn

Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests package

Working directory: /root/spark-1.1.0

Before launch: External tool: sbt

Program: /root/spark-1.1.0/sbt/sbt

Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly 

Working directory: /root/spark-1.1.0

 

(8) Click Run -> Debug 'LogQuery' to start debugging

 

 

Cheers,

Yiming



Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chen He
Thank you Yiming. It is helpful.

Regards!

Chen

On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang 
wrote:

> Hi,
>
>
>
> I noticed it is hard to find a thorough introduction to using IntelliJ to
> debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for
> beginners. So I spent several days to figure it out and hope that it would
> be helpful for beginners like me and that professionals can help me improve
> it. (The intro with figures can be found at:
> http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)
>
>
>
> (1) Install the Scala plugin
>
>
>
> (2) Download, unzip and open spark-1.1.0 in IntelliJ
>
> a) mvn: File -> Open.
>
> Select the Spark source folder (e.g., /root/spark-1.1.0). Maybe it will
> take a long time to download and compile a lot of things
>
> b) sbt: File -> Import Project.
>
> Select "Import project from external model", then choose SBT project,
> click Next. Input the Spark source path (e.g., /root/spark-1.1.0) for "SBT
> project", and select Use auto-import.
>
>
>
> (3) First compile and run spark examples in the console to ensure
> everything
> OK
>
> # mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
>
> # ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
>
>
>
> (4) Add the compiled spark-hadoop library
> (spark-assembly-1.1.0-hadoop2.2.0)
> to "Libraries" (File -> Project Structure. -> Libraries -> green +). And
> choose modules that use it (right-click the library and click "Add to
> Modules"). It seems only spark-examples need it.
>
>
>
> (5) In the "Dependencies" page of the modules using this library, ensure
> that the "Scope" of this library is "Compile" (File -> Project Structure.
> ->
> Modules)
>
> (6) For sbt, it seems that we have to label the scope of all other hadoop
> dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due to poor
> Internet connection?) And this has to be done every time opening IntelliJ
> (due to a bug?)
>
>
>
> (7) Configure debug environment (using LogQuery as an example). Run -> Edit
> Configurations.
>
> Main class: org.apache.spark.examples.LogQuery
>
> VM options: -Dspark.master=local
>
> Working directory: /root/spark-1.1.0
>
> Use classpath of module: spark-examples_2.10
>
> Before launch: External tool: mvn
>
> Program: /root/Programs/apache-maven-3.2.1/bin/mvn
>
> Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests package
>
> Working directory: /root/spark-1.1.0
>
> Before launch: External tool: sbt
>
> Program: /root/spark-1.1.0/sbt/sbt
>
> Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly
>
> Working directory: /root/spark-1.1.0
>
>
>
> (8) Click Run -> Debug 'LogQuery' to start debugging
>
>
>
>
>
> Cheers,
>
> Yiming
>
>


Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Chester @work
For sbt
You can simply run
sbt/sbt gen-idea

to generate the IntelliJ IDEA project modules for you. You can then just open the
generated project, which includes all the needed dependencies.

Sent from my iPhone

> On Nov 18, 2014, at 8:26 PM, Chen He  wrote:
> 
> Thank you Yiming. It is helpful.
> 
> Regards!
> 
> Chen
> 
> On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang 
> wrote:
> 
>> Hi,
>> 
>> 
>> 
>> I noticed it is hard to find a thorough introduction to using IntelliJ to
>> debug SPARK-1.1 Apps with mvn/sbt, which is not straightforward for
>> beginners. So I spent several days to figure it out and hope that it would
>> be helpful for beginners like me and that professionals can help me improve
>> it. (The intro with figures can be found at:
>> http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)
>> 
>> 
>> 
>> (1) Install the Scala plugin
>> 
>> 
>> 
>> (2) Download, unzip and open spark-1.1.0 in IntelliJ
>> 
>> a) mvn: File -> Open.
>> 
>>Select the Spark source folder (e.g., /root/spark-1.1.0). Maybe it will
>> take a long time to download and compile a lot of things
>> 
>> b) sbt: File -> Import Project.
>> 
>>Select "Import project from external model", then choose SBT project,
>> click Next. Input the Spark source path (e.g., /root/spark-1.1.0) for "SBT
>> project", and select Use auto-import.
>> 
>> 
>> 
>> (3) First compile and run spark examples in the console to ensure
>> everything
>> OK
>> 
>> # mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
>> 
>> # ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
>> 
>> 
>> 
>> (4) Add the compiled spark-hadoop library
>> (spark-assembly-1.1.0-hadoop2.2.0)
>> to "Libraries" (File -> Project Structure. -> Libraries -> green +). And
>> choose modules that use it (right-click the library and click "Add to
>> Modules"). It seems only spark-examples need it.
>> 
>> 
>> 
>> (5) In the "Dependencies" page of the modules using this library, ensure
>> that the "Scope" of this library is "Compile" (File -> Project Structure.
>> ->
>> Modules)
>> 
>> (6) For sbt, it seems that we have to label the scope of all other hadoop
>> dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due to poor
>> Internet connection?) And this has to be done every time opening IntelliJ
>> (due to a bug?)
>> 
>> 
>> 
>> (7) Configure debug environment (using LogQuery as an example). Run -> Edit
>> Configurations.
>> 
>> Main class: org.apache.spark.examples.LogQuery
>> 
>> VM options: -Dspark.master=local
>> 
>> Working directory: /root/spark-1.1.0
>> 
>> Use classpath of module: spark-examples_2.10
>> 
>> Before launch: External tool: mvn
>> 
>>Program: /root/Programs/apache-maven-3.2.1/bin/mvn
>> 
>>Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests package
>> 
>>Working directory: /root/spark-1.1.0
>> 
>> Before launch: External tool: sbt
>> 
>>Program: /root/spark-1.1.0/sbt/sbt
>> 
>>Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly
>> 
>>Working directory: /root/spark-1.1.0
>> 
>> 
>> 
>> (8) Click Run -> Debug 'LogQuery' to start debugging
>> 
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> Yiming
>> 
>> 




Apache infra github sync down

2014-11-18 Thread Patrick Wendell
Hey All,

The Apache-->github mirroring is not working right now and hasn't been
working for more than 24 hours. This means that pull requests will not
appear as closed even though they have been merged. It also causes
diffs to display incorrectly in some cases. If you'd like to follow
progress by Apache infra on this issue you can watch this JIRA:

https://issues.apache.org/jira/browse/INFRA-8654

- Patrick




Re: Apache infra github sync down

2014-11-18 Thread Reynold Xin
This basically stops us from merging patches. I'm wondering if it is
possible for ASF to give some Spark committers write permission to the github
repo. In that case, if the sync tool is down, we can manually push
periodically.

On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell 
wrote:

> Hey All,
>
> The Apache-->github mirroring is not working right now and hasn't been
> working for more than 24 hours. This means that pull requests will not
> appear as closed even though they have been merged. It also causes
> diffs to display incorrectly in some cases. If you'd like to follow
> progress by Apache infra on this issue you can watch this JIRA:
>
> https://issues.apache.org/jira/browse/INFRA-8654
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-18 Thread Yiming (John) Zhang
Hi Chester, thank you for your reply. But I tried this approach and it
failed. It seems that there are more difficulties using sbt in IntelliJ than
expected.

And according to some references "# sbt/sbt gen-idea" is not necessary
(after Spark-1.0.0?); you can simply import the Spark project and IntelliJ
will automatically generate the dependencies (but as described here, with
some possible mistakes that may fail the compilation).

Cheers,
Yiming

-----Original Message-----
From: Chester @work [mailto:ches...@alpinenow.com] 
Sent: November 19, 2014 13:00
To: Chen He
Cc: sdi...@gmail.com; dev@spark.apache.org
Subject: Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for
beginners)

For sbt
You can simply run
sbt/sbt gen-idea

to generate the IntelliJ IDEA project modules for you. You can then just open
the generated project, which includes all the needed dependencies.

Sent from my iPhone

> On Nov 18, 2014, at 8:26 PM, Chen He  wrote:
> 
> Thank you Yiming. It is helpful.
> 
> Regards!
> 
> Chen
> 
> On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang 
> 
> wrote:
> 
>> Hi,
>> 
>> 
>> 
>> I noticed it is hard to find a thorough introduction to using 
>> IntelliJ to debug SPARK-1.1 Apps with mvn/sbt, which is not 
>> straightforward for beginners. So I spent several days to figure it 
>> out and hope that it would be helpful for beginners like me and that 
>> professionals can help me improve it. (The intro with figures can be
found at:
>> http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)
>> 
>> 
>> 
>> (1) Install the Scala plugin
>> 
>> 
>> 
>> (2) Download, unzip and open spark-1.1.0 in IntelliJ
>> 
>> a) mvn: File -> Open.
>> 
>>Select the Spark source folder (e.g., /root/spark-1.1.0). Maybe it 
>> will take a long time to download and compile a lot of things
>> 
>> b) sbt: File -> Import Project.
>> 
>>Select "Import project from external model", then choose SBT 
>> project, click Next. Input the Spark source path (e.g., 
>> /root/spark-1.1.0) for "SBT project", and select Use auto-import.
>> 
>> 
>> 
>> (3) First compile and run spark examples in the console to ensure 
>> everything OK
>> 
>> # mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
>> 
>> # ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
>> 
>> 
>> 
>> (4) Add the compiled spark-hadoop library
>> (spark-assembly-1.1.0-hadoop2.2.0)
>> to "Libraries" (File -> Project Structure. -> Libraries -> green +). 
>> And choose modules that use it (right-click the library and click 
>> "Add to Modules"). It seems only spark-examples need it.
>> 
>> 
>> 
>> (5) In the "Dependencies" page of the modules using this library, 
>> ensure that the "Scope" of this library is "Compile" (File -> Project
Structure.
>> ->
>> Modules)
>> 
>> (6) For sbt, it seems that we have to label the scope of all other 
>> hadoop dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due 
>> to poor Internet connection?) And this has to be done every time 
>> opening IntelliJ (due to a bug?)
>> 
>> 
>> 
>> (7) Configure debug environment (using LogQuery as an example). Run 
>> -> Edit Configurations.
>> 
>> Main class: org.apache.spark.examples.LogQuery
>> 
>> VM options: -Dspark.master=local
>> 
>> Working directory: /root/spark-1.1.0
>> 
>> Use classpath of module: spark-examples_2.10
>> 
>> Before launch: External tool: mvn
>> 
>>Program: /root/Programs/apache-maven-3.2.1/bin/mvn
>> 
>>Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests 
>> package
>> 
>>Working directory: /root/spark-1.1.0
>> 
>> Before launch: External tool: sbt
>> 
>>Program: /root/spark-1.1.0/sbt/sbt
>> 
>>Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly
>> 
>>Working directory: /root/spark-1.1.0
>> 
>> 
>> 
>> (8) Click Run -> Debug 'LogQuery' to start debugging
>> 
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> Yiming
>> 
>> 

