Sorry, that was incomplete information. I think Spark's compression helped
(though I'm not sure how much), so the actual memory requirement may have been
smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung wrote:
> I would argue that memory in clusters is still a limited resource and it's
> still beneficial to use memory as economically as possible...
I would argue that memory in clusters is still a limited resource and it's
still beneficial to use memory as economically as possible. Let's say that
you are training a gradient boosted model in Spark, which could conceivably
take several hours to build hundreds to thousands of trees. You do not want...
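(For concreteness, a boosted run of the kind described above can be sketched with the RDD-based MLlib API that appeared after this thread; GradientBoostedTrees landed around Spark 1.2/1.3. The toy data, sizes, and parameter values below are illustrative assumptions, not the poster's setup.)

# Hedged sketch: each boosting iteration adds one tree and re-scans the
# cached training data, which is why runs with hundreds to thousands of
# trees take hours and keep memory pinned for the whole time.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees
import random

sc = SparkContext("local[*]", "gbt-sketch")

def make_point(_):
    x = [random.random() for _ in range(10)]
    return LabeledPoint(1.0 if sum(x[:3]) > 1.5 else 0.0, x)

data = sc.parallelize(range(20000)).map(make_point).cache()

# numIterations is the number of trees; production runs would use far more.
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=100, maxDepth=3)
print(model.numTrees())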
I don't think the YARN default of max 8GB container size is a good
justification for limiting memory per worker. This is a sort of arbitrary
number that came from an era where MapReduce was the main YARN application
and machines generally had less memory. I expect to see this get configured
as...
On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung wrote:
> Debasish,
>
> Unfortunately, we are bound to YARN, at least for the time being, because
> that's what most of our customers would be using (unless, all the Hadoop
> vendors start supporting standalone Spark - I think Cloudera might do
> that?).
Sorry for arriving late to the party! Evan has clearly explained the
current implementation, our future plans and key differences with the
PLANET paper. I don't think I can add more to his comments. :-)
I apologize for not creating the corresponding JIRA tickets for the tree
improvements (multiclass classification, ...)
Debasish,
Unfortunately, we are bound to YARN, at least for the time being, because
that's what most of our customers would be using (unless, all the Hadoop
vendors start supporting standalone Spark - I think Cloudera might do
that?).
On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das wrote:
> Spark on YARN is a big pain due to the strict memory requirement per
> container...
Spark on YARN is a big pain due to the strict memory requirement per
container...
If you are stress testing it, could you use a standalone cluster and see at
which feature count the per-worker RAM requirement reaches 16 GB or
more... it is possible to get 16 GB instances on EC2 these days with...
Thanks for the info on the memory requirement.
I think that a lot of businesses would probably prefer to use Spark on top
of YARN, since that's what they invest in - a large Hadoop cluster. And the
default setting for YARN seems to cap memory per container at 8 GB - so
ideally, we would like to use a lot...
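(For reference, the relevant knobs, sketched with illustrative sizes: the Spark request is made per executor, and YARN rejects any container bigger than its yarn.scheduler.maximum-allocation-mb, which has historically defaulted to 8192 MB.)

# Hedged sketch of the Spark-side settings; actual values depend on the
# cluster. Executor heap plus the YARN overhead must fit under YARN's
# yarn.scheduler.maximum-allocation-mb.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tree-training")
        .setMaster("local[*]")  # would be "yarn" on a real cluster; local so the sketch runs standalone
        .set("spark.executor.memory", "6g")                  # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # off-heap headroom in MB (YARN mode)
sc = SparkContext(conf=conf)
print(conf.get("spark.executor.memory"))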
Interesting, and thanks for the thoughts.
I think we're on the same page with 100s of millions of records. We've
tested the tree implementation in mllib on 1b rows and up to 100 features -
though this isn't hitting the 1000s of features you mention.
Obviously multi-class support isn't there yet...
Hi,
Stratosphere does not have a real RF implementation yet; there is only a
prototype, developed by students in a university course, which is far from
production usage at this stage.
--sebastian
On 04/18/2014 10:31 AM, Sean Owen wrote:
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT...
Sorry, I mismatched the link; it should be
https://gist.github.com/wpm/6454814
and the algorithm is not ExtraTrees but a basic ensemble of boosted trees.
2014-04-18 10:31 GMT+02:00 Eustache DIEMERT:
> Another option is to use ExtraTrees as provided by scikit-learn with
> pyspark:
>
> https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
Is there a PR or issue where GBT / RF progress in MLlib is tracked?
2014-04-17 21:11 GMT+02:00 Evan R. Sparks:
> Sorry - I meant to say that "Multiclass classification, Gradient
> Boosting, and Random Forest support based on the recent Decision Tree
> implementation in MLlib is planned and coming soon."
Another option is to use ExtraTrees as provided by scikit-learn with
pyspark:
https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
this is a proof of concept right now and should be hacked to fit what you need,
but the core decision tree implementation is highly optimized and...
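(Roughly, the pattern looks like the following. This is a rewritten sketch under my own assumptions - toy data, one small ExtraTrees forest fitted per task on a bootstrap sample of a broadcast dataset, probabilities averaged on the driver - not the pyrallel code itself.)

import numpy as np
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

sc = SparkContext("local[*]", "extratrees-sketch")

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
bX, by = sc.broadcast(X), sc.broadcast(y)

def fit_subforest(seed):
    # Each task trains a small forest on a bootstrap sample of the broadcast data.
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, bX.value.shape[0], bX.value.shape[0])
    clf = ExtraTreesClassifier(n_estimators=10, random_state=seed)
    clf.fit(bX.value[idx], by.value[idx])
    return clf

forests = sc.parallelize(range(8), 8).map(fit_subforest).collect()

# Combine the sub-forests by averaging their class probabilities.
proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
print("train accuracy:", (proba.argmax(axis=1) == y).mean())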
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT,
as you will almost certainly need this patch to make it run reasonably
fast: https://issues.apache.org/jira/browse/MAHOUT-1419
I have not tried Stratosphere here.
Since we are on the subject of RDF on Hadoop, possibly on M/R:
has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?
Regards,
Laeeq
On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung wrote:
Yes, it should be data specific and perhaps we're biased toward the data sets
that we are playing with. To put things in perspective, we're highly
interested in...
Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and, I believe, our customers are):
1. large datasets (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories...
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to
build high-accuracy models" may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider...
I believe they show one example comparing a depth-1 ensemble vs. a depth-3
ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das wrote:
> Evan,
>
> Wasn't the mllib decision tree implemented using ideas from Google's PLANET
> paper... does the paper also propose to grow a shallow tree?
Evan,
Wasn't the mllib decision tree implemented using ideas from Google's PLANET
paper... does the paper also propose to grow a shallow tree?
Thanks.
Deb
On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung wrote:
> Additionally, the 'random features per node' (or mtry in R) is a very
> important feature for Random Forest...
Additionally, the 'random features per node' (or mtry in R) is a very
important feature for Random Forest. The variance reduction comes from the
trees being decorrelated from each other, and the random features per node
often do more for that than the bootstrap samples. And this is something that
would have to be...
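(As an illustration of that knob - scikit-learn calls it max_features, R's randomForest calls it mtry - here is a small sketch on synthetic data; the sizes and candidate values are arbitrary.)

# max_features controls how many candidate features each split considers;
# 1.0 means all features, i.e. plain bagging with no per-node subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)

for mtry in ("sqrt", 0.5, 1.0):
    rf = RandomForestClassifier(n_estimators=100, max_features=mtry,
                                random_state=0, n_jobs=-1)
    print(mtry, round(cross_val_score(rf, X, y, cv=3).mean(), 3))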
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."
Now, The Elements of Statistical Learning (page 598) says that you
could potentially overfit fully-grown regression random forests...
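(For what it's worth, scikit-learn's forest keeps Breiman's default - max_depth=None, i.e. grow until the leaves are pure - so the realized depths are easy to inspect. A small sketch on synthetic data, purely illustrative; get_depth() needs a reasonably recent scikit-learn.)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=0)

# max_depth=None: "grow the tree ... to maximum size and do not prune."
rf = RandomForestClassifier(n_estimators=20, max_depth=None,
                            random_state=0, n_jobs=-1).fit(X, y)
print(sorted(t.get_depth() for t in rf.estimators_))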
Hmm... can you provide some pointers to examples where deep trees are
helpful?
Typically with Decision Trees you limit depth (either directly or
indirectly with minimum node size and minimum improvement criteria) to
avoid overfitting. I agree with the assessment that forests are a variance
reduction technique...
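(Both styles of control - a direct depth cap, and the indirect minimum-node-size / minimum-improvement criteria - map onto parameters that later MLlib releases expose. A hedged sketch on toy data; the parameter names are from the Spark 1.2+ RDD-based API and the values are arbitrary.)

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
import random

sc = SparkContext("local[*]", "depth-control-sketch")

def make_point(_):
    x = [random.random() for _ in range(10)]
    return LabeledPoint(1.0 if sum(x[:3]) > 1.5 else 0.0, x)

data = sc.parallelize(range(20000)).map(make_point).cache()

# maxDepth caps depth directly; minInstancesPerNode and minInfoGain stop
# splitting indirectly, the "minimum node size / minimum improvement" idea.
model = DecisionTree.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    impurity="gini", maxDepth=10, maxBins=32,
    minInstancesPerNode=20, minInfoGain=1e-3)
print(model.depth(), model.numNodes())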
Evan,
I actually haven't heard of 'shallow' random forest. I think that the only
scenarios where shallow trees are useful are boosting scenarios.
AFAIK, Random Forest is a variance reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
effect)...
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient Boosting, and Random Forest support for
> based on the recent Decision Tree implementation in MLlib.
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that...
Debasish, we've tested the MLLib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to get 100+ depth with even only 100,000+
rows (because trees...
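(A back-of-the-envelope note on why depth bites so hard, assuming a roughly full binary tree and a trainer that keeps per-node split statistics: the node count, and hence that bookkeeping, roughly doubles with every level.)

# Upper bound on node count for a binary tree of the given depth.
for depth in (8, 9, 10, 15, 20):
    print(depth, 2 ** (depth + 1) - 1)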
Mllib has a decision tree... there is an RF PR which is not active now... take
that and swap the tree builder with the fast tree builder that's in
mllib... search for the Spark JIRA... the code is based on the Google PLANET
paper...
I am sure people on the dev list are already working on it... send an email to...