Sorry, that was incomplete information. I think Spark's compression helped
(though I'm not sure how much), so the actual memory requirement may have been
smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung wrote:
> I would argue that memory in clusters is still a limited resource and it's
> still beneficial to use memory as economically as possible...
I would argue that memory in clusters is still a limited resource and it's
still beneficial to use memory as economically as possible. Let's say that
you are training a gradient boosted model in Spark, which could conceivably
take several hours to build hundreds to thousands of trees. You do not want...
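(For concreteness, a boosted run of the kind described above can be sketched with the RDD-based MLlib API that appeared after this thread; GradientBoostedTrees landed around Spark 1.2/1.3. The toy data, sizes, and parameter values below are illustrative assumptions, not the poster's setup.)

# Hedged sketch: each boosting iteration adds one tree and re-scans the
# cached training data, which is why runs with hundreds to thousands of
# trees take hours and keep memory pinned for the whole time.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees
import random

sc = SparkContext("local[*]", "gbt-sketch")

def make_point(_):
    x = [random.random() for _ in range(10)]
    return LabeledPoint(1.0 if sum(x[:3]) > 1.5 else 0.0, x)

data = sc.parallelize(range(20000)).map(make_point).cache()

# numIterations is the number of trees; production runs would use far more.
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=100, maxDepth=3)
print(model.numTrees())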
I don't think the YARN default of max 8GB container size is a good
justification for limiting memory per worker. This is a sort of arbitrary
number that came from an era where MapReduce was the main YARN application
and machines generally had less memory. I expect to see this get configured
as...
On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung wrote:
> Debasish,
>
> Unfortunately, we are bound to YARN, at least for the time being, because
> that's what most of our customers would be using (unless, all the Hadoop
> vendors start supporting standalone Spark - I think Cloudera might do
> that?).
Sorry for arriving late to the party! Evan has clearly explained the
current implementation, our future plans and key differences with the
PLANET paper. I don't think I can add more to his comments. :-)
I apologize for not creating the corresponding JIRA tickets for the tree
improvements (multiclass classification, ...)
Debasish,
Unfortunately, we are bound to YARN, at least for the time being, because
that's what most of our customers would be using (unless, all the Hadoop
vendors start supporting standalone Spark - I think Cloudera might do
that?).
On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das wrote:
> Spark on YARN is a big pain due to the strict memory requirement per
> container...
Spark on YARN is a big pain due to the strict memory requirement per
container...
If you are stress testing it, could you use a standalone cluster and see at
which feature count the per-worker RAM requirement reaches 16 GB or
more... it is possible to get 16 GB instances on EC2 these days with...
Thanks for the info on the memory requirement.
I think that a lot of businesses would probably prefer to use Spark on top
of YARN, since that's what they invest in - a large Hadoop cluster. And the
default setting for YARN seems to cap memory per container at 8 GB - so
ideally, we would like to use a lot...
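(For reference, the relevant knobs, sketched with illustrative sizes: the Spark request is made per executor, and YARN rejects any container bigger than its yarn.scheduler.maximum-allocation-mb, which has historically defaulted to 8192 MB.)

# Hedged sketch of the Spark-side settings; actual values depend on the
# cluster. Executor heap plus the YARN overhead must fit under YARN's
# yarn.scheduler.maximum-allocation-mb.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tree-training")
        .setMaster("local[*]")  # would be "yarn" on a real cluster; local so the sketch runs standalone
        .set("spark.executor.memory", "6g")                  # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # off-heap headroom in MB (YARN mode)
sc = SparkContext(conf=conf)
print(conf.get("spark.executor.memory"))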
Interesting, and thanks for the thoughts.
I think we're on the same page with 100s of millions of records. We've
tested the tree implementation in mllib on 1b rows and up to 100 features -
though this isn't hitting the 1000s of features you mention.
Obviously multi-class support isn't there yet...
Hi,
Stratosphere does not have a real RF implementation yet; there is only a
prototype, developed by students in a university course, which is far from
production usage at this stage.
--sebastian
On 04/18/2014 10:31 AM, Sean Owen wrote:
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT...
Sorry, I mismatched the link; it should be
https://gist.github.com/wpm/6454814
and the algorithm is not ExtraTrees but a basic ensemble of boosted trees.
2014-04-18 10:31 GMT+02:00 Eustache DIEMERT:
> Another option is to use ExtraTrees as provided by scikit-learn with
> pyspark:
>
> https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
Is there a PR or issue where GBT / RF progress in MLlib is tracked?
2014-04-17 21:11 GMT+02:00 Evan R. Sparks:
> Sorry - I meant to say that "Multiclass classification, Gradient
> Boosting, and Random Forest support based on the recent Decision Tree
> implementation in MLlib is planned and coming soon."
Another option is to use ExtraTrees as provided by scikit-learn with
pyspark:
https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
this is a proof of concept right now and should be hacked to fit what you need,
but the core decision tree implementation is highly optimized and...
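(Roughly, the pattern looks like the following. This is a rewritten sketch under my own assumptions - toy data, one small ExtraTrees forest fitted per task on a bootstrap sample of a broadcast dataset, probabilities averaged on the driver - not the pyrallel code itself.)

import numpy as np
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

sc = SparkContext("local[*]", "extratrees-sketch")

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
bX, by = sc.broadcast(X), sc.broadcast(y)

def fit_subforest(seed):
    # Each task trains a small forest on a bootstrap sample of the broadcast data.
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, bX.value.shape[0], bX.value.shape[0])
    clf = ExtraTreesClassifier(n_estimators=10, random_state=seed)
    clf.fit(bX.value[idx], by.value[idx])
    return clf

forests = sc.parallelize(range(8), 8).map(fit_subforest).collect()

# Combine the sub-forests by averaging their class probabilities.
proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
print("train accuracy:", (proba.argmax(axis=1) == y).mean())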
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT,
as you will almost certainly need this patch to make it run reasonably
fast: https://issues.apache.org/jira/browse/MAHOUT-1419
I have not tried Stratosphere here.
Since we are on the subject of RDF on Hadoop, possibly on M/R:
has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?
Regards,
Laeeq
On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung wrote:
Yes, it should be data specific and perhaps we're biased toward the data sets
that we are playing with. To put things in perspective, we're highly
interested in...
Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and, I believe, our customers are):
1. large datasets (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories...
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to
build high-accuracy models" may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider...
I believe they show one example comparing a depth-1 ensemble vs. a depth-3
ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das wrote:
> Evan,
>
> Wasn't the mllib decision tree implemented using ideas from Google's PLANET
> paper... does the paper also propose to grow a shallow tree?
Evan,
Wasn't the mllib decision tree implemented using ideas from Google's PLANET
paper... does the paper also propose to grow a shallow tree?
Thanks.
Deb
On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung wrote:
> Additionally, the 'random features per node' (or mtry in R) is a very
> important feature for Random Forest...
Additionally, the 'random features per node' (or mtry in R) is a very
important feature for Random Forest. The variance reduction comes from the
trees being decorrelated from each other, and the random features per node
often do more for that than the bootstrap samples. And this is something that
would have to be...
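(As an illustration of that knob - scikit-learn calls it max_features, R's randomForest calls it mtry - here is a small sketch on synthetic data; the sizes and candidate values are arbitrary.)

# max_features controls how many candidate features each split considers;
# 1.0 means all features, i.e. plain bagging with no per-node subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)

for mtry in ("sqrt", 0.5, 1.0):
    rf = RandomForestClassifier(n_estimators=100, max_features=mtry,
                                random_state=0, n_jobs=-1)
    print(mtry, round(cross_val_score(rf, X, y, cv=3).mean(), 3))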
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."
Now, The Elements of Statistical Learning (page 598) says that you
could potentially overfit fully-grown regression random forests...
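(For what it's worth, scikit-learn's forest keeps Breiman's default - max_depth=None, i.e. grow until the leaves are pure - so the realized depths are easy to inspect. A small sketch on synthetic data, purely illustrative; get_depth() needs a reasonably recent scikit-learn.)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=0)

# max_depth=None: "grow the tree ... to maximum size and do not prune."
rf = RandomForestClassifier(n_estimators=20, max_depth=None,
                            random_state=0, n_jobs=-1).fit(X, y)
print(sorted(t.get_depth() for t in rf.estimators_))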
Hmm... can you provide some pointers to examples where deep trees are
helpful?
Typically with Decision Trees you limit depth (either directly or
indirectly with minimum node size and minimum improvement criteria) to
avoid overfitting. I agree with the assessment that forests are a variance
reduction technique...
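(Both styles of control - a direct depth cap, and the indirect minimum-node-size / minimum-improvement criteria - map onto parameters that later MLlib releases expose. A hedged sketch on toy data; the parameter names are from the Spark 1.2+ RDD-based API and the values are arbitrary.)

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
import random

sc = SparkContext("local[*]", "depth-control-sketch")

def make_point(_):
    x = [random.random() for _ in range(10)]
    return LabeledPoint(1.0 if sum(x[:3]) > 1.5 else 0.0, x)

data = sc.parallelize(range(20000)).map(make_point).cache()

# maxDepth caps depth directly; minInstancesPerNode and minInfoGain stop
# splitting indirectly, the "minimum node size / minimum improvement" idea.
model = DecisionTree.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    impurity="gini", maxDepth=10, maxBins=32,
    minInstancesPerNode=20, minInfoGain=1e-3)
print(model.depth(), model.numNodes())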
Evan,
I actually haven't heard of 'shallow' random forest. I think that the only
scenarios where shallow trees are useful are boosting scenarios.
AFAIK, Random Forest is a variance reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
effect)...
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient Boosting, and Random Forest support for
> based on the recent Decision Tree implementation in MLlib.
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that...
Debasish, we've tested the MLLib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to get 100+ depth with even only 100,000+
rows (because trees...
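(A back-of-the-envelope note on why depth bites so hard, assuming a roughly full binary tree and a trainer that keeps per-node split statistics: the node count, and hence that bookkeeping, roughly doubles with every level.)

# Upper bound on node count for a binary tree of the given depth.
for depth in (8, 9, 10, 15, 20):
    print(depth, 2 ** (depth + 1) - 1)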
Mllib has a decision tree... there is an RF PR which is not active now... take
that and swap the tree builder with the fast tree builder that's in
mllib... search for the Spark JIRA... the code is based on the Google PLANET
paper...
I am sure people on the dev list are already working on it... send an email to...