Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2595#issuecomment-58476229
  
    @jkbradley Thanks for running the experiments! It is clear that the
    regression happens when the shuffle size is not large enough to make
    dist-agg faster than tree-agg, in particular with shallow trees, a small
    number of features, or a small number of trees.
    
    So the question becomes what problem scale we really want to solve in
    practice. If we train a single tree, is depth 5 good enough in most cases
    (including boosting)? If we use random forest with SQRT, would 5 trees be
    enough? It would be really helpful if we could find some references. Then
    let's decide whether we want to keep both approaches.
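The trade-off under discussion can be sketched without Spark: in a flat (dist-agg style) reduce, the driver merges every partition's statistics itself, while a tree-agg style reduce combines them in rounds so the driver only sees the last few, at the cost of extra shuffle rounds. The sketch below is an illustrative model only, not Spark's implementation; `flat_aggregate`, `tree_aggregate`, and the `depth`/`scale` choice are hypothetical stand-ins for the behavior of `RDD.aggregate` vs `RDD.treeAggregate`.

```python
from functools import reduce

def flat_aggregate(partition_stats, combine):
    # All partition results are merged in one pass at a single point
    # (the "driver"), analogous to RDD.aggregate: the driver's merge
    # cost grows linearly with the number of partitions.
    return reduce(combine, partition_stats)

def tree_aggregate(partition_stats, combine, depth=2):
    # Merge partition results in rounds of fan-in `scale`, analogous
    # to RDD.treeAggregate(depth): intermediate merges happen on the
    # "executors", so the driver only combines the final handful.
    stats = list(partition_stats)
    scale = max(2, round(len(stats) ** (1.0 / depth)))
    while len(stats) > scale:
        stats = [reduce(combine, stats[i:i + scale])
                 for i in range(0, len(stats), scale)]
    return reduce(combine, stats)

# Example: summing per-partition histograms (stand-ins for node stats).
parts = [[i, i * 2] for i in range(16)]
add = lambda a, b: [x + y for x, y in zip(a, b)]
assert flat_aggregate(parts, add) == tree_aggregate(parts, add) == [120, 240]
```

With few partitions or small per-partition statistics (shallow trees, few features, few trees), the extra rounds in the tree variant buy nothing and only add shuffle overhead, which is consistent with the regression observed above.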

