The indexing I mentioned is more restrictive than that: each index corresponds to a unique position in a binary tree. (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC)
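To make that concrete, here's a rough sketch of the indexing scheme (not the actual MLlib code; the object and method names are made up for illustration):

object NodeIndexing {
  // Row d of the tree starts at index 2^d; the root (row 0) is index 1.
  def firstIndexOfRow(depth: Int): Long = 1L << depth
  // A node at index i has its children at 2*i and 2*i + 1.
  def leftChild(index: Long): Long = 2 * index
  def rightChild(index: Long): Long = 2 * index + 1

  def main(args: Array[String]): Unit = {
    println(firstIndexOfRow(0)) // 1
    println(firstIndexOfRow(1)) // 2
    println(firstIndexOfRow(2)) // 4
    // The deepest index in a depth-30 tree is 2^31 - 1 == Int.MaxValue,
    // which is why Int node IDs effectively cap the depth at 30.
    println(firstIndexOfRow(31) - 1 == Int.MaxValue.toLong) // true
  }
}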
You're correct that this indexing restriction could be removed; with some careful thought, we could probably avoid using indices altogether. I just created https://issues.apache.org/jira/browse/SPARK-14043 to track this.

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:

> Hi, Joseph,
>
> I thought I understood why there is a limit of 30 levels for a decision tree,
> but now I'm not so sure. I thought it was because the decision tree is stored
> in an array, whose length is an int and therefore cannot exceed 2^31 - 1.
> But here are my new discoveries. I've trained two different random forest
> models of 50 trees each, with different maxDepth (20 and 30) and node size = 5.
> Here are a couple of those trees:
>
> Model with maxDepth = 20:
> depth=20, numNodes=471
> depth=19, numNodes=497
>
> Model with maxDepth = 30:
> depth=30, numNodes=11347
> depth=30, numNodes=10963
>
> The trees are clearly not well balanced, and I understand why that happens,
> but I'm surprised that the actual number of nodes is far less than 2^31 - 1.
> So now I'm not sure why the limitation actually exists. A tree consisting of
> 2^31 nodes would require about 8G of memory just to store those indices, so
> I'd say that depth isn't the biggest issue in such a case.
>
> Is it possible to work around or simply ignore the maxDepth limitation
> (without modifying the codebase) and train the tree until I hit the maximum
> number of nodes? I'd assume that in most cases I simply won't hit it, but the
> depth of the tree would be much more than 30.
>
> --
> Be well!
> Jean Morozov
>
> On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> Hi Eugene,
>>
>> The maxDepth parameter exists because the implementation uses Integer node
>> IDs which correspond to positions in the binary tree. This simplified the
>> implementation. I'd like to eventually modify it to avoid depending on tree
>> node IDs, but that is not yet on the roadmap.
>>
>> There is not an analogous limit for the GLMs you listed, but I'm not very
>> familiar with the perceptron implementation.
>>
>> Joseph
>>
>> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I'm currently working on a POC and trying to use Random Forest
>>> (classification and regression). I also have to check SVM and the multiclass
>>> perceptron (other algorithms are less important at the moment). So far I've
>>> discovered that Random Forest has a maxDepth limitation for its trees, and
>>> just out of curiosity I wonder why such a limitation was introduced?
>>>
>>> An actual question is that I'm going to use Spark ML in production next
>>> year and would like to know if there are other limitations like maxDepth in
>>> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>>
>>> Thanks in advance for your time.
>>> --
>>> Be well!
>>> Jean Morozov
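For reference, here's a rough sketch of how per-tree depth and numNodes figures like the ones quoted above can be printed with the RDD-based spark.mllib API (the data path and most parameter values below are placeholders, not taken from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object ForestDepthInspection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("forest-depth"))
    // Placeholder dataset in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    val model = RandomForest.trainClassifier(
      input = data,
      numClasses = 2,
      categoricalFeaturesInfo = Map.empty[Int, Int],
      numTrees = 50,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 20, // the implementation rejects values above 30
      maxBins = 32,
      seed = 42)

    // Per-tree statistics, in the same form as the figures quoted above.
    model.trees.foreach(t => println(s"depth=${t.depth}, numNodes=${t.numNodes}"))

    sc.stop()
  }
}

(The node size = 5 setting from the quoted experiment would go through the Strategy-based overload via minInstancesPerNode, IIRC; it's omitted here to keep the sketch short.)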