The indexing I mentioned is more restrictive than that: each index
corresponds to a unique position in a binary tree.  (I.e., the first index
of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC)
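
To make the arithmetic concrete, here's a tiny standalone sketch of that
indexing scheme (illustrative only, not the actual Spark code): the children of
node n are 2*n and 2*n + 1, so a depth-30 tree already needs node IDs up to
2^31 - 1, which is exactly Int.MaxValue.

    // Minimal sketch of 1-based binary-tree node indexing (not Spark's internals).
    // Root is 1; children of node n are 2*n and 2*n + 1, so row d starts at 2^d.
    object NodeIndexing {
      def leftChild(index: Long): Long  = 2 * index
      def rightChild(index: Long): Long = 2 * index + 1

      // First index of the row at depth d (root is depth 0).
      def firstIndexOfRow(depth: Int): Long = 1L << depth

      def main(args: Array[String]): Unit = {
        // Row 0 starts at 1, row 1 at 2, row 2 at 4, ...
        (0 to 3).foreach(d => println(s"row $d starts at index ${firstIndexOfRow(d)}"))

        // The deepest row of a depth-30 tree spans indices 2^30 .. 2^31 - 1,
        // and 2^31 - 1 is exactly Int.MaxValue; one level deeper would
        // overflow a 32-bit node ID.
        println(firstIndexOfRow(31) - 1 == Int.MaxValue)  // prints: true
      }
    }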

You're correct that this restriction could be removed; with some careful
thought, we could probably avoid using indices altogether.  I just created
https://issues.apache.org/jira/browse/SPARK-14043  to track this.

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov <evgeny.a.moro...@gmail.com
> wrote:

> Hi, Joseph,
>
> I thought I understood why there is a limit of 30 levels for a decision
> tree, but now I'm not that sure. I thought it was because the decision tree
> is stored in an array whose length is an int, which cannot be more than
> 2^31 - 1.
> But here are my new findings. I've trained two different random forest
> models of 50 trees each, with different maxDepth (20 and 30) and node
> size = 5. Here are a couple of trees from those models:
>
> Model with maxDepth = 20:
>     depth=20, numNodes=471
>     depth=19, numNodes=497
>
> Model with maxDepth = 30:
>     depth=30, numNodes=11347
>     depth=30, numNodes=10963
>
> It looks like the trees are not particularly balanced, and I understand why
> that happens, but I'm surprised that the actual number of nodes is far less
> than 2^31 - 1. So now I'm not sure why the limitation actually exists. A
> tree consisting of 2^31 nodes would need about 8 GB of memory just to store
> the indexes (2^31 ints at 4 bytes each), so I'd say that depth isn't the
> biggest issue in such a case.
>
> Is it possible to work around or simply ignore the maxDepth limitation
> (without modifying the codebase) and train the tree until I hit the maximum
> number of nodes? I'd assume that in most cases I simply won't hit it, but
> the depth of the tree could be much more than 30.
>
>
> --
> Be well!
> Jean Morozov
>
> On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Hi Eugene,
>>
>> The maxDepth parameter exists because the implementation uses Integer
>> node IDs which correspond to positions in the binary tree.  This simplified
>> the implementation.  I'd like to eventually modify it to avoid depending on
>> tree node IDs, but that is not yet on the roadmap.
>>
>> There is not an analogous limit for the GLMs you listed, but I'm not very
>> familiar with the perceptron implementation.
>>
>> Joseph
>>
>> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I'm currently working on a POC and trying to use Random Forest
>>> (classification and regression). I also have to check SVM and multiclass
>>> perceptron (other algorithms are less important at the moment). So far
>>> I've discovered that Random Forest has a maxDepth limitation for its
>>> trees, and just out of curiosity I wonder why such a limitation was
>>> introduced?
>>>
>>> The actual question: I'm going to use Spark ML in production next year
>>> and would like to know whether there are other limitations, like maxDepth
>>> in RF, for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>>
>>> Thanks in advance for your time.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>
>>
>
