Hi Filipus,
The train data is already oversampled.
The number of positives I mentioned above is for the test dataset: 12,028
(apologies for not making this clear earlier).
The train dataset has 61,264 positives out of 689,763 total rows; the
number of negatives is 628,499.
Oversampling was done on the train dataset to ensure that at least 9-10%
of the rows in the train part are positives.
No oversampling was done on the test dataset.
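For reference, the oversampling was done roughly like this (a minimal
sketch assuming the Spark 1.0 RDD API; `data` and the sampling fraction
are illustrative, not our actual code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Split the training data by class label.
    val positives: RDD[LabeledPoint] = data.filter(_.label == 1.0)
    val negatives: RDD[LabeledPoint] = data.filter(_.label == 0.0)

    // Sampling with replacement allows fraction > 1, so the positives get
    // replicated until they form roughly 9-10% of the final train set.
    val oversampledPos = positives.sample(true, 8.0, 42L) // (withReplacement, fraction, seed)
    val train = oversampledPos.union(negatives)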

So the only remaining difference is the amount of data used to build a
single tree.

But I have a few more questions:
Have we tested how much data can be used, at most, to build a single
Decision Tree?
Since I have enough RAM to fit all the data into memory (only 1.3 GB of
train data and 30 x 3 GB of RAM), I would expect it to build a single
Decision Tree on all the data without any issues. But for maxDepth >= 5,
it is not able to. I confirmed that while it keeps running for hours, more
than 70% of the memory is still free, so it doesn't seem to be a memory
issue either.
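For context, the run that hangs boils down to a call like this (a minimal
sketch of the Spark 1.0 MLlib API; `train` is the oversampled
RDD[LabeledPoint] from the sketch above, and the exact code here is
illustrative):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo._
    import org.apache.spark.mllib.tree.impurity.Gini

    // Finishes for maxDepth <= 4; for maxDepth >= 5 it runs for hours,
    // with more than 70% of the memory still free.
    val maxDepth = 5
    val model = DecisionTree.train(train, Classification, Gini, maxDepth)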


Thanks and Regards,
Suraj Sheth


On Wed, Jun 11, 2014 at 10:19 PM, filipus <floe...@gmail.com> wrote:

> Well, I guess your problem is quite unbalanced, and with information
> value as the splitting criterion, I guess the algorithm stops after
> very few splits.
>
> A workaround is oversampling.
>
> Build many training datasets, e.g.:
>
> randomly take 50% of the positives, and from the negatives take the same
> amount, or say double that
>
> => 6000 positives and 12000 negatives
>
> build a tree
>
> do this many times => many models (agents)
>
> and then make an ensemble model, i.e., let all the models vote
>
> in a way similar to random forest, but built in a completely different
> way
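For completeness, here is a minimal sketch of the balanced-bagging
ensemble filipus describes above (assuming the Spark 1.0 MLlib API; all
names, counts, and fractions are illustrative, not actual code):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo._
    import org.apache.spark.mllib.tree.impurity.Gini
    import org.apache.spark.mllib.linalg.Vector

    // One tree per balanced subsample: 50% of the positives plus roughly
    // twice as many negatives, as suggested above.
    val posCount = positives.count().toDouble
    val negFraction = posCount / negatives.count()
    val models = (1 to 10).map { seed =>
      val pos = positives.sample(false, 0.5, seed)          // (withReplacement, fraction, seed)
      val neg = negatives.sample(false, negFraction, seed)  // ~2x the sampled positives
      DecisionTree.train(pos.union(neg), Classification, Gini, 5)
    }

    // Ensemble prediction: simple majority vote across the trees.
    def votedPredict(features: Vector): Double = {
      val positiveVotes = models.count(_.predict(features) >= 0.5)
      if (2 * positiveVotes >= models.size) 1.0 else 0.0
    }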
