Hi Morten,
Were you able to resolve your issue with RandomForest? I am having
similar issues with a newly trained model (one with a larger number of
trees and a smaller minInstancesPerNode, by design, to produce the
best-performing model).
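
For context, the configuration is along these lines (a sketch with
placeholder values and column names, not my actual pipeline):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Illustrative only: more trees and a smaller minInstancesPerNode
    // than the previous model, as described above.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")          // placeholder column name
      .setFeaturesCol("features")    // placeholder column name
      .setNumTrees(100)              // larger number of trees
      .setMinInstancesPerNode(1)     // smaller minInstancesPerNode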

I wanted to get some feedback on how you solved your issue before I post a
separate question.

Thanks!
Sumona

On Sun, Dec 11, 2016 at 4:10 AM Marco Mistroni <mmistr...@gmail.com> wrote:

> OK. Did you change the Spark version? The Java/Scala/Python version?
> Have you tried different versions of any of the above?
> Hope this helps
> Kr
>
> On 10 Dec 2016 10:37 pm, "Morten Hornbech" <mor...@datasolvr.com> wrote:
>
>> I haven’t actually experienced any non-determinism. We have nightly
>> integration tests comparing output from random forests with no variations.
>>
>> The workaround we will probably try is to split the dataset, either
>> randomly or on one of the variables, and then train a forest on each
>> partition, which should then be sufficiently small.
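>>
>> A rough sketch of what I mean (the split ratios and the DataFrame
>> name are placeholders, not our actual code):
>>
>>   import org.apache.spark.ml.classification.RandomForestClassifier
>>
>>   // Split the input randomly into three smaller parts and train a
>>   // separate forest on each; df is an existing DataFrame with
>>   // "label" and "features" columns.
>>   val Array(part1, part2, part3) =
>>     df.randomSplit(Array(0.34, 0.33, 0.33), seed = 42)
>>   val rf = new RandomForestClassifier()
>>   val models = Seq(part1, part2, part3).map(p => rf.fit(p))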
>>
>> I hope to be able to provide a good repro case in a few weeks. If
>> the problem turns out to be in our own code, I will post that in
>> this thread as well.
>>
>> Morten
>>
>> Den 10. dec. 2016 kl. 23.25 skrev Marco Mistroni <mmistr...@gmail.com>:
>>
>> Hello Morten,
>> OK.
>> AFAIK there is a tiny bit of randomness in these ML algorithms
>> (please, anyone correct me if I'm wrong).
>> In fact, if you run your random forest code multiple times it will
>> not give you EXACTLY the same results (though accuracy and errors
>> should be more or less similar)... at least this is what I found
>> when playing around with random forests, decision trees, and other
>> ML algorithms.
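>>
>> For what it's worth, you can pin the randomness down with a seed; a
>> minimal sketch, assuming the spark.ml API:
>>
>>   import org.apache.spark.ml.classification.RandomForestClassifier
>>
>>   val rf = new RandomForestClassifier()
>>     .setSeed(12345L)  // fixed seed makes subsampling and feature
>>                       // subset selection repeatable across runs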
>>
>> If a random forest is not a must for your use case, could you try
>> scaling back to a decision tree and see if you still get
>> intermittent failures? This would at least exclude issues with the
>> data.
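>>
>> Something like this (a sketch; trainingData stands in for your
>> existing DataFrame):
>>
>>   import org.apache.spark.ml.classification.DecisionTreeClassifier
>>
>>   // Single tree with comparable settings, to rule out data issues
>>   val dt = new DecisionTreeClassifier()
>>     .setMaxDepth(5)
>>     .setMaxBins(32)
>>   val model = dt.fit(trainingData)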
>>
>> hth
>>  marco
>>
>> On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech <mor...@datasolvr.com>
>> wrote:
>>
>>> Already did. There are no issues with smaller samples. I am running
>>> this in a cluster of three t2.large instances on AWS.
>>>
>>> I have tried to find the threshold where the error occurs, but no
>>> single factor causes it. Input size and subsampling rate seem to be
>>> the most significant, and the number of trees the least.
>>>
>>> I have also tried running on a test frame of randomized numbers with the
>>> same number of rows, and could not reproduce the problem here.
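>>>
>>> The synthetic frame was built roughly along these lines (a sketch,
>>> not the exact code; spark is an existing SparkSession):
>>>
>>>   import org.apache.spark.ml.feature.VectorAssembler
>>>   import org.apache.spark.sql.functions.rand
>>>
>>>   // 600k rows of random features with a random binary label
>>>   val df = spark.range(600000L)
>>>     .withColumn("f1", rand())
>>>     .withColumn("f2", rand())
>>>     .withColumn("label", (rand() * 2).cast("int").cast("double"))
>>>   val test = new VectorAssembler()
>>>     .setInputCols(Array("f1", "f2"))
>>>     .setOutputCol("features")
>>>     .transform(df)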
>>>
>>> By the way maxDepth is 5 and maxBins is 32.
>>>
>>> I will probably need to leave this for a few weeks to focus on more
>>> short-term stuff, but I will write here if I solve it or reproduce it more
>>> consistently.
>>>
>>> Morten
>>>
>>> Den 10. dec. 2016 kl. 17.29 skrev Marco Mistroni <mmistr...@gmail.com>:
>>>
>>> Hi
>>> Bring the samples back to the 1k range to debug... or, as
>>> suggested, reduce the number of trees and bins... I have had random
>>> forests running on data of the same size with no issues... or send
>>> me some sample code and data and I will try it out on my EC2
>>> instance.
>>> Kr
>>>
>>> On 10 Dec 2016 3:16 am, "Md. Rezaul Karim" <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> I had a similar experience last week. I could not find any error
>>>> trace either.
>>>>
>>>> Later on, I did the following to get rid of the problem:
>>>> i) I downgraded to Spark 2.0.0
>>>> ii) I decreased the values of maxBins and maxDepth
>>>>
>>>> Additionally, make sure that you set the featureSubsetStrategy to
>>>> "auto" to let the algorithm choose the best feature subset strategy
>>>> for your data. Finally, set the impurity to "gini" for computing
>>>> the information gain.
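>>>>
>>>> Put together, roughly like this (a sketch; the exact values will
>>>> depend on your data):
>>>>
>>>>   import org.apache.spark.ml.classification.RandomForestClassifier
>>>>
>>>>   val rf = new RandomForestClassifier()
>>>>     .setMaxBins(16)                    // decreased from the default of 32
>>>>     .setMaxDepth(4)                    // decreased from the default of 5
>>>>     .setFeatureSubsetStrategy("auto")  // let the algorithm choose
>>>>     .setImpurity("gini")               // impurity measure for the splits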
>>>>
>>>> However, setting the number of trees to just 1 gives you neither
>>>> the real advantage of a forest nor better predictive performance.
>>>>
>>>>
>>>>
>>>> Best,
>>>> Karim
>>>>
>>>>
>>>> On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I have spent quite some time trying to debug an issue with the
>>>>> Random Forest algorithm on Spark 2.0.2. The input dataset is
>>>>> relatively large, at around 600k rows and 200MB, but I use
>>>>> subsampling to make each tree manageable. However, even with only
>>>>> 1 tree and a low sample rate of 0.05, the job hangs at one of the
>>>>> final stages (see attached). I have checked the logs on all
>>>>> executors and the driver and found no trace of an error. Could it
>>>>> be a memory issue even though no error appears? The error does
>>>>> seem sporadic to some extent, so I also wondered whether it could
>>>>> be a data issue that only occurs if the subsample includes the bad
>>>>> data rows.
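>>>>>
>>>>> For reference, the training setup is essentially this (a sketch;
>>>>> "data" stands in for our real input of ~600k rows / ~200MB):
>>>>>
>>>>>   import org.apache.spark.ml.classification.RandomForestClassifier
>>>>>
>>>>>   val rf = new RandomForestClassifier()
>>>>>     .setNumTrees(1)            // even a single tree triggers the hang
>>>>>     .setSubsamplingRate(0.05)  // low sample rate per tree
>>>>>   val model = rf.fit(data)     // hangs at one of the final stages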
>>>>>
>>>>> Please comment if you have a clue.
>>>>>
>>>>> Morten
>>>>>
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>>
