Hi Sumona

I’m afraid I never really resolved the issue. In fact, I have just had to 
roll back an upgrade from 2.1.0 to 2.1.1 because it (for reasons unknown) 
reintroduced the issue in our nightly integration tests (see 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-upgrading-to-Spark-2-1-1-from-2-1-0-tc28660.html).

The “solution” for me at the time was to wave my magic Spark wand and hope for 
the best. That generally means:

- trying to increase memory, or to reduce the amount of memory required 
(smaller datasets, a lower sample rate, more partitions, less caching) 
- making more or less random changes to other parts of the pipeline, including 
SQL statements and adding/removing calls such as repartition/coalesce
- fiddling with various Spark configuration settings

In this specific case, I think it was the subsampling rate that did the trick.
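
To make it concrete, the knobs I end up turning look roughly like the sketch 
below. This is purely illustrative and not our actual pipeline - a SparkSession 
named spark is assumed, and the path, column names and parameter values are 
made up:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Hypothetical input; "label" and "features" columns are assumed to exist.
    val train = spark.read.parquet("/path/to/training").repartition(200)

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(20)
      .setMaxDepth(5)
      .setMaxBins(32)
      .setSubsamplingRate(0.05)          // the knob that seemed to matter most for us
      .setFeatureSubsetStrategy("auto")

    val model = rf.fit(train)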

I find issues such as this one extremely demanding to debug because they 
generally cannot be reproduced locally. I guess you basically need to build 
Spark yourself with appropriate instrumentation added, and even that would 
probably require a very deep insight into Spark’s guts.
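
Short of rebuilding Spark, one lighter-weight option might be to hang a custom 
SparkListener on the context to log stage and task progress, which should at 
least narrow down where things stall. Again just a sketch, assuming a 
SparkSession named spark; the class name and log format are made up:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted,
      SparkListenerStageSubmitted, SparkListenerTaskEnd}

    // Logs coarse progress so a hang can be narrowed down to a stage or task.
    class ProgressListener extends SparkListener {
      override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit =
        println(s"stage ${event.stageInfo.stageId} submitted " +
          s"(${event.stageInfo.numTasks} tasks): ${event.stageInfo.name}")

      override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
        println(s"stage ${event.stageInfo.stageId} completed")

      override def onTaskEnd(event: SparkListenerTaskEnd): Unit =
        println(s"stage ${event.stageId} task ${event.taskInfo.taskId} on " +
          s"executor ${event.taskInfo.executorId} ended: ${event.reason}")
    }

    spark.sparkContext.addSparkListener(new ProgressListener)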

Hanging threads are, in my opinion, the worst possible behaviour of a program, 
so if anyone can shed some light on this or provide any debugging hints, it 
would be amazing.

Morten


> On 30 May 2017, at 19.29, Sumona Routh <sumos...@gmail.com> wrote:
> 
> Hi Morten,
> Were you able to resolve your issue with RandomForest? I am having similar 
> issues with a newly trained model (which does have a larger number of trees 
> and a smaller minInstancesPerNode - by design, to produce the best-performing 
> model). 
> 
> I wanted to get some feedback on how you solved your issue before I post a 
> separate question.
> 
> Thanks!
> Sumona
> 
> On Sun, Dec 11, 2016 at 4:10 AM Marco Mistroni <mmistr...@gmail.com> wrote:
> OK. Did you change Spark version? Java/Scala/Python version? 
> Have you tried different versions of any of the above?
> Hope this helps. 
> Kr
> 
> On 10 Dec 2016 10:37 pm, "Morten Hornbech" <mor...@datasolvr.com> wrote:
> I haven’t actually experienced any non-determinism. We have nightly 
> integration tests comparing output from random forests, with no variation.
> 
> The workaround we will probably try is to split the dataset, either randomly 
> or on one of the variables, and then train a forest on each partition, which 
> should then be sufficiently small.
> 
> I hope to be able to provide a good repro case in a few weeks. If the problem 
> turns out to be in our own code, I will also post that in this thread.
> 
> Morten
> 
>> On 10 Dec 2016, at 23.25, Marco Mistroni <mmistr...@gmail.com> wrote:
>> 
>> Hello Morten,
>> OK.
>> AFAIK there is a tiny bit of randomness in these ML algorithms (please, 
>> anyone correct me if I’m wrong).
>> In fact, if you run your RDF code multiple times, it will not give you 
>> EXACTLY the same results (though accuracy and errors should be more or less 
>> similar)... at least this is what I found when playing around with 
>> RDF, decision trees and other ML algorithms.
>> 
>> If RDF is not a must for your use case, could you try scaling back to 
>> decision trees and see if you still get intermittent failures?
>> That would at least exclude issues with the data.
>> 
>> hth 
>>  marco
>> 
>> On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech <mor...@datasolvr.com> wrote:
>> Already did. There are no issues with smaller samples. I am running this in 
>> a cluster of three t2.large instances on AWS.
>> 
>> I have tried to find the threshold where the error occurs, but there is no 
>> single factor causing it. Input size and subsampling rate seem to be the 
>> most significant, and the number of trees the least.
>> 
>> I have also tried running on a test frame of randomized numbers with the 
>> same number of rows, and could not reproduce the problem there.
>> 
>> By the way, maxDepth is 5 and maxBins is 32.
>> 
>> I will probably need to leave this for a few weeks to focus on more 
>> short-term stuff, but I will write here if I solve it or reproduce it more 
>> consistently.
>> 
>> Morten
>> 
>>> On 10 Dec 2016, at 17.29, Marco Mistroni <mmistr...@gmail.com> wrote:
>>> 
>>> Hi,
>>> Bring the samples back to the 1k range to debug... or, as suggested, reduce 
>>> trees and bins... I had RDF running on the same size of data with no 
>>> issues... or send me some sample code and data and I'll try it out on my 
>>> EC2 instance.
>>> Kr
>>> 
>>> On 10 Dec 2016 3:16 am, "Md. Rezaul Karim" <rezaul.ka...@insight-centre.org> wrote:
>>> I had a similar experience last week. I could not find any error trace either. 
>>> 
>>> Later on, I did the following to get rid of the problem: 
>>> i) I downgraded to Spark 2.0.0 
>>> ii) I decreased the values of maxBins and maxDepth 
>>> 
>>> Additionally, make sure that you set the featureSubsetStrategy to "auto" to 
>>> let the algorithm choose the best feature subset strategy for your data. 
>>> Finally, set the impurity to "gini" for the information gain.
>>> 
>>> However, setting the number of trees to just 1 gives you neither the real 
>>> advantage of the forest nor better predictive performance. 
>>> 
>>> 
>>> 
>>> Best, 
>>> Karim 
>>> 
>>> 
>>> On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:
>>> Hi
>>> 
>>> I have spent quite some time trying to debug an issue with the Random Forest
>>> algorithm on Spark 2.0.2. The input dataset is relatively large at around
>>> 600k rows and 200MB, but I use subsampling to make each tree manageable.
>>> However, even with only one tree and a low sample rate of 0.05, the job hangs
>>> at one of the final stages (see attached). I have checked the logs on all
>>> executors and the driver and found no trace of an error. Could it be a memory
>>> issue even though no error appears? The error does seem sporadic to some
>>> extent, so I also wondered whether it could be a data issue that only occurs
>>> if the subsample includes the bad data rows.
>>> 
>>> Please comment if you have a clue.
>>> 
>>> Morten
>>> 
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-hangs-without-trace-of-error-tp28192.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>> 
>> 
> 
