Hi Sumona,

I’m afraid I never really resolved the issue. Actually, I have just had to roll back an upgrade from 2.1.0 to 2.1.1 because it, for reasons unknown, reintroduced the issue in our nightly integration tests (see http://apache-spark-user-list.1001560.n3.nabble.com/Issue-upgrading-to-Spark-2-1-1-from-2-1-0-tc28660.html).
The “solution” for me at the time was to wave my magic Spark wand and hope for the best. That generally means:

- trying to increase memory, or to reduce the amount of memory required (smaller datasets, lower sample rate, more partitions, less caching)
- making more or less random changes to other parts of the pipeline, including SQL statements and adding or removing steps such as repartition/coalesce
- flipping various Spark configuration settings

In this specific case I think it was the subsampling rate that did the trick (a rough sketch of these knobs is below). I find issues like this one extremely demanding to debug because they generally cannot be reproduced locally. I guess you basically need to build Spark yourself with appropriate instrumentation added, and even that would probably require very deep insight into Spark’s guts. Hanging threads are, in my opinion, the worst possible behaviour of a program, so if anyone can shed some light on this or provide any debugging hints it would be amazing.
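To make that concrete, the knobs I am talking about look roughly like this with the Scala ML API. This is only a sketch: the input path, the column names and every parameter value below are placeholders, not our actual job.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestKnobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rf-knobs-sketch")
      // one of the memory knobs; in practice often passed via spark-submit instead
      .config("spark.executor.memory", "6g")
      .getOrCreate()

    // Placeholder input: a DataFrame with a numeric "label" column and a
    // "features" vector column, e.g. assembled earlier with a VectorAssembler.
    val training = spark.read.parquet("/path/to/training.parquet")
      .repartition(200) // more, smaller partitions

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(20)
      .setSubsamplingRate(0.05) // lower sample rate per tree
      .setMaxDepth(5)
      .setMaxBins(32)

    val model = rf.fit(training)
    println(s"Trained a forest of ${model.trees.length} trees")
    spark.stop()
  }
}

None of this is a real fix, of course; it just shrinks the per-tree workload until the hang stops appearing.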
Morten

> On 30 May 2017, at 19.29, Sumona Routh <sumos...@gmail.com> wrote:
>
> Hi Morten,
> Were you able to resolve your issue with RandomForest? I am having similar issues with a newly trained model (which does have a larger number of trees and a smaller minInstancesPerNode, by design, to produce the best performing model).
>
> I wanted to get some feedback on how you solved your issue before I post a separate question.
>
> Thanks!
> Sumona
>
> On Sun, Dec 11, 2016 at 4:10 AM Marco Mistroni <mmistr...@gmail.com> wrote:
> OK. Did you change the Spark version? The Java/Scala/Python version?
> Have you tried different versions of any of the above?
> Hope this helps
> Kr
>
> On 10 Dec 2016 10:37 pm, "Morten Hornbech" <mor...@datasolvr.com> wrote:
> I haven’t actually experienced any non-determinism. We have nightly integration tests comparing output from random forests with no variations.
>
> The workaround we will probably try is to split the dataset, either randomly or on one of the variables, and then train a forest on each partition, which should then be sufficiently small.
>
> I hope to be able to provide a good repro case in a few weeks. If the problem turns out to be in our own code I will also post that in this thread.
>
> Morten
>
>> On 10 Dec 2016, at 23.25, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>> Hello Morten,
>> OK. As far as I know there is a tiny bit of randomness in these ML algorithms (please anyone correct me if I am wrong). In fact, if you run your RDF code multiple times it will not give you EXACTLY the same results (though accuracy and errors should be more or less similar); at least that is what I found when playing around with RDF, decision trees and other ML algorithms.
>>
>> If RDF is not a must for your use case, could you try scaling back to decision trees and see if you still get intermittent failures? That would at least exclude issues with the data.
>>
>> hth
>> marco
>>
>> On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech <mor...@datasolvr.com> wrote:
>> Already did. There are no issues with smaller samples. I am running this on a cluster of three t2.large instances on AWS.
>>
>> I have tried to find the threshold where the error occurs, but there is no single factor causing it. Input size and subsampling rate seem to be the most significant, and the number of trees the least.
>>
>> I have also tried running on a test frame of randomized numbers with the same number of rows, and could not reproduce the problem there.
>>
>> By the way, maxDepth is 5 and maxBins is 32.
>>
>> I will probably need to leave this for a few weeks to focus on more short-term stuff, but I will write here if I solve it or reproduce it more consistently.
>>
>> Morten
>>
>>> On 10 Dec 2016, at 17.29, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>
>>> Hi,
>>> Bring the samples back to the 1k range to debug, or, as suggested, reduce the trees and bins. I have had RDF running on data of the same size with no issues. Or send me some sample code and data and I will try it out on my EC2 instance.
>>> Kr
>>>
>>> On 10 Dec 2016 3:16 am, "Md. Rezaul Karim" <rezaul.ka...@insight-centre.org> wrote:
>>> I had a similar experience last week. Even I could not find any error trace.
>>>
>>> Later on, I did the following to get rid of the problem:
>>> i) I downgraded to Spark 2.0.0
>>> ii) I decreased the values of maxBins and maxDepth
>>>
>>> Additionally, make sure that you set the featureSubsetStrategy to "auto" to let the algorithm choose the best feature subset strategy for your data. Finally, set the impurity to "gini" for the information gain.
>>>
>>> However, setting the number of trees to just 1 gives you neither the real advantage of a forest nor better predictive performance.
>>>
>>> Best,
>>> Karim
>>>
>>> On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:
>>> Hi,
>>>
>>> I have spent quite some time trying to debug an issue with the Random Forest algorithm on Spark 2.0.2. The input dataset is relatively large at around 600k rows and 200MB, but I use subsampling to make each tree manageable. However, even with only 1 tree and a low sample rate of 0.05 the job hangs at one of the final stages (see attached). I have checked the logs on all executors and the driver and found no trace of an error. Could it be a memory issue even though no error appears? The error does seem sporadic to some extent, so I also wondered whether it could be a data issue that only occurs when the subsample includes the bad data rows.
>>>
>>> Please comment if you have a clue.
>>>
>>> Morten
>>>
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
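PS: the split-and-train workaround mentioned in the quoted thread (training a separate forest on each piece of the data so that every individual job stays small) would look roughly like the sketch below. As above, the input path, column names, split sizes and parameter values are placeholders rather than a tested setup.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

object SplitAndTrainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-and-train-sketch").getOrCreate()

    // Placeholder input, shaped like the thread describes: a few hundred thousand
    // rows with "label" and "features" columns already assembled.
    val data = spark.read.parquet("/path/to/training.parquet")

    // Split randomly into smaller pieces that each train reliably on their own.
    val parts = data.randomSplit(Array(0.25, 0.25, 0.25, 0.25), seed = 42L)

    val models: Array[RandomForestClassificationModel] = parts.map { part =>
      new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setFeatureSubsetStrategy("auto") // let Spark pick the feature subset strategy
        .setImpurity("gini")
        .setMaxDepth(5)
        .setMaxBins(32)
        .setNumTrees(10)
        .fit(part.cache())
    }

    println(s"Trained ${models.length} forests")
    spark.stop()
  }
}

What the sketch leaves open is how to combine the per-split forests at prediction time, for example by averaging their predicted class probabilities.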