Hi Morten,

Were you able to resolve your issue with RandomForest? I am having similar issues with a newly trained model (which does have a larger number of trees and a smaller minInstancesPerNode, by design, to produce the best-performing model).

I wanted to get some feedback on how you solved your issue before I post a separate question. Thanks!

Sumona
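For context, the setup I am describing is roughly the following (a minimal sketch assuming Spark ML's RandomForestClassifier; the column names, the trainingData DataFrame and the parameter values are placeholders, not my actual configuration):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Illustrative values only: more trees and fewer instances per node than the defaults.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")             // placeholder column name
      .setFeaturesCol("features")       // placeholder column name
      .setNumTrees(100)                 // larger number of trees
      .setMinInstancesPerNode(1)        // smaller minInstancesPerNode
    val model = rf.fit(trainingData)    // trainingData: an assumed DataFrame with label/features columns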
On Sun, Dec 11, 2016 at 4:10 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> OK. Did you change the Spark version? The Java/Scala/Python version?
> Have you tried with different versions of any of the above?
> Hope this helps
> Kr
>
> On 10 Dec 2016 10:37 pm, "Morten Hornbech" <mor...@datasolvr.com> wrote:
>
>> I haven't actually experienced any non-determinism. We have nightly
>> integration tests comparing output from random forests with no variations.
>>
>> The workaround we will probably try is to split the dataset, either
>> randomly or on one of the variables, and then train a forest on each
>> partition, which should then be sufficiently small.
>>
>> I hope to be able to provide a good repro case in a few weeks. If the
>> problem was in our own code I will also post it in this thread.
>>
>> Morten
>>
>> On 10 Dec 2016, at 23:25, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>> Hello Morten,
>> OK. As far as I know there is a tiny bit of randomness in these ML
>> algorithms (please correct me if I am wrong). In fact, if you run your
>> RDF code multiple times it will not give you EXACTLY the same results
>> (though accuracy and errors should be more or less similar)... at least
>> this is what I found when playing around with RDF, decision trees and
>> other ML algorithms.
>>
>> If RDF is not a must for your use case, could you try to "scale back" to
>> decision trees and see if you still get intermittent failures?
>> This would at least exclude issues with the data.
>>
>> hth
>> marco
>>
>> On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech <mor...@datasolvr.com> wrote:
>>
>>> Already did. There are no issues with smaller samples. I am running this
>>> in a cluster of three t2.large instances on AWS.
>>>
>>> I have tried to find the threshold where the error occurs, but it is not
>>> a single factor causing it. Input size and subsampling rate seem to be
>>> the most significant, and the number of trees the least.
>>>
>>> I have also tried running on a test frame of randomized numbers with the
>>> same number of rows, and could not reproduce the problem there.
>>>
>>> By the way, maxDepth is 5 and maxBins is 32.
>>>
>>> I will probably need to leave this for a few weeks to focus on more
>>> short-term stuff, but I will write here if I solve it or reproduce it
>>> more consistently.
>>>
>>> Morten
>>>
>>> On 10 Dec 2016, at 17:29, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>
>>> Hi,
>>> Bring the samples back to the 1k range to debug... or, as suggested,
>>> reduce the number of trees and bins. I had RDF running on data of the
>>> same size with no issues... or send me some sample code and data and I
>>> will try it out on my EC2 instance.
>>> Kr
>>>
>>> On 10 Dec 2016 3:16 am, "Md. Rezaul Karim" <rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> I had a similar experience last week. Even I could not find any error
>>>> trace.
>>>>
>>>> Later on, I did the following to get rid of the problem:
>>>> i) I downgraded to Spark 2.0.0
>>>> ii) I decreased the values of maxBins and maxDepth
>>>>
>>>> Additionally, make sure that you set the featureSubsetStrategy to
>>>> "auto" to let the algorithm choose the best feature subset strategy
>>>> for your data. Finally, set the impurity to "gini" for the information
>>>> gain.
>>>>
>>>> However, setting the number of trees to just 1 gives you neither the
>>>> real advantage of the forest nor better predictive performance.
>>>>
>>>> Best,
>>>> Karim
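In code, those suggestions would look roughly like this (a sketch assuming Spark ML's RandomForestClassifier; the reduced maxDepth and maxBins values are examples only, not tuned recommendations):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setFeatureSubsetStrategy("auto")  // let the algorithm choose the feature subset strategy
      .setImpurity("gini")               // gini impurity for the information gain
      .setMaxDepth(4)                    // decreased maxDepth (example value)
      .setMaxBins(16)                    // decreased maxBins (example value)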
>>>>
>>>> On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have spent quite some time trying to debug an issue with the Random
>>>>> Forest algorithm on Spark 2.0.2. The input dataset is relatively large,
>>>>> at around 600k rows and 200MB, but I use subsampling to make each tree
>>>>> manageable. However, even with only 1 tree and a low sample rate of
>>>>> 0.05, the job hangs at one of the final stages (see attached). I have
>>>>> checked the logs on all executors and the driver and found no traces of
>>>>> error. Could it be a memory issue even though no error appears? The
>>>>> error does seem sporadic to some extent, so I also wondered whether it
>>>>> could be a data issue that only occurs if the subsample includes the
>>>>> bad data rows.
>>>>>
>>>>> Please comment if you have a clue.
>>>>>
>>>>> Morten
>>>>>
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-hangs-without-trace-of-error-tp28192.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
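P.S. The workaround Morten mentions earlier in the thread (splitting the dataset and training a forest on each partition) could look roughly like this, assuming Spark ML and a purely random split; the number of splits, the seed and all parameter values are illustrative:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Split the input into smaller pieces and train one forest per piece.
    val parts = trainingData.randomSplit(Array(0.25, 0.25, 0.25, 0.25), 42L)
    val forests = parts.map { part =>
      new RandomForestClassifier()
        .setLabelCol("label")        // placeholder column names
        .setFeaturesCol("features")
        .setNumTrees(10)             // illustrative value
        .fit(part)
    }
    // Predictions from the individual forests would then need to be combined
    // manually (e.g. by majority vote); Spark does not do this automatically.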