If I understand your situation correctly, you may be able to use the "strata" and "sampsize" arguments of randomForest() to draw bootstrap samples whose class proportions resemble those of the original data. Together they specify a stratified sample: "strata" is the (factor) variable to stratify on, and "sampsize" gives the number of cases to draw from each stratum.
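As a rough sketch of what I mean (the data below are simulated, and the object names and sampsize values are illustrative assumptions only, not a recommendation for your data), stratifying on the response and drawing about 11% "Yes" cases per tree might look like this:

```r
library(randomForest)

set.seed(1)

## Simulated stand-in for the reduced data: 1500 records,
## 14 Yes/No comorbidity flags, and roughly 38% adopters.
n <- 1500
x <- as.data.frame(lapply(1:14, function(i) {
  factor(sample(c("No", "Yes"), n, replace = TRUE), levels = c("No", "Yes"))
}))
names(x) <- paste0("comorb", 1:14)
y <- factor(rep(c("No", "Yes"), times = c(930, 570)), levels = c("No", "Yes"))

## Number of cases drawn per stratum for each tree, given in the order of
## levels(y). 445:55 gives about 11% "Yes" in every bootstrap sample.
## These counts are assumed for illustration; keep each one no larger than
## the number of cases available in that class.
sampsize <- c(No = 445, Yes = 55)

fit <- randomForest(x = x, y = y,
                    strata = y,          # stratify on the response
                    sampsize = sampsize, # per-stratum sample sizes
                    ntree = 200)
print(fit)
```

This only controls the class mix that each tree sees; it does not use the frequency counts of the individual records.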
Best,
Andy

From: Raghu Naik
>
> Folks,
>
> I have a query around weighting in Random Forest (RF). I know that
> several earlier emails in this group have raised this issue, but I did
> not find an answer to my query.
>
> I am working on a dataset (dataset1) that consists of 4 million records
> that can be reduced to a dataset (dataset2) of approximately 1500 unique
> records with frequency counts that add up to the 4 million records
> number as above. Because of size issues, I cannot work with dataset1 in
> R and therefore, I am working with dataset2.
>
> Each record consists of whether or not a patient chose a particular drug
> based on 14 comorbidity (Yes / No) variables; I am using RF to
> understand the comorbidity drivers of drug adoption (yes/no)
> classification.
>
> At full dataset level (dataset1), the drug adoption incidence is ~11%.
> At the reduced dataset dataset2 level, the drug adoption incidence
> increases to ~38%.
>
> My question is that, if I am using the reduced dataset (dataset2), how
> should I inform RF that the adoption incidence at the full dataset level
> was 11%. Should that be used as a classwt prior with
> classwt=c(Yes=.11, No=.89)? My understanding is that RF does not allow
> case weighting.
> Or can this be handled with the sampsize argument through oversampling?
> What proportions should one use for this (e.g.,
> sampsize=c(Yes=100, No=100))?
>
> I would appreciate any feedback or pointers to any earlier thread that I
> may have overlooked.
>
> Regards,
>
> Raghu

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.