Hi, I have asked this question elsewhere but did not get any response, so I am hoping to get some insight from the experts and statisticians here.
Let's say we are fitting a regression equation where one explanatory variable is categorical with 2 categories. However, in the sample, one category has 95% of the values and the other category has just 5%. That is, the categories are highly unbalanced. Typically, the SE of the coefficient estimate may be inflated for such a highly unbalanced categorical explanatory variable.

Such an unbalanced case may arise from 2 scenarios: 1) there is a flaw in the sample, or it is just by chance that the second category has only 5% of the values in the sample, or 2) in the population itself, the second category has a very small number of occurrences, which is reflected in the sample.

My question is: how would the SE be impacted in the above 2 cases? Will the impact be the same, i.e. would we get an incorrect estimate of the SE in both cases? If yes, is there any way to prove this analytically, or perhaps based on simulation?

My apologies, as this question is not directly R related. However, I just wanted to get some insight on the above statistics problem from some of the great statisticians in this forum.

Thanks for your time.
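
To make the setup concrete, here is a minimal simulation sketch in Python (standard library only). It rests on assumptions that are not in the question itself: a single 0/1 dummy as the only regressor, independent normal errors with constant variance, and a split between the categories that is held fixed across replications. Under those assumptions the textbook variance of the dummy's coefficient is sigma^2 * (1/n0 + 1/n1), where n0 and n1 are the two category counts. The names fit_binary_slope and simulate, and the values n = 200, beta = 1 and sigma = 1, are hypothetical choices for illustration only.

```python
import math
import random
import statistics


def fit_binary_slope(x, y):
    """OLS of y on an intercept and a 0/1 dummy x.

    Returns (slope, model_based_se). With a single dummy, the slope is the
    difference in group means and its SE is s * sqrt(1/n0 + 1/n1).
    """
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    n0, n1 = len(y0), len(y1)
    m0, m1 = statistics.fmean(y0), statistics.fmean(y1)
    # Residual sum of squares around the two group means (n - 2 d.o.f.).
    rss = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
    s2 = rss / (n0 + n1 - 2)
    return m1 - m0, math.sqrt(s2 * (1 / n0 + 1 / n1))


def simulate(n, n1, beta=1.0, sigma=1.0, reps=2000, seed=1):
    """Repeat the fit with a fixed design of n1 ones and n - n1 zeros.

    Returns (empirical SD of the slopes, mean model-based SE, theoretical SE).
    """
    rng = random.Random(seed)
    x = [1] * n1 + [0] * (n - n1)
    slopes, ses = [], []
    for _ in range(reps):
        # Assumed data-generating process: y = beta * x + normal noise.
        y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
        b, se = fit_binary_slope(x, y)
        slopes.append(b)
        ses.append(se)
    theory = sigma * math.sqrt(1 / n1 + 1 / (n - n1))
    return statistics.stdev(slopes), statistics.fmean(ses), theory


if __name__ == "__main__":
    for label, n1 in (("50/50 split", 100), ("95/5 split", 10)):
        emp, model, theory = simulate(n=200, n1=n1)
        print(f"{label}: empirical SD={emp:.3f}, "
              f"mean SE={model:.3f}, theory={theory:.3f}")
```

Comparing the empirical spread of the estimates with the average reported SE, for a balanced and an unbalanced split, is one way to check whether the SE is merely larger or actually misestimated. Note that this sketch conditions on the observed split, so by itself it does not distinguish scenario 1 from scenario 2; modelling how the 5% share arose (sampling flaw versus a rare population category) would need an additional, explicitly stated sampling step.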