>>>>> Richard O'Keefe >>>>> on Tue, 19 Oct 2021 14:22:53 +1300 writes:
> It *sounds* as though you are trying to impute missing data.
> There are better approaches than just plugging in means.
> You might want to look into CALIBERrfimpute or missForest.

Yes, indeed!

Put even more strongly: "Imputation" has been an important topic
for decades, and it has been shown since the 1980s that plugging
in column means can be *very misleading* for everything you do
later with that modified data set.

The Wikipedia page is quite good as a short intro:
  https://en.wikipedia.org/wiki/Imputation_(statistics)

When I've been teaching about this, I've strongly recommended
multiple imputation and the "state-of-the-art" package 'mice',
which comes with a really good textbook:

  Stef van Buuren (2012) -- Flexible Imputation of Missing Data
  https://doi.org/10.1201/b11826
  (= reference [12] in the Wikipedia article)

where in the first chapter you see a nice example of how bad mean
imputation typically will be.

The JSS paper on mice is more technical (I'd say "to be used once
you are already aware that 'mean imputation' should rarely be used"):

  > citation(package="mice")

  To cite mice in publications use:

    Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice:
    Multivariate Imputation by Chained Equations in R. Journal of
    Statistical Software, 45(3), 1-67.
    URL https://www.jstatsoft.org/v45/i03/.

Best regards,
Martin Maechler
ETH Zurich  and  R Core team


> On Tue, 19 Oct 2021 at 01:39, Admire Tarisirayi Chirume
> <atchir...@gmail.com> wrote:
>>
>> Good day colleagues. Below is a csv file attached which I am using in my
>> analysis.
>>
>> household.id  hd17.perm  hd17employ  health.exp  total.food.exp  total.nfood.exp
>>            1          2         yes        1654           23654            23655
>>            2          2         yes          NA              NA            65984
>>            3          6          no        2547          123311            52416
>>            4          8          NA        2365           13648            12544
>>            5          6          NA        1254           36549            12365
>>            6          8         yes        1236          236541            26522
>>            7          8          no          NA           13264            23698
>>
>> So I created a df using the above, and it's a csv file, as follows:
>>
>> wbpractice <- read.csv("world_practice.csv")
>>
>> Now I am doing data cleaning and trying to replace all missing values with
>> the averages of the respective columns.
>>
>> The dimension of the actual dataset is:
>>
>> dim(wbpractice)
>> [1] 31998 6
>>
>> I used the following script, which I executed, but I got some error messages:
>>
>> for(i in 1:ncol(wbpractice)){
>>   wbpractice[is.na(wbpractice[,i]), i] <- mean(wbpractice[,i],
>>                                                na.rm = TRUE)
>> }
>>
>> Any help to replace all NAs with average values in my dataframe?
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
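
On the loop in the quoted question: the messages most likely come from hd17employ, which holds yes/no text, so mean() on that column returns NA with a warning. If column means are wanted despite the caveats above, a minimal base-R sketch restricts the replacement to numeric columns. The small data frame is retyped from the table in the question (7 rows, not the full file), and impute_col_means is a hypothetical helper name, not something from the thread. The last two lines illustrate the caveat itself: a mean-filled column has a smaller standard deviation than the observed values.

```r
# Toy data retyped from the table in the question (7 of the 31998 rows).
wbpractice <- data.frame(
  household.id    = 1:7,
  hd17.perm       = c(2, 2, 6, 8, 6, 8, 8),
  hd17employ      = c("yes", "yes", "no", NA, NA, "yes", "no"),
  health.exp      = c(1654, NA, 2547, 2365, 1254, 1236, NA),
  total.food.exp  = c(23654, NA, 123311, 13648, 36549, 236541, 13264),
  total.nfood.exp = c(23655, 65984, 52416, 12544, 12365, 26522, 23698)
)

# Hypothetical helper: fill NAs with the column mean, numeric columns only.
impute_col_means <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  df
}

filled <- impute_col_means(wbpractice)

# Non-numeric columns are left alone, so hd17employ keeps its NAs.
colSums(is.na(filled))

# Mean-filling shrinks the spread: compare the two standard deviations.
sd(wbpractice$health.exp, na.rm = TRUE)
sd(filled$health.exp)
```

The text column still needs its own treatment (for example a model-based method), which is one more reason to prefer the packages mentioned in the replies.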
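
For the multiple-imputation route recommended in the reply, a rough sketch with mice could look as follows. It assumes the mice package is installed and uses simulated data rather than the poster's file; the variable names, m = 5, predictive mean matching ("pmm"), the seed, and the regression model are all illustrative assumptions, not choices made in the thread.

```r
# Assumes the 'mice' package is installed; data are simulated for illustration.
library(mice)

set.seed(1)
n <- 200
dat <- data.frame(food = rnorm(n, mean = 50, sd = 10))
dat$health <- 5 + 0.3 * dat$food + rnorm(n, sd = 3)
dat$health[sample(n, 40)] <- NA      # knock out 40 values at random

# Create m = 5 completed data sets by predictive mean matching.
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same model in every completed data set, then pool the results.
fit    <- with(imp, lm(health ~ food))
pooled <- pool(fit)
summary(pooled)

# One completed data set, if a single filled-in table is needed for inspection.
completed1 <- complete(imp, 1)
```

The analysis is run on each completed data set and then pooled, so the uncertainty due to the missing values carries through to the standard errors instead of being hidden by a single filled-in value.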