# Perhaps I misunderstand your original need, but....
## I added a few lines to your data and used dput() to get the data below (which I named "df")
df <- structure(list(
    age = c(15L, 20L, 15L, 10L, 10L, 12L, 17L, 17L, 11L, 12L, 16L,
            20L, 23L, 14L, 22L, 16L, 10L, 11L, 21L, 10L, 13L, 17L),
    sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
                      1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L),
                    .Label = c("f", "m"), class = "factor"),
    class = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
                        1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L),
                      .Label = c("high", "low"), class = "factor")),
  .Names = c("age", "sex", "class"), class = "data.frame",
  row.names = c(NA, -22L))

## the following line uses which(), sample(), and rbind(), along with some indexing, to get a new data frame;
## see ?which, ?sample, and ?rbind for more info
# For the "indexing", play with it: type in df[1:3,1:2] as an example
new_df <- rbind(df[sample(which(df$class=="low"), 4),],
                df[sample(which(df$class=="high"), 4),])

Now replace 4 with the number of rows you want from each class.

hgwelec wrote:
>
> Thank you for your answer.
>
> The problem is that I am learning R now, so I do not know how I could do
> this.
>
> I have found the following code, but unfortunately it does not work
> (it is meant to create a distribution of 0.1 "low" class and 0.9 "high"):
>
> data[c(rownames(data.df[data.df$class=="high",]),
> sample(rownames(data[data.df$class=="low"]), 0.1)) , ]
>

Dear members,

Consider the following data frame (first 4 rows shown):

age sex class
15  m   low
20  f   high
15  f   low
10  m   low

In my original data set I have 1200 rows and a class distribution of low=0.3 and high=0.7.

My question: how can I create a new data frame like the one shown above, but with the 'high' class subsampled so that the class distribution in the new data frame is low=0.5 and high=0.5?
I tried looking at the sample() function and its prob option, but none of the examples I have seen deal with an imbalanced class problem like the one shown above.

Thank you in advance

--
View this message in context: http://r.789695.n4.nabble.com/Subsampling-oversampling-from-a-data-frame-tp3965771p3971840.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
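
For the 50/50 case asked about in the original question, the same which()/sample() idea can be wrapped so that the sample size is taken from the data instead of being typed in by hand. What follows is only a minimal sketch, not code from the thread: the helper name balance_classes() and the toy data frame are made up for illustration, and the toy data merely imitates the 1200-row, 0.3/0.7 split described in the question.

```r
# Hypothetical helper (not from the original thread): downsample every class
# to the size of the smallest one, so all classes end up equally frequent.
balance_classes <- function(data, class_col = "class") {
  # Row indices grouped by class label
  idx_by_class <- split(seq_len(nrow(data)), data[[class_col]], drop = TRUE)
  # Size of the smallest class; every class is cut down to this size
  n_min <- min(lengths(idx_by_class))
  # Draw n_min row indices per class, without replacement.
  # Indexing via sample.int() avoids the surprise that sample(x, n)
  # samples from 1:x when x is a single number.
  keep <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_min)]
  }), use.names = FALSE)
  data[keep, , drop = FALSE]
}

# Toy data (an assumption) imitating the 0.3 / 0.7 split from the question
set.seed(1)
toy <- data.frame(
  age   = sample(10:23, 1200, replace = TRUE),
  sex   = factor(sample(c("f", "m"), 1200, replace = TRUE)),
  class = factor(rep(c("low", "high"), times = c(360, 840)))
)

balanced <- balance_classes(toy)
table(balanced$class)
```

With this toy data all 360 "low" rows are kept and 360 of the 840 "high" rows are drawn at random, so the result has 720 rows. Rows come back grouped by class; shuffle them afterwards with balanced[sample(nrow(balanced)), ] if the order matters.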