Thanks to both of you for the comments and suggestions. Over the next couple of days I plan to work through my simple problem using the help offered in this forum.
________________________________
From: David Winsemius <dwinsem...@comcast.net>
To: Bert Gunter <gunter.ber...@gene.com>
Sent: Mon, December 21, 2009 2:31:26 PM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame

On Dec 21, 2009, at 1:01 PM, Bert Gunter wrote:

> Didn't read this thread in detail, so the following suggestion may just be
> nonsense... (caveat emptor), but:
>
> To sample from a data frame or matrix, sample from the row indices and then
> extract what you want from the sampled rows. Or sample directly from
> individual columns if that suffices. In general,
>
> ?sample
>
> on appropriate indices of object in question.
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
> Behalf Of Adam Carr
> Sent: Monday, December 21, 2009 9:53 AM
> To: David Winsemius
> Cc: r-help@r-project.org
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
>
> Good Afternoon Dr. Winsemius:
>
> You ask some very good questions and make excellent points; my responses are
> below. I've tried to extract your questions and provide answers just to
> reduce the clutter.
>
> 1. You might want to provide statistical justification for the otherwise
> puzzling sampling strategy.
>
> I assume you mean my overall process of random sampling from a large data
> set. The data set is comprised of observations collected over four years.
> Although the basis for sampling would make a good four-frame Dilbert cartoon
> if it could be condensed enough, my answer begins with the unfortunate truth
> that there is a great divide between the technical and marketing groups at
> the business where I am employed. Many powerful marketing executives, some
> with technical backgrounds, feel that there is something fundamentally wrong
> with the manufacturing process because the data generated over the long term
> is not approximately normally distributed.
> My task was to examine this set
> of data, trying to keep the representation of Y, N and F approximately equal
> in the sample when compared to the large set, to determine if any subset
> exhibits the holy grail-like normal distribution characteristics. I don't
> feel that this is statistical justification, but it is the
> reason why I am doing this.
>
> 2. It would help if you explained what you are attempting here in ordinary
> English. There are 10 elements in mysamples, each of which is a 100 x 5
> dataframe, and mat is just one 100 x 5 matrix, which you seem to be
> referencing incorrectly, given the fact that it has two, rather than one,
> dimension. Furthermore, those dataframes may not be of a uniform class,
> since you said you had character variable. Do you really want these all in a
> character type matrix, which would be what is likely to happen given R's
> requirement that matrix element be of only one class? What you say above
> suggests not.
>
> It seems from your response that I incorrectly assumed that a list is not
> the same as a data frame. I started down this path after reading the
> questions and answers to a similar problem where the r-help responder
> suggested a two step process and said that the list must be converted to
> another form in order to be available for analysis.

A data.frame is a special type of list. You can also make lists of dataframes (just as you can make lists of lists), which I thought the first portion of your code would have done:

    mysamples <- list()
    for (i in 1:10) {
      mysamples[[i]] <- dataset[sample(1:1637, 100,
          prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197),
                   rep(1227.8/1637, 927)),
          replace = TRUE), ]
    }

Each element in that list would have been a subset of your larger data.frame and would itself have been a data.frame.

>
> And you are absolutely correct that I do not want each sample in a character
> type matrix.
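A minimal sketch of the list-of-data-frames approach discussed above, combining the advice to sample row indices with the loop already shown. The small `dataset`, its column names, the group sizes and the weight values are all made up for illustration; they only stand in for the real 1637-row data.

```r
set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the real data: four numeric columns and one
# three-level character indicator, with rows ordered by the indicator.
n_by_level <- c(Y = 12, N = 6, F = 12)
dataset <- data.frame(
  tensile    = rnorm(30, mean = 60, sd = 5),
  yield      = rnorm(30, mean = 40, sd = 4),
  elongation = rnorm(30, mean = 20, sd = 2),
  hardness   = rnorm(30, mean = 80, sd = 6),
  indicator  = rep(names(n_by_level), times = n_by_level),
  stringsAsFactors = FALSE
)

# One sampling weight per row, constant within each indicator level.
# The weight values are assumptions, not the ones from the thread.
row_weights <- rep(c(1, 1.5, 7.5), times = n_by_level)

n_samples   <- 10
sample_size <- 8

# Sample row indices, then extract the sampled rows; every list element
# is itself a data frame with the original column types intact.
mysamples <- vector("list", n_samples)
for (i in seq_len(n_samples)) {
  idx <- sample(seq_len(nrow(dataset)), sample_size,
                replace = TRUE, prob = row_weights)
  mysamples[[i]] <- dataset[idx, ]
}
```

Using `nrow(dataset)` instead of a hard-coded 1637 keeps the same code usable on both the small test object and the full data set.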
>
> In plain English, I hope, I am simply trying to iterate the process of
> removing random samples from the large data set, and then saving these
> samples in a format that is available for simple analysis. For example, if I
> remove five hundred mysample sets, each of which is composed of a 100 x 5
> sample of the large data set, I am interested in determining the skewness,
> kurtosis, mean and standard deviation of each of the four numeric variables
> in each of the five hundred mysample sets.

So make a small dataframe with variables (columns) of the same type as in your real data, maybe 25-30 rows in "extent" (not "length", since for a dataframe, the length() function returns the number of columns).

>
> 3. Sorting out such problems is best done with smaller test objects. I was
> surprised to see...type character.
>
> I agree. I began to do this with a small test data set but it was late last
> evening and I realized that I should ask for help before proceeding on what
> I thought might be incorrect assumptions. I clearly misunderstood that a
> list needed to be converted to a data frame in order to be available for
> analysis.

Well, if each list element is already a data.frame then no conversions are needed. The lapply function can be used to "loop" over a list, and you can define a function that will only look at particular components of those elements. There are also functions in packages that automate the process. The describe function in Hmisc looks at each column and decides what type it is.

>
> Thank you for taking the time to respond. The discussion and suggestions are
> very helpful.
>
> Adam
>
> ________________________________
> From: David Winsemius <dwinsem...@comcast.net>
>
> Cc: r-help@r-project.org
> Sent: Mon, December 21, 2009 11:23:43 AM
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
>
>
> On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:
>
>> Good Morning:
>>
>> I've read many, many posts on the r-help system and I feel compelled to
>> quickly admit that I am relatively new to R. I do have several reference
>> books around me, but I cannot count myself among the fortunate who seem to
>> have strong programming intuition.
>>
>> I have a data set consisting of 1637 observations of five variables:
>> tensile strength, yield strength, elongation, hardness and a character
>> indicator with three levels: (Y)es, (N)o, and (F)ail.
>>
>> My objective is to randomly sample various subsets from this data set and
>> then evaluate these subsets using simple parameters, among them tests for
>> normality, shape and skewness. The data set is ordered by the character
>> variable prior to sampling, and the samples are weighted to mirror
>> representation in an overall, physical process.
>>
>> I am sampling the data set using this code:
>>
>> sample <- dataset[sample(1:1637, 500,
>>     prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),
>>     replace = TRUE),]
>>
>> What I would like to do is iterate this process to create many (say 500 or
>> more) sampled sets of n=500 and then evaluate each set for the parameters of
>> interest. I would actually be evaluating each variable within each subset
>> for my characteristic of interest. I am familiar with sampling and saving
>> single columns of data to do this sort of thing, but I am not sure how to
>> accomplish this with a multiple-variable data set.
>>
>> For example, I am currently iterating this using a clunky process:
>>
>> mysamples<-list()
>> for (i in 1:10){
>>   mysamples[[i]] <- dataset[
>>     sample(1:1637,100,
>>       prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),
>>       replace = TRUE), ]
>> }
>>
>
> Using lists to store intermediate results is not considered clunky in R.
> (You might want to provide statistical justification for the otherwise
> puzzling sampling strategy.)
>
>> But this leaves me with the additional task of defining each mysample[i]
>> iteration and converting it to a form on which I can apply a standard
>> statistical test like mean() or skewness() to the variable columns within
>> each subset. I have attempted to iteratively convert these lists using this
>> code:
>>
>> mat<-matrix(nrow=100,ncol=5)
>> for (i in 1:length(mysamples))
>> {mat[i]<-do.call('rbind',mysamples[i])}
>
> It would help if you explained what you are attempting here in ordinary
> English. There are 10 elements in mysamples, each of which is a 100 x 5
> dataframe, and mat is just one 100 x 5 matrix, which you seem to be
> referencing incorrectly, given the fact that it has two, rather than one,
> dimension. Furthermore, those dataframes may not be of a uniform class,
> since you said you had character variable. Do you really want these all in a
> character type matrix, which would be what is likely to happen given R's
> requirement that matrix element be of only one class? What you say above
> suggests not.
>
>>
>> but running the code generates the error message: number of items to
>> replace is not a multiple of replacement length.
>
> Because of the way you are referencing the matrix, probably. If you wanted a
> 10 x 100 x 5 array, then create an array. In R, as far as I can tell anyway,
> matrices are necessarily of 2 dimensions. Tables and arrays can be of higher
> dimension.
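If a three-dimensional array really is wanted, one possible sketch follows. It is not taken from the thread: the toy `mysamples` list, the column names and the choice to keep only the numeric columns (so that the array stays numeric rather than character) are all assumptions made for illustration.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical list of 10 small samples, each a data frame with two
# numeric columns and one character column.
rows_per_sample <- 6
mysamples <- lapply(1:10, function(i) {
  data.frame(
    tensile   = rnorm(rows_per_sample),
    hardness  = rnorm(rows_per_sample),
    indicator = sample(c("Y", "N", "F"), rows_per_sample, replace = TRUE),
    stringsAsFactors = FALSE
  )
})

# Keep only the numeric columns; including the character column would
# force every element of the array to type character.
numeric_cols <- c("tensile", "hardness")

# samples x rows x variables
arr <- array(NA_real_,
             dim = c(length(mysamples), rows_per_sample, length(numeric_cols)),
             dimnames = list(NULL, NULL, numeric_cols))

# Fill one whole 2-d slice per sample; note the three index positions,
# unlike the single index used in mat[i].
for (i in seq_along(mysamples)) {
  arr[i, , ] <- data.matrix(mysamples[[i]][, numeric_cols])
}

dim(arr)
```

With this layout, `arr[3, , "tensile"]` would give the tensile values of the third sample. For plain descriptive statistics the list of data frames is usually enough, so the array is only worth building if something downstream needs it.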
>
>> I have tried unsuccessfully, by reading many, many helpful r-help emails
>> on this error, to understand my probably obvious mistake.
>
> Sorting out such problems is best done with smaller test objects. I was
> surprised to see that you thought it was necessary to convert dataframes to
> matrices in order to calculate descriptive statistics. Nothing could be
> farther from the truth. Furthermore, if for some other more valid reason you
> wanted a list of matrices, there is a perfectly good function that will
> convert a dataframe to a matrix, data.matrix(), remembering of course that
> if there is a single character variable in the dataframe, that the entire
> matrix will be of type character.
>>
>> Based on the small amount that I think I know about R it seems to me that
>> sampling the data frame and containing the samples in a list is likely a
>> pretty inefficient way to do this task. Any help that any of you could
>> provide to assist me in iteratively sampling the data frame, and storing the
>> samples in a form on which I can apply other statistical tests would be
>> greatly appreciated.
>>
>> Thank you very much for taking the time to consider my questions.
>

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
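As a rough illustration of computing descriptive statistics directly on a list of data frames with lapply, with no matrix conversion: the toy `mysamples` list and the helper names `skewness_moment`, `kurtosis_excess` and `describe_sample` are hypothetical. The skewness and kurtosis here are simple moment-based versions written in base R so the sketch has no package dependency; packaged skewness() and kurtosis() functions may use slightly different conventions.

```r
set.seed(7)  # only so the sketch is reproducible

# Hypothetical list of 5 samples with mixed column types.
mysamples <- lapply(1:5, function(i) {
  data.frame(
    tensile   = rnorm(20, mean = 60, sd = 5),
    hardness  = rnorm(20, mean = 80, sd = 6),
    indicator = sample(c("Y", "N", "F"), 20, replace = TRUE),
    stringsAsFactors = FALSE
  )
})

# Moment-based skewness: third central moment over the 1.5 power of
# the second central moment.
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis (a normal distribution gives about 0).
kurtosis_excess <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Summarise only the numeric columns of one data frame; the character
# indicator column is skipped rather than coerced.
describe_sample <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(num, function(x) c(mean     = mean(x),
                            sd       = sd(x),
                            skewness = skewness_moment(x),
                            kurtosis = kurtosis_excess(x)))
}

# One small statistics-by-variable matrix per sample.
results <- lapply(mysamples, describe_sample)
results[[1]]
```

Each element of `results` is a 4 x 2 matrix here (statistics in rows, numeric variables in columns), so something like `sapply(results, function(m) m["skewness", "tensile"])` would collect one statistic across all samples.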