Re: [R] Question About Repeat Random Sampling from a Data Frame

Adam Carr Tue, 22 Dec 2009 03:49:36 -0800

Thanks to both of you for the comments and suggestions. Over the next couple of 
days I plan to work through my simple problem using the help offered in this 
forum.





________________________________
From: David Winsemius <dwinsem...@comcast.net>
To: Bert Gunter <gunter.ber...@gene.com>

Sent: Mon, December 21, 2009 2:31:26 PM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame


On Dec 21, 2009, at 1:01 PM, Bert Gunter wrote:

> Didn't read this thread in detail, so the following suggestion may just be
> nonsense... (caveat emptor), but:
> 
> To sample from a data frame or matrix, sample from the row indices and then
> extract what you want from the sampled rows. Or sample directly from
> individual columns if that suffices. In general,
> 
> ?sample
> 
> on appropriate indices of object in question.
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> 
> 
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
> Behalf Of Adam Carr
> Sent: Monday, December 21, 2009 9:53 AM
> To: David Winsemius
> Cc: r-help@r-project.org
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
> 
> Good Afternoon Dr. Winsemius:
> 
> You ask some very good questions and make excellent points; my responses are
> below. I've tried to extract your questions and provide answers just to
> reduce the clutter.
> 
> 1. You might want to provide statistical justification for the otherwise
> puzzling sampling strategy.
> 
> I assume you mean my overall process of random sampling from a large data
> set. The data set is comprised of observations collected over four years.
> Although the basis for sampling would make a good four-frame Dilbert cartoon
> if it could be condensed enough, my answer begins with the unfortunate truth
> that there is a great divide between the technical and marketing groups at
> the business where I am employed. Many powerful marketing executives, some
> with technical backgrounds, feel that there is something fundamentally wrong
> with the manufacturing process because the data generated over the long term
> is not approximately normally distributed. My task was to examine this set
> of data, trying to keep the representation of Y, N and F approximately equal
> in the sample when compared to the large set, to determine if any subset
> exhibits the holy grail-like normal distribution characteristics. I don't
> feel that this is statistical justification, but it is the
> reason why I am doing this.
> 
> 2. It would help if you explained what you are attempting here in ordinary
> English. There are 10 elements in mysamples, each of which is a 100 x 5
> dataframe, and mat is just one 100 x 5 matrix, which you seem to be
> referencing incorrectly, given the fact that it has two, rather than one,
> dimension. Furthermore, those dataframes may not be of a uniform class,
> since you said you had a character variable. Do you really want these all
> in a character type matrix, which would be what is likely to happen given
> R's requirement that matrix elements be of only one class? What you say
> above suggests not.
> 
> It seems from your response that I incorrectly assumed that a list is not
> the same as a data frame. I started down this path after reading the
> questions and answers to a similar problem where the r-help responder
> suggested a two step process and said that the list must be converted to
> another form in order to be available for analysis.

A data.frame is a special type of list. You can also make lists of dataframes 
(just as you can make lists of lists), which I thought the first portion of 
your code would have done:

mysamples <- list()
for (i in 1:10) {
  # draw 100 row indices with replacement, weighted by group, then subset rows
  mysamples[[i]] <- dataset[sample(1:1637, 100,
      prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
      replace = TRUE), ]
}

Each element in that list would have been a subset of your larger data.frame 
and would itself have been a data.frame.
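
As a rough, self-contained sketch of that point: the data below are simulated, 
and the column names and the Y/N/F ordering of the rows are only assumptions 
standing in for your real data set. It checks that a data.frame is a list and 
that every stored sample is itself a data.frame.

```r
set.seed(1)

# Simulated stand-in for the real 1637-row data set (column names are assumed)
dataset <- data.frame(
  tensile    = rnorm(1637, mean = 60, sd = 5),
  yield      = rnorm(1637, mean = 40, sd = 4),
  elongation = rnorm(1637, mean = 20, sd = 2),
  hardness   = rnorm(1637, mean = 70, sd = 6),
  indicator  = rep(c("Y", "N", "F"), times = c(513, 197, 927)),  # assumed order
  stringsAsFactors = FALSE
)

# One weight per row; sample() rescales prob, so only the ratios matter
w <- c(rep(163.7, 513), rep(245.5, 197), rep(1227.8, 927))

# Build the list of samples by drawing row indices, then subsetting rows
mysamples <- lapply(1:10, function(i) {
  dataset[sample(nrow(dataset), 100, replace = TRUE, prob = w), ]
})

is.list(dataset)                  # a data.frame is a list
sapply(mysamples, is.data.frame)  # each element is a data.frame
sapply(mysamples, dim)            # each column of the result is 100, 5
```

Using lapply() here is just one way to fill the list; the for loop above does 
the same job.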


> 
> And you are absolutely correct that I do not want each sample in a character
> type matrix.
> 
> In plain English, I hope, I am simply trying to iterate the process of
> removing random samples from the large data set, and then saving these
> samples in a format that is available for simple analysis. For example, if I
> remove five hundred mysample sets, each of which is composed of a 100 x 5
> sample of the large data set I am interested in determining the skewness,
> kurtosis, mean and standard deviation of each of the four numeric variables
> in each of the five hundred mysample sets.

So make a small dataframe with variables (columns) of the same type as in your 
real data, maybe 25-30 rows in "extent" (not "length", since for a dataframe, 
the length() function returns the number of columns).
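
A minimal sketch of such a test object, with made-up values and assumed column 
names, showing the difference between length() and the row count, and what 
happens to the column types if the whole thing is forced into a matrix:

```r
set.seed(123)

# Small test data frame: 30 rows, four numeric columns, one character column.
# Values and column names are invented for illustration only.
small <- data.frame(
  tensile    = rnorm(30, mean = 60, sd = 5),
  yield      = rnorm(30, mean = 40, sd = 4),
  elongation = rnorm(30, mean = 20, sd = 2),
  hardness   = rnorm(30, mean = 70, sd = 6),
  indicator  = rep(c("Y", "N", "F"), each = 10),
  stringsAsFactors = FALSE
)

length(small)         # 5: the number of columns, not rows
nrow(small)           # 30
dim(small)            # 30 5
sapply(small, class)  # four "numeric", one "character"

# Forcing the data frame into a matrix makes every cell character
m <- as.matrix(small)
typeof(m)             # "character"
```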
> 
> 3. Sorting out such problems is best done with smaller test objects. I was
> surprised to see...type character.
> 
> I agree. I began to do this with a small test data set but it was late last
> evening and I realized that I should ask for help before proceeding on what
> I thought might be incorrect assumptions. I clearly misunderstood that a
> list needed to be converted to a data frame in order to be available for
> analysis.

Well, if each list element is already a data.frame then no conversions are 
needed. The lapply function can be used to "loop" over a list, and you can 
define a function that will only look at particular components of those 
elements. There are also functions in packages that automate the process. The 
describe function in Hmisc looks at each column, decides what type it is, and 
summarizes it accordingly.
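
A sketch of the lapply approach on simulated data. The column names are 
assumptions, and skew() and kurt() are hypothetical helpers written here from 
the simple moment definitions so that no extra package is needed; packaged 
versions may use slightly different formulas.

```r
set.seed(42)

# Simulated test data (invented values, assumed column names)
small <- data.frame(
  tensile    = rnorm(30, mean = 60, sd = 5),
  yield      = rnorm(30, mean = 40, sd = 4),
  elongation = rnorm(30, mean = 20, sd = 2),
  hardness   = rnorm(30, mean = 70, sd = 6),
  indicator  = sample(c("Y", "N", "F"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)

# A list of five samples of 20 rows each, drawn with replacement
mysamples <- lapply(1:5, function(i) {
  small[sample(nrow(small), 20, replace = TRUE), ]
})

# Hypothetical helpers: moment-based skewness and (non-excess) kurtosis
skew <- function(x) { d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }
kurt <- function(x) { d <- x - mean(x); mean(d^4) / mean(d^2)^2 }

# Summarize only the numeric columns of one data frame
summarise_one <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(num, function(x) {
    c(mean = mean(x), sd = sd(x), skewness = skew(x), kurtosis = kurt(x))
  })
}

# "Loop" over the list; each result is a 4 x 4 matrix (statistic x variable)
results <- lapply(mysamples, summarise_one)
results[[1]]
```

The character column is dropped inside summarise_one(), so nothing has to be 
converted to a matrix first.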

> 
> Thank you for taking the time to respond. The discussion and suggestions are
> very helpful.
> 
> Adam
> 
> ________________________________
> From: David Winsemius <dwinsem...@comcast.net>
> 
> Cc: r-help@r-project.org
> Sent: Mon, December 21, 2009 11:23:43 AM
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
> 
> 
> On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:
> 
>> Good Morning:
>> 
>> I've read many, many posts on the r-help system and I feel compelled to
>> quickly admit that I am relatively new to R. I do have several reference
>> books around me, but I cannot count myself among the fortunate who seem to
>> have strong programming intuition.
>> 
>> I have a data set consisting of 1637 observations of five variables:
>> tensile strength, yield strength, elongation, hardness and a character
>> indicator with three levels: (Y)es, (N)o, and (F)ail.
>> 
>> My objective is to randomly sample various subsets from this data set and
>> then evaluate these subsets using simple parameters among them tests for
>> normality, shape and skewness. The data set is ordered by the character
>> variable prior to sampling, and the samples are weighted to mirror
>> representation in an overall, physical process.
>> 
>> I am sampling the data set using this code:
>> 
>> sample <- dataset[sample(1:1637, 500,
>>     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>>     replace = TRUE), ]
>> 
>> What I would like to do is iterate this process to create many (say 500 or
>> more) sampled sets of n=500 and then evaluate each set for the parameters of
>> interest. I would actually be evaluating each variable within each subset
>> for my characteristic of interest. I am familiar with sampling and saving
>> single columns of data to do this sort of thing, but I am not sure how to
>> accomplish this with a multiple-variable data set.
>> 
>> For example, I am currently iterating this using a clunky process:
>> 
>> mysamples<-list()
>> for (i in 1:10){
>> mysamples[[i]] <- dataset[sample(1:1637, 100,
>>     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>>     replace = TRUE), ]
>> }
>> 
> 
> Using lists to store intermediate results is not considered clunky in R.
> (You might want to provide statistical justification for the otherwise
> puzzling sampling strategy.)
> 
>> But this leaves me with the additional task of defining each mysample[i]
>> iteration and converting it to a form on which I can apply a standard
>> statistical test like mean() or skewness() to the variable columns within
>> each subset. I have attempted to iteratively convert these lists using this
>> code:
>> 
>> mat<-matrix(nrow=100,ncol=5)
>> for (i in 1:length(mysamples))
>> {mat[i]<-do.call('rbind',mysamples[i])}
> 
> It would help if you explained what you are attempting here in ordinary
> English. There are 10 elements in mysamples, each of which is a 100 x 5
> dataframe, and mat is just one 100 x 5 matrix, which you seem to be
> referencing incorrectly, given the fact that it has two, rather than one,
> dimension. Furthermore, those dataframes may not be of a uniform class,
> since you said you had a character variable. Do you really want these all
> in a character type matrix, which would be what is likely to happen given
> R's requirement that matrix elements be of only one class? What you say
> above suggests not.
> 
>> 
>> but running the code generates the error message: number of items to
>> replace is not a multiple of replacement length.
> 
> Because of the way you are referencing the matrix, probably. If you wanted a
> 10 x 100 x 5 array, then create an array. In R, as far as I can tell anyway,
> matrices are necessarily of 2 dimensions. Tables and arrays can be of higher
> dimension.
> 
>> I have tried unsuccessfully, by reading many, many helpful r-help emails
>> on this error, to understand my probably obvious mistake.
> 
> Sorting out such problems is best done with smaller test objects. I was
> surprised to see that you thought it was necessary to convert dataframes to
> matrices in order to calculate descriptive statistics. Nothing could be
> farther from the truth. Furthermore, if for some other more valid reason you
> wanted a list of matrices, there is a perfectly good function that will
> convert a dataframe to a matrix, as.matrix(), remembering of course that
> if there is a single character variable in the dataframe, the entire
> matrix will be of type character. (data.matrix() instead returns a numeric
> matrix, converting character columns to integer codes.)
>> 
>> Based on the small amount that I think I know about R it seems to me that
>> sampling the data frame and containing the samples in a list is likely a
>> pretty inefficient way to do this task. Any help that any of you could
>> provide to assist me in iteratively sampling the data frame, and storing the
>> samples in a form on which I can apply other statistical tests would be
>> greatly appreciated.
>> 
>> Thank you very much for taking the time to consider my questions.
> --

David Winsemius, MD
Heritage Laboratories
West Hartford, CT


      

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
