Re: [R] creating a quantile variable based on subsets of a dataframe

David Winsemius Tue, 06 Aug 2013 10:32:31 -0700

On Aug 6, 2013, at 10:17 AM, Gavin Rudge wrote:

> #some sample data:
> library(Hmisc)
> set.seed(33)
> df<-data.frame(x=(sample(letters[1:10],1000,replace=TRUE)),y=rnorm(1000,mean=20,sd=15))
> 
> x is a category from a to J, say a geographical area, into which an 
> observation y falls, y being a score.  Now if I want to put my score into 
> quantiles (quintiles in this case) across the whole population of 
> observations and then make a quintile variable I do the following:
> 
> #make a quintile variable
> df<- within(df,z<-as.integer(cut2(y,quantile(y,probs=seq(0,1,0.2)))))
> 
> I'm using cut2 here as I want the extremes of my ranges to be included in the 
> upper and lower bins.
> 
> So far so good, but I would also like another variable to indicate the 
> quintile of the score within the areas indicated by the x variable, so all of 
> the scores where x=a, binned into quintiles for area a, the same for scores 
> in areas b, c and so on.
> 
> I see that I could put my quintile variable code into a function and then 
> split my data frame by x, apply the function in each of the ten groups and 
> stitch the whole thing back together again (not sure I could write it 
> though), but is there a much simpler solution?
>


Generally questions involving the distribution of a single variate grouped 
within categories where the desired result is as long as the original variate 
are well handled with th `ave` function:

> df$c2.grp <- ave(df$y, df$x, FUN=function(z) cut2(z, 
> quantile(z,probs=seq(0,1,0.2)) ) )
> str(df)
'data.frame':   1000 obs. of  3 variables:
 $ x     : Factor w/ 10 levels "a","b","c","d",..: 5 4 5 10 9 6 5 4 1 2 ...
 $ y     : num  15 45.3 29.9 45.2 23.3 ...
 $ c2.grp: num  2 5 4 5 3 4 2 4 3 2 ...

I was a bit surprised that the resulting column in df was numeric rather than 
factor, but I suspect it was the fact that the levels of the intra-groups 
splits could not be reconciled. You didn't apparently consider that issue in 
your problem specification. The result could be "cleaned up" with:

> df$c2.grp <- factor(df$c2.grp, labels=paste0("Q", 1:5) )
> with(df, table(x, c2.grp))
   c2.grp
x   Q1 Q2 Q3 Q4 Q5
  a 22 23 22 22 22
  b 19 19 18 19 19
  c 21 20 20 20 21
  d 20 19 19 19 20
  e 19 20 21 20 20
  f 21 21 21 21 22
  g 21 21 21 21 22
  h 19 19 19 19 19
  i 18 18 17 18 18
  j 20 20 19 20 20
-- 

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] creating a quantile variable based on subsets of a dataframe

Reply via email to