I had to look up winsorized; this should do it: > #My reproducible example: > firm<-sort(rep(1:1000,10),decreasing=F) > year<-rep(1998:2007,1000) > industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10), + rep(10,10)),1000) > X1<-rnorm(10000) > data<-data.frame(firm, industry,year,X1) > # split the data by industry/year > d.s <- split(data, list(data$industry, data$year), drop=TRUE) > result <- lapply(d.s, function(.id){ + # get 10/90% values + .limit <- quantile(.id$X1, prob=c(.1, .9)) + .id$X1[.id$X1 < .limit[1]] <- .limit[1] + .id$X1[.id$X1 > .limit[2]] <- .limit[2] + .id + }) > > > str(result) List of 100 $ 1.1998 :'data.frame': 1000 obs. of 4 variables: ..$ firm : int [1:1000] 1 11 21 31 41 51 61 71 81 91 ... ..$ industry: num [1:1000] 1 1 1 1 1 1 1 1 1 1 ... ..$ year : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ... ..$ X1 : num [1:1000] -0.9807 1.0728 1.1759 0.0238 0.976 ... $ 2.1998 :'data.frame': 1000 obs. of 4 variables: ..$ firm : int [1:1000] 2 12 22 32 42 52 62 72 82 92 ... ..$ industry: num [1:1000] 2 2 2 2 2 2 2 2 2 2 ... ..$ year : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ... ..$ X1 : num [1:1000] 0.874 -0.213 -0.971 1.062 -0.099 ... $ 3.1998 :'data.frame': 1000 obs. of 4 variables: ..$ firm : int [1:1000] 3 13 23 33 43 53 63 73 83 93 ... ..$ industry: num [1:1000] 3 3 3 3 3 3 3 3 3 3 ... ..$ year : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ... ..$ X1 : num [1:1000] -0.3051 -0.0775 -1.2062 -0.8826 0.6671 ...
On Mon, Aug 2, 2010 at 6:54 PM, Cecilia Carmo <cecilia.ca...@ua.pt> wrote: > Thank you again, but I think I need to do some homework about the split > function, because I'm not understanding it very well. > Besides, I think I still have a problem. I also need X2 = X1 winsorized: X2 > is equal to X1 between 10%-90%, and is equal to the 10% value when < 10% and > equal to the 90% value when it is >. > Could you help me? > > Thank you > Cecília > > Em Mon, 2 Aug 2010 18:42:27 -0400 > jim holtman <jholt...@gmail.com> escreveu: >> >> This is just following up with the example data you sent. This will >> create a list 'result' that will have the subset of data between the >> 10% & 90%-tiles of the data: >> >>> #My reproducible example: >>> firm<-sort(rep(1:1000,10),decreasing=F) >>> year<-rep(1998:2007,1000) >>> >>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10), >> >> + rep(10,10)),1000) >>> >>> X1<-rnorm(10000) >>> data<-data.frame(firm, industry,year,X1) >>> # split the data by industry/year >>> d.s <- split(data, list(data$industry, data$year), drop=TRUE) >>> result <- lapply(d.s, function(.id){ >> >> + # get 10/90% values >> + .limit <- quantile(.id$X1, prob=c(.1, .9)) >> + subset(.id, X1 >= .limit[1] & X1 <= .limit[2]) >> + }) >>> >>> str(result) >> >> List of 100 >> $ 1.1998 :'data.frame': 800 obs. of 4 variables: >> ..$ firm : int [1:800] 1 21 31 41 51 61 71 81 91 111 ... >> ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ... >> ..$ year : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998 >> 1998 1998 ... >> ..$ X1 : num [1:800] 0.659 -0.105 -0.617 0.342 -1.077 ... >> $ 2.1998 :'data.frame': 800 obs. of 4 variables: >> ..$ firm : int [1:800] 2 32 42 52 62 72 102 112 132 162 ... >> ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ... >> ..$ year : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998 >> 1998 1998 ... >> ..$ X1 : num [1:800] -1.1044 -0.0666 -0.9184 0.3469 -0.2348 ... >> >> You can see that the 'name' of the list element is the industry.year >> combination; this can also be seen in the data. >> >> On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo <cecilia.ca...@ua.pt> wrote: >>> >>> Thank you for your help but I don't understand how can I have a dataframe >>> with the columns: firm, year, industry, X1 and X2. Could you help me >>> (again)? >>> >>> >>> Cecília Carmo >>> >>> >>> Em Sat, 31 Jul 2010 22:10:38 -0400 >>> jim holtman <jholt...@gmail.com> escreveu: >>>> >>>> This will split the data by industry & year and then return the values >>>> that include the 80%-tile (>=10% & <= 90%) >>>> >>>> # split the data by industry/year >>>> d.s <- split(data, list(data$industry, data$year), drop=TRUE) >>>> result <- lapply(d.s, function(.id){ >>>> # get 10/90% values >>>> .limit <- quantile(.id$X1, prob=c(.1, .9)) >>>> subset(.id, X1 >= .limit[1] & X1 <= .limit[2]) >>>> }) >>>> >>>> This returns a list of 100 elements for each combination. >>>> >>>> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo <cecilia.ca...@ua.pt> >>>> wrote: >>>>> >>>>> Hi everyone! >>>>> >>>>> #I need a loop or a function that creates a X2 variable that is X1 >>>>> without >>>>> the extreme values (or X1 winsorized) by industry and year. >>>>> >>>>> #My reproducible example: >>>>> firm<-sort(rep(1:1000,10),decreasing=F) >>>>> year<-rep(1998:2007,1000) >>>>> >>>>> >>>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10), >>>>> rep(10,10)),1000) >>>>> X1<-rnorm(10000) >>>>> data<-data.frame(firm, industry,year,X1) >>>>> data >>>>> >>>>> The way I’m doing this is very hard. I split my sample by industry and >>>>> year, >>>>> for each industry and year I calculate the 10% and 90% quantiles, then >>>>> I >>>>> create a X2 variable like this: >>>>> >>>>> industry1<-subset(data,data$industry==1) >>>>> >>>>> ind1year1999<-subset(industry1,industry1$year==1999) >>>>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE) >>>>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE) >>>>> >>>>> >>>>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1))) >>>>> >>>>> ind1year2000<-subset(industry1,industry1$year==2000) >>>>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE) >>>>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE) >>>>> >>>>> >>>>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1))) >>>>> >>>>> I repeat this for all years and industries, and then I merge/bind all >>>>> again >>>>> to have a new dataframe with all the columns of the dataframe «data» >>>>> plus >>>>> X2. >>>>> >>>>> Could anyone help me doing this in a easier way? >>>>> >>>>> Thanks >>>>> Cecília Carmo >>>>> Universidade de Aveiro - Portugal >>>>> >>>>> ______________________________________________ >>>>> R-help@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide >>>>> http://www.R-project.org/posting-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>> >>>> >>>> >>>> -- >>>> Jim Holtman >>>> Cincinnati, OH >>>> +1 513 646 9390 >>>> >>>> What is the problem that you are trying to solve? >>> >>> >>> >> >> >> >> -- >> Jim Holtman >> Cincinnati, OH >> +1 513 646 9390 >> >> What is the problem that you are trying to solve? > > > -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.