Thank you Mohamed and Bill for your replies. (I did not send the data because it is unwieldy.)
Yes Bill, the issue arises directly from what you had guessed. I was working with a subset of the data (which implicitly kept the factor levels of the complete data set).

On this, what is the best way to take a subset of the data which ignores these "extraneous" levels?

> log<-data.frame(Flag=1:2, RequestID=factor(letters[1:2],levels=letters[1:10]))
> log2 <-subset(log, RequestID=="a")
> levels(log2$RequestID)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

In other words, how do I take a subset which yields "a" as the only level for log2?

Alex

-----Original Message-----
From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Thursday, October 15, 2009 11:59 PM
To: Alexander Peterhansl; r-help@r-project.org
Subject: RE: [R] tapply() and using factor() on a factor

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Alexander Peterhansl
> Sent: Thursday, October 15, 2009 2:50 PM
> To: r-help@r-project.org
> Subject: [R] tapply() and using factor() on a factor
>
> Dear List,
>
> Shouldn't result1 and result2 be equal in the following case?
>
> Note that log$RequestID is a factor. That is, is.factor(log$RequestID) yields TRUE.
>
> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
> result2 <- tapply(log$Flag,log$RequestID,sum)

Showing us the output of dput(log) (or str(log) and summary(log)) would let people discover the problem more readily. Since you didn't, I'll guess what the dataset may contain.

If log$RequestID is a factor with lots of unused levels, tapply will output an NA for each unused level. factor(log$RequestID) will create a new set of levels, only those actually used, so tapply will not be forced to fill those spots with NA's.
E.g.,

> log<-data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
> tapply(log$Flag, log$RequestID, sum)
 a  b  c  d  e  f  g  h  i  j
 1  2 NA NA NA NA NA NA NA NA
> tapply(log$Flag, factor(log$RequestID), sum)
a b
1 2

I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see how to fill the cells with no data behind them, but it doesn't.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> Yet, when I summarize the output, I get the following:
>
> summary(result1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   11.00   11.00   11.00   26.06   11.00  101.00
>
> summary(result2)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
>
> Why does result2 have 978 NA's?
>
> Any help on this would be appreciated.
>
> Alex
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
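
One possible way to get a subset whose only level is "a", sketched on the toy data from this thread. It is not necessarily the best approach for the real data set, which was not posted. The names log2a and log2b are made up for illustration, and droplevels() is an assumption about the R version in use: it needs R 2.12.0 or later, which is newer than this exchange. Re-applying factor() to the column, as in Bill's tapply() example, should work on older versions too.

```r
# Toy data from the thread: two rows, but ten declared factor levels.
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

# subset() keeps all ten levels, whether or not they still occur.
log2 <- subset(log, RequestID == "a")
nlevels(log2$RequestID)

# Option 1: rebuild the factor so only the levels in use remain.
log2a <- log2                      # hypothetical name, for illustration
log2a$RequestID <- factor(log2a$RequestID)
levels(log2a$RequestID)

# Option 2: droplevels() does the same for every factor column at once
# (assumes R >= 2.12.0).
log2b <- droplevels(log2)          # hypothetical name, for illustration
levels(log2b$RequestID)

# With the unused levels gone, tapply() has no empty cells to fill with NA.
tapply(log2b$Flag, log2b$RequestID, sum)
```

Option 1 touches a single column, which may be preferable when other factor columns in the data frame should keep their full set of levels.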