Hi Harold,

What about this? You only have to make the crosstabulation once.

> qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher = factor(c(10,10,20,20,25)))
> tab <- table(qq$student, qq$teacher)
> data.frame(Student = rownames(tab), Freq = rowSums(tab), tch = rowSums(tab > 0) == 1)
  Student Freq   tch
1       1    2  TRUE
2       2    3 FALSE

HTH,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
thierry.onkel...@inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

-----Original message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On behalf of Doran, Harold
Sent: Friday, 27 February 2009 15:47
To: r-help@r-project.org
Subject: [R] Making tapply code more efficient

Previously, I posed the question pasted below to the list and received some very helpful responses. While the code suggestions provided in response do work, they seem to work only with *very* small data sets, so I wanted to follow up and see if anyone had ideas for better efficiency.

I was quite embarrassed by this, as our SAS programmers cranked out programs that did this in the blink of an eye (with a few variables), but R was spinning for days on my Ubuntu machine and ultimately I saw a message that R was "killed".

The data I am working with has 800967 total rows and 31 total columns. The ID variable I use as the index variable in tapply() has 326397 unique cases.

> length(unique(qq$student_unique_id))
[1] 326397

To give a sense of what my data look like and the actual problem, consider the following:

qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
                 teacher_unique_id = factor(c(10,10,20,20,25)))

This is a student achievement database where students occupy multiple rows in the data and the variable teacher_unique_id denotes the class the student was in. What I am doing is checking whether the teacher is the same for each instance of the unique student ID.

So, if I implement the following:

same <- function(x) length( unique(x) ) == 1
results <- data.frame(
        freq = tapply(qq$student_unique_id, qq$student_unique_id, length),
        tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
)

I get the following results. I can see that student 1 appears in the data twice and the teacher is always the same. However, student 2 appears three times and the teacher is not always the same.

> results
  freq   tch
1    2  TRUE
2    3 FALSE

Now, applying this same procedure to a large data set with the characteristics described above seems to be problematic. Does anyone have suggestions on how this could be made more efficient, so that it can run with large data as I described?

Harold

> sessionInfo()
R version 2.8.1 (2008-12-22)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

##### Original question posted on 1/13/09

Suppose I have a dataframe as follows:

dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2 = c('foo', 'foo', 'foo', 'foobar', 'foo'))

Now, if I were to subset by id, such as:

> subset(dat, id==1)
  id var1 var2
1  1   10  foo
2  1   10  foo

I can see that the elements in var1 are exactly the same and the elements in var2 are exactly the same. However,

> subset(dat, id==2)
  id var1   var2
3  2   20    foo
4  2   20 foobar
5  2   25    foo

shows the elements are not the same for either variable in this instance. So, what I am looking to create is a data frame that would be like this:

id freq  var1  var2
 1    2  TRUE  TRUE
 2    3 FALSE FALSE

Here freq is the number of times the ID is repeated in the dataframe. A TRUE appears in the cell if all elements in the column are the same for the ID and FALSE otherwise. For my problem it does not matter which values differ.

The way I am thinking about tackling this is to loop through the ID variable and compare the values in the various columns of the dataframe. The problem I am encountering is that I don't think all.equal or identical are the right functions in this case.

So, say I wanted to compare the elements of var1 for id == 1. I would have

x <- c(10,10)

Of course, the following works:

> all.equal(x[1], x[2])
[1] TRUE

as would a similar call to identical. However, what if I only have a vector of values (or if the column consists of names) that I want to assess for equality when I am trying to automate a process over thousands of cases?
As in the example above, the vector may contain only two values or it may contain many more. The number of values in the vector differs by id.

Any thoughts?

Harold

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

The views expressed in this message and any annex are purely those of the writer and may not be regarded as stating an official position of INBO, as long as the message is not confirmed by a duly signed document.
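
A possible variation on the approaches discussed in this thread, offered only as a sketch: table() on two factors builds a dense student-by-teacher array, which may itself become large with 326397 students. One way around that is to keep a single row per distinct (id, value) pair and then count pairs per id. The helper name constant_within and its arguments are hypothetical (they do not appear in the thread), and the sketch has not been timed on data of the size described above.

```r
# Hypothetical helper (not from the thread): for each id, count its rows and
# flag, per column, whether all values within that id are the same.
constant_within <- function(data, id, cols) {
  g <- factor(data[[id]])
  # rows per id, in the order of levels(g)
  out <- data.frame(id = levels(g), freq = tabulate(g, nbins = nlevels(g)))
  for (col in cols) {
    # keep one row per distinct (id, value) pair ...
    pairs <- unique(data.frame(g = g, v = data[[col]]))
    # ... so an id with exactly one remaining pair has a constant column
    out[[col]] <- tabulate(pairs$g, nbins = nlevels(g)) == 1
  }
  out
}

# Example data from the original question
dat <- data.frame(id = c(1,1,2,2,2),
                  var1 = c(10,10,20,20,25),
                  var2 = c('foo', 'foo', 'foo', 'foobar', 'foo'))
res <- constant_within(dat, "id", c("var1", "var2"))
print(res)
```

On the small example this should reproduce the table asked for in the original question: id 1 with freq 2 and TRUE in both columns, id 2 with freq 3 and FALSE in both. For the student/teacher data the assumed call would be constant_within(qq, "student_unique_id", "teacher_unique_id").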