Dear List,

I have a data set stored in the following format:

> head(dat, n = 10)
      id  sppcode abundance
1  10307 10000000         1
2  10307 16220602         2
3  10307 20000000         5
4  10307 20110000         2
5  10307 24000000         1
6  10307 40210000        83
7  10307 40210102        45
8  10307 45140000         1
9  10307 45630000         1
10 10307 45630600        41
> str(dat)
'data.frame':   111 obs. of  3 variables:
 $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
 $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...

These data represent counts of species, recorded with a particular coding system. The abundance column is not needed for this particular operation, but is present in the data files.

I am interested in counting entries (rows) in the sppcode component of dat. The sppcode takes a particular format: Order Family Genus Species, with two alphanumeric characters allocated to each level of the hierarchy.

I want to know how many species there are in each site (the id factor), but I should only count a higher-level entry if there are no lower-level entries present beneath it. For example, for the above data excerpt (just the headed rows), I would count the following rows:

10000000
16220602
20110000
24000000
40210102
45140000
45630600

== 7 "species" present.

To be more specific, I don't count 45630000 (row 9) because there exists a sppcode for this 'id' with the same leading digits where either of the next two pairs of digits is not all 0's (45630600, row 10). In words, writing a code as WWXXYYZZ, I want to count all rows where ZZ != 00, and then rows where ZZ == 00 only if the WWXXYY combination has not been counted yet, and so on up the hierarchy.

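To make the rule concrete, here is a naive sketch of it applied to the ten headed rows. The function name count_taxa is made up for illustration, and the sketch assumes every sppcode is exactly 8 characters long, that codes are unique within an id, and that "00" only ever appears as trailing padding. It compares every code against every other code within an id, so it is probably not the most efficient way to handle several hundred ids.

```r
## Hypothetical helper (not from any package): count the codes in one site
## that are not superseded by a more specific code in the same site.
count_taxa <- function(codes) {
  codes <- unique(codes)
  ## strip trailing "00" pairs, e.g. "45630000" -> "4563"
  pre <- sub("(00)+$", "", codes)
  ## a code is superseded if another code has a longer prefix
  ## that begins with this code's prefix
  superseded <- vapply(seq_along(pre), function(i) {
    any(nchar(pre) > nchar(pre[i]) &
        substr(pre, 1, nchar(pre[i])) == pre[i])
  }, logical(1))
  sum(!superseded)
}

## the ten headed rows from above
head_dat <- data.frame(
  id = factor(rep("10307", 10)),
  sppcode = c("10000000", "16220602", "20000000", "20110000", "24000000",
              "40210000", "40210102", "45140000", "45630000", "45630600"),
  stringsAsFactors = FALSE
)

## one count per site
res <- sapply(split(head_dat$sppcode, head_dat$id), count_taxa)
res
```

For these ten rows the result is 7 for id 10307, which matches the hand count above: 20000000, 40210000 and 45630000 are dropped because a more specific code exists beneath each of them.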
An example data set has been placed in my University web space and can be read into R with the following:

## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)

And the sppcode variable can be broken out into the 4 levels if required via:

## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order   = with(dat, substr(sppcode, 1, 2)),
                   family  = with(dat, substr(sppcode, 3, 4)),
                   genus   = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))

The actual data set/problem contains several hundred different ids. I can't see an efficient way of processing these data in the manner described.

Any help would be most gratefully received.

Many thanks,

Gavin

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.