To answer my own post, and for the archives (not that I hope anyone has to repeat what I had to do ;-), after much hair-pulling, frowning at the screen and general dumb-headedness, the following slab of R code achieves the results I wanted. It isn't elegant but it does the job.

msr <- function(x) {
    res <- numeric(length = length(levels(x$id)))
    names(res) <- levels(x$id)
    for(site in levels(x$id)) {
        ## subset just data for this site
        DAT <- x[x$id == site, ]
        ## split out the spp and count the ones not 00
        spp <- with(DAT, substr(sppcode, 7, 8))
        spp.counted <- which(spp != "00")
        spp <- with(DAT[spp.counted, ], sppcode)
        SPP <- length(spp.counted)
        ## guard: DAT[-integer(0), ] would drop every row
        if(length(spp.counted) >= 1) {
            DAT <- DAT[-spp.counted, ]
        }
        ## drop genera for spp already counted
        want <- with(DAT, which(substr(sppcode, 1, 6) %in% substr(spp, 1, 6)))
        if(length(want) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## now count genera remaining not 00
        gen <- with(DAT, substr(sppcode, 5, 6))
        gen.counted <- which(gen != "00")
        gen <- with(DAT[gen.counted, ], sppcode)
        GEN <- length(gen.counted)
        if(length(gen.counted) >= 1) {
            DAT <- DAT[-gen.counted, ]
        }
        ## drop families already in spp, or genera that we already caught
        want1 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(spp, 1, 4)))
        want2 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(gen, 1, 4)))
        if(length(want <- unique(c(want1, want2))) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## count remaining families != 00
        fam <- with(DAT, substr(sppcode, 3, 4))
        fam.counted <- which(fam != "00")
        fam <- with(DAT[fam.counted, ], sppcode)
        FAM <- length(fam.counted)
        if(length(fam.counted) >= 1) {
            DAT <- DAT[-fam.counted, ]
        }
        ## drop orders for families already counted
        want1 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(spp, 1, 2)))
        want2 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(gen, 1, 2)))
        want3 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(fam, 1, 2)))
        if(length(want <- unique(c(want1, want2, want3))) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## count the orders remaining
        ORD <- nrow(DAT)
        ## populate return vector
        res[site] <- SPP + GEN + FAM + ORD
    }
    return(res)
}

## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)
## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order = with(dat, substr(sppcode, 1, 2)),
                   family = with(dat, substr(sppcode, 3, 4)),
                   genus = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))
msr(dat)

Yields:

> msr(dat)
10307 10719 10786
   15    40    35

Which are correct.

G

On Wed, 2009-02-18 at 13:37 +0000, Gavin Simpson wrote:
> Dear List,
>
> I have a data set stored in the following format:
>
> > head(dat, n = 10)
>       id  sppcode abundance
> 1  10307 10000000         1
> 2  10307 16220602         2
> 3  10307 20000000         5
> 4  10307 20110000         2
> 5  10307 24000000         1
> 6  10307 40210000        83
> 7  10307 40210102        45
> 8  10307 45140000         1
> 9  10307 45630000         1
> 10 10307 45630600        41
> > str(dat)
> 'data.frame':   111 obs. of  3 variables:
>  $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
>  $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
>  $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...
>
> that represent counts of species, recorded with a particular coding
> system. The abundance column is not needed for this particular
> operation, but is present in the data files.
>
> I am interested in counting entries (rows) in the sppcode component of
> dat. The sppcode takes a particular format: Order Family Genus Species,
> with 2 alphanumeric digits allocated for each level of the hierarchy. I
> want to know how many species there are in each site (the id factor),
> but I should only count a higher level entry if there are no lower
> levels present.
>
> For example, for the above data excerpt (just the headed rows), I would
> count the following rows:
>
> 10000000
> 16220602
> 20110000
> 24000000
> 40210102
> 45140000
> 45630600 == 7 "species" present.
>
> To be more specific, I don't count 45630000 (row 9) because there exists
> a sppcode for this 'id' where either of the next two pairs of digits are
> not all 0's.
>
> In words, I want to count all rows where WWXXYYZZ are ZZ != 00, then,
> rows where ZZ == 00 only if the WWXXYY combination has not been counted
> yet.
>
> An example data set has been placed in my University web space and can
> be read into R with the following:
>
> ## read example csv data
> dat <-
> read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
>          colClasses = c("factor","character","numeric"))
> ## show the data
> head(dat, n = 10)
>
> And the sppcode variable can be broken out into the 4 levels if required via:
>
> ## split out the four levels of categorisation:
> dat2 <- data.frame(dat,
>                    order = with(dat, substr(sppcode, 1, 2)),
>                    family = with(dat, substr(sppcode, 3, 4)),
>                    genus = with(dat, substr(sppcode, 5, 6)),
>                    species = with(dat, substr(sppcode, 7, 8)))
>
> The actual data set/problem contains several hundred different id's.
>
> I can't see an efficient way of processing these data in the manner
> described. Any help would be most gratefully received.
>
> Many thanks,
>
> Gavin
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
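
P.S. For the archives, the counting rule in the quoted message (count a row unless a lower-level entry with the same leading digits exists at that site) can probably also be written without stepping through the four levels in turn. What follows is only a minimal sketch under that reading of the rule; count_site, msr2 and toy are hypothetical names that are not part of the code above, and it has not been checked against the full example file, only against the ten rows shown in the question.

```r
## Hypothetical helper (an assumption, not the code posted above): count the
## rows of one site that are not a higher-level entry for another row.
count_site <- function(codes) {
    ## depth = deepest level coded as something other than "00"
    ## (4 = species, 3 = genus, 2 = family, 1 = order or anything else)
    depth <- ifelse(substr(codes, 7, 8) != "00", 4,
             ifelse(substr(codes, 5, 6) != "00", 3,
             ifelse(substr(codes, 3, 4) != "00", 2, 1)))
    keep <- vapply(seq_along(codes), function(i) {
        n <- 2 * depth[i]
        ## drop row i if a more specific row shares its first n characters
        !any(depth > depth[i] & substr(codes, 1, n) == substr(codes[i], 1, n))
    }, logical(1))
    sum(keep)
}

## Hypothetical wrapper: apply the helper to each site (level of id)
msr2 <- function(x) {
    sapply(split(x$sppcode, x$id), count_site)
}

## toy data: the ten rows printed by head(dat, n = 10) in the question
toy <- data.frame(id = factor(rep("10307", 10)),
                  sppcode = c("10000000", "16220602", "20000000", "20110000",
                              "24000000", "40210000", "40210102", "45140000",
                              "45630000", "45630600"),
                  stringsAsFactors = FALSE)
msr2(toy)
```

On the toy rows this should give 7 for site 10307, matching the hand count in the question. Like msr(), it counts rows rather than unique codes, so duplicated sppcodes within a site would be counted more than once.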