Hi: There seems to be a problem when the string ends in "," or ".": strsplit() does not return a trailing empty piece, so when splitting on those characters the number of pieces no longer tracks the number of separators. Here is an alternative that splits into individual characters and uses charmatch() instead:

charsum <- function(s, char) {
    # split the string into single characters
    u <- strsplit(s, "")
    # charmatch() gives 1 for a match and NA otherwise, so the sum is the count
    sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
}
unname(sapply(txtvec, function(x) charsum(x, ',')))
unname(sapply(txtvec, function(x) charsum(x, '.')))

Putting this into a data frame,

dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
                    commas  = unname(sapply(txtvec, function(x) charsum(x, ','))))
txtvec

HTH,
Dennis

On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>
>>
>>
>>>> Dear all,
>>>>
>>>> I am doing a project on variant calling using R.I am working on
>>>> pileup file.There are 10 columns in my data frame and I want to
>>>> count the number of A,C,G and T in each row for column 9.example of
>>>> column 9 is given below-
>>>>
>>>> .a,g,,
>>>> .t,t,,
>>>> .,c,c,
>>>> .,a,,,
>>>> .,t,t,t
>>>> .c,,g,^!.
>>>> .g,ggg.^!,
>>>> .$,,,,,.,
>>>> a,g,,t,
>>>> ,,,,,.,^!.
>>>> ,$,,,,.,.
>>>>
>>>> This is a bit confusing for me as these characters are in one column
>>>> and how can we scan them for each row to print number of A,C,G and T
>>>> for each row.
>>>
>>> Seems a bit clunky but this does the job (first the data):
>>>>
>>>> txt <- " .a,g,,
>>>
>>> + .t,t,,
>>> + .,c,c,
>>> + .,a,,,
>>> + .,t,t,t
>>> + .c,,g,^!.
>>> + .g,ggg.^!,
>>> + .$,,,,,.,
>>> + a,g,,t,
>>> + ,,,,,.,^!.
>>> + ,$,,,,.,."
>>>
>>>> txtvec <- readLines(textConnection(txt))
>>>
>>> Now the clunky solution, Basically subtracts 1 from the counts of
>>> "fragments" that result from splitting on each letter in turn. Could
>>> be made prettier with a function that did the job.
>>>
>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>
>>> split="a"), length) , "-", 1)),
>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>> length) , "-", 1)),
>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>> length) , "-", 1)),
>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>> length) , "-", 1)) )
>>>            A C G T
>>> .a,g,,     1 0 1 0
>>> .t,t,,     0 0 0 2
>>> .,c,c,     0 2 0 0
>>> .,a,,,     1 0 0 0
>>> .,t,t,t    0 0 0 2
>>> .c,,g,^!.  0 1 1 0
>>> .g,ggg.^!, 0 0 4 0
>>> .$,,,,,.,  0 0 0 0
>>> a,g,,t,    1 0 1 1
>>> ,,,,,.,^!. 0 0 0 0
>>> ,$,,,,.,.  0 0 0 0
>>>
>>> Has the advantage that the input data ends up as rownames, which was a
>>> surprise.
>>>
>>> If you wanted to count "A" and "a" as equivalent, then the split
>>> argument should be "a|A"
>>>
>>>
>>
>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT LIKE
>>>> THIS.
>>
>> BUT CAN I COUNT . AND , ALSO USING-
>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>> split=".|,"), length) , "-", 1)),
>>
>> I TRIED IT BUT ITS NOT WORKING.IT IS GIVING THE OUTPUT BUT AT SOME PLACES
>> IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN
>> CALCULATING AND JUST SHOWING 0.
>
> You need to use valid regex expressions for 'split'. Since "." and "," are
> special characters they need to be escaped when you want the literals to be
> recognized as such.
>
> I haven't figured out why but you need to drop the final operation of
> subtracting 1 from the values when counting commas:
>
> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
> split="\\."), length) , "-", 1))
> ,commas = unlist( lapply( sapply(txtvec, strsplit,
> split="\\,"), length) ) )
>            periods commas
> .a,g,,           1      3
> .t,t,,           1      3
> .,c,c,           1      3
> .,a,,,           1      4
> .,t,t,t          1      4
> .c,,g,^!.        1      4
> .g,ggg.^!,       2      2
> .$,,,,,.,        2      6
> a,g,,t,          0      4
> ,,,,,.,^!.       1      7
> ,$,,,,.,.        1      7
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
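
P.S. The trailing-separator behaviour mentioned at the top of this message can be checked on a few short strings. The sketch below is only illustrative: count_char() is a hypothetical helper that does not appear anywhere in the thread, and it counts a literal character by comparing string lengths before and after deleting it, so no regex escaping is involved.

```r
# Hypothetical helper (not from the thread): count a literal character by
# comparing the string length before and after removing that character.
count_char <- function(s, char) {
  nchar(s) - nchar(gsub(char, "", s, fixed = TRUE))
}

x <- c("a,b", "a,b,", ",,,,,.,^!.")

# strsplit() does not return a trailing empty piece, so "a,b" and "a,b,"
# both split into two pieces even though they contain 1 and 2 commas.
pieces <- sapply(strsplit(x, ",", fixed = TRUE), length)
pieces - 1

# Counting by deletion does not depend on where the separators sit.
count_char(x, ",")
count_char(x, ".")
```

Because count_char() is vectorised over s, it could presumably be applied to txtvec directly, without the sapply() wrapper.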