> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
[1] 3 3 3 4 3 3 2 6 4 6 6
It easily generalizes to the period and to the `|` (or) operation on letters. (I did need to add the check, since each element returned by gregexpr() always has length at least one but has the value -1 when there is no match.)
> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
[1] 0 2 0 0 3 0 0 0 1 0 0

On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
Hi:

There seems to be a problem if the string ends in , or . , which makes it difficult for strsplit() to pick up if it is splitting on those characters. Here is an alternative, splitting on individual characters and using charmatch() instead:

charsum <- function(s, char) {
  # split the string into single characters
  u <- strsplit(s, "")
  # charmatch() gives 1 for a match and NA otherwise, so the sum is the count
  sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
}

unname(sapply(txtvec, function(x) charsum(x, ',')))
unname(sapply(txtvec, function(x) charsum(x, '.')))

Putting this into a data frame,

dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
                    commas = unname(sapply(txtvec, function(x) charsum(x, ','))))
txtvec

HTH,
Dennis

On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsem...@comcast.net> wrote:

On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:

Dear all,

I am doing a project on variant calling using R. I am working on a pileup file. There are 10 columns in my data frame and I want to count the number of A, C, G and T in each row for column 9. An example of column 9 is given below:

.a,g,,
.t,t,,
.,c,c,
.,a,,,
.,t,t,t
.c,,g,^!.
.g,ggg.^!,
.$,,,,,.,
a,g,,t,
,,,,,.,^!.
,$,,,,.,.

This is a bit confusing for me, as these characters are in one column. How can we scan them for each row to print the number of A, C, G and T for each row?

Seems a bit clunky but this does the job (first the data):

txt <- " .a,g,,
.t,t,,
.,c,c,
.,a,,,
.,t,t,t
.c,,g,^!.
.g,ggg.^!,
.$,,,,,.,
a,g,,t,
,,,,,.,^!.
,$,,,,.,."
txtvec <- readLines(textConnection(txt))

Now the clunky solution. It basically subtracts 1 from the counts of "fragments" that result from splitting on each letter in turn. It could be made prettier with a function that did the job.

data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
           C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
           G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
           T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )

           A C G T
.a,g,,     1 0 1 0
.t,t,,     0 0 0 2
.,c,c,     0 2 0 0
.,a,,,     1 0 0 0
.,t,t,t    0 0 0 2
.c,,g,^!.  0 1 1 0
.g,ggg.^!, 0 0 4 0
.$,,,,,.,  0 0 0 0
a,g,,t,    1 0 1 1
,,,,,.,^!. 0 0 0 0
,$,,,,.,.  0 0 0 0

This has the advantage that the input data ends up as rownames, which was a surprise. If you wanted to count "A" and "a" as equivalent, then the split argument should be "a|A".

AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT LIKE THIS. BUT CAN I COUNT . AND , ALSO USING-

data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split=".|,"), length) , "-", 1)),

I TRIED IT BUT ITS NOT WORKING. IT IS GIVING THE OUTPUT BUT AT SOME PLACES IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN CALCULATING AND JUST SHOWING 0.

You need to use valid regex expressions for 'split'. Since "." is a special character in a regex, it needs to be escaped when you want the literal to be recognized as such (the comma is not special, so escaping it is harmless but unnecessary). I haven't figured out why, but you need to drop the final operation of subtracting 1 from the values when counting commas:

data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit, split="\\."), length) , "-", 1)),
           commas = unlist( lapply( sapply(txtvec, strsplit, split="\\,"), length) ) )

           periods commas
.a,g,,           1      3
.t,t,,           1      3
.,c,c,           1      3
.,a,,,           1      4
.,t,t,t          1      4
.c,,g,^!.        1      4
.g,ggg.^!,       2      2
.$,,,,,.,        2      6
a,g,,t,          0      4
,,,,,.,^!.       1      7
,$,,,,.,.        1      7

--
David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
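
The inconsistency in the strsplit() counts quoted above (subtracting 1 works for letters but not for commas) appears to come from how strsplit() treats a match at the very end of a string: the empty piece after it is dropped. A minimal sketch, using two made-up strings that are not from the original data:

```r
# Two illustrative strings that differ only by a trailing comma.
with_trailing    <- "a,b,"   # contains two commas
without_trailing <- "a,b"    # contains one comma

# strsplit() does not return an empty piece after a match at the end
# of the string, so both inputs produce the same fragments.
pieces_with    <- strsplit(with_trailing, ",", fixed = TRUE)[[1]]
pieces_without <- strsplit(without_trailing, ",", fixed = TRUE)[[1]]

n_with    <- length(pieces_with)
n_without <- length(pieces_without)

print(pieces_with)
print(pieces_without)
```

Because the fragment count is the same in both cases, "fragments minus one" undercounts by one whenever the string ends in the character being counted, which is why no single adjustment is right for every row.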
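
For anyone who wants the gregexpr() idea from the top of this message as a reusable function, here is one possible sketch. The name count_matches() and its `fixed` argument are introduced here for illustration only and are not from the thread; the data are the eleven example strings from the original question, typed in directly rather than read through textConnection().

```r
# Example strings from the original question.
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t",
            ".c,,g,^!.", ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,",
            ",,,,,.,^!.", ",$,,,,.,.")

# Hypothetical helper: number of matches of `pattern` in each element of `x`.
# gregexpr() returns -1 (with length 1) when nothing matches, so that case
# is mapped to 0 explicitly.
count_matches <- function(x, pattern, fixed = FALSE) {
  hits <- gregexpr(pattern, x, fixed = fixed)
  vapply(hits, function(h) if (h[1] == -1L) 0L else length(h), integer(1))
}

# Letters use a regex alternation so upper and lower case both count;
# "." and "," are matched literally with fixed = TRUE, so no escaping is needed.
counts <- data.frame(
  A       = count_matches(txtvec, "a|A"),
  C       = count_matches(txtvec, "c|C"),
  G       = count_matches(txtvec, "g|G"),
  T       = count_matches(txtvec, "t|T"),
  periods = count_matches(txtvec, ".", fixed = TRUE),
  commas  = count_matches(txtvec, ",", fixed = TRUE),
  row.names = txtvec
)
print(counts)
```

Using fixed = TRUE for the punctuation sidesteps the regex escaping question entirely, and counting matches rather than fragments avoids the end-of-string problem with strsplit().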