Dear all,

I tried this code and it runs perfectly:

    # Read the pileup file, keeping every column as character
    df <- read.table("Case2.pileup", fill = TRUE, sep = "\t", colClasses = "character")
    txt <- df[, 9]    # column 9 holds the read bases
    txtvec <- readLines(textConnection(txt))

    # Count matches per row; gregexpr() returns -1 when there is no match
    dad <- data.frame(
      A = unlist(sapply(gregexpr("A|a", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
      C = unlist(sapply(gregexpr("C|c", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
      G = unlist(sapply(gregexpr("G|g", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
      T = unlist(sapply(gregexpr("T|t", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
      N = unlist(sapply(gregexpr("\\,|\\.", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)))

Thanking you,
Warm Regards
Vikas Bansal
Msc Bioinformatics
Kings College London
________________________________________
From: David Winsemius [dwinsem...@comcast.net]
Sent: Saturday, July 02, 2011 9:04 PM
To: Dennis Murphy
Cc: r-help@r-project.org; Bansal, Vikas
Subject: Re: [R] For help in R coding

On reflection and a bit of testing I think the best approach would be
to use gregexpr. For counting the number of commas, this appears quite
straightforward.

> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
 [1] 3 3 3 4 3 3 2 6 4 6 6

It easily generalizes to period and the `|` (or) operation on letters.
(I did need to add the check, since the result of gregexpr always has
length at least one but has the value -1 when there is no match.)

> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
 [1] 0 2 0 0 3 0 0 0 1 0 0

On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:

> Hi:
>
> There seems to be a problem if the string ends in , or . , which makes
> it difficult for strsplit() to pick up if it is splitting on those
> characters. Here is an alternative, splitting on individual characters
> and using charmatch() instead:
>
> charsum <- function(s, char) {
>     u <- strsplit(s, "")
>     sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
> }
>
> unname(sapply(txtvec, function(x) charsum(x, ',')))
> unname(sapply(txtvec, function(x) charsum(x, '.')))
>
> Putting this into a data frame,
>
> dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
>                     commas  = unname(sapply(txtvec, function(x) charsum(x, ','))))
> txtvec
>
> HTH,
> Dennis
>
> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>
>>>>> Dear all,
>>>>>
>>>>> I am doing a project on variant calling using R. I am working on a
>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>> count the number of A, C, G and T in each row for column 9. An
>>>>> example of column 9 is given below:
>>>>>
>>>>> .a,g,,
>>>>> .t,t,,
>>>>> .,c,c,
>>>>> .,a,,,
>>>>> .,t,t,t
>>>>> .c,,g,^!.
>>>>> .g,ggg.^!,
>>>>> .$,,,,,.,
>>>>> a,g,,t,
>>>>> ,,,,,.,^!.
>>>>> ,$,,,,.,.
>>>>>
>>>>> This is a bit confusing for me, as these characters are in one
>>>>> column. How can we scan them for each row to print the number of
>>>>> A, C, G and T for each row?
>>>>
>>>> Seems a bit clunky but this does the job (first the data):
>>>>
>>>> > txt <- " .a,g,,
>>>> + .t,t,,
>>>> + .,c,c,
>>>> + .,a,,,
>>>> + .,t,t,t
>>>> + .c,,g,^!.
>>>> + .g,ggg.^!,
>>>> + .$,,,,,.,
>>>> + a,g,,t,
>>>> + ,,,,,.,^!.
>>>> + ,$,,,,.,."
>>>>
>>>> > txtvec <- readLines(textConnection(txt))
>>>>
>>>> Now the clunky solution. It basically subtracts 1 from the counts of
>>>> "fragments" that result from splitting on each letter in turn. Could
>>>> be made prettier with a function that did the job.
>>>>
>>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )
>>>>            A C G T
>>>> .a,g,,     1 0 1 0
>>>> .t,t,,     0 0 0 2
>>>> .,c,c,     0 2 0 0
>>>> .,a,,,     1 0 0 0
>>>> .,t,t,t    0 0 0 2
>>>> .c,,g,^!.  0 1 1 0
>>>> .g,ggg.^!, 0 0 4 0
>>>> .$,,,,,.,  0 0 0 0
>>>> a,g,,t,    1 0 1 1
>>>> ,,,,,.,^!. 0 0 0 0
>>>> ,$,,,,.,.  0 0 0 0
>>>>
>>>> Has the advantage that the input data ends up as rownames, which
>>>> was a surprise.
>>>>
>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>> argument should be "a|A".
>>>
>>> As you mentioned, if I want to count A and a I should split like
>>> this.
>>>
>>> But can I also count . and , using:
>>>
>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split=".|,"), length) , "-", 1)),
>>>
>>> I tried it but it is not working. It gives output, but in some
>>> places it shows too many . and , and in others it does not count
>>> them at all and just shows 0.
>>
>> You need to use valid regex expressions for 'split'. Since "." is a
>> special character in a regex, it needs to be escaped when you want the
>> literal to be recognized as such (escaping "," is harmless but not
>> required).
>>
>> I haven't figured out why but you need to drop the final operation of
>> subtracting 1 from the values when counting commas:
>>
>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit, split="\\."), length) , "-", 1))
>>   ,commas = unlist( lapply( sapply(txtvec, strsplit, split="\\,"), length) ) )
>>            periods commas
>> .a,g,,           1      3
>> .t,t,,           1      3
>> .,c,c,           1      3
>> .,a,,,           1      4
>> .,t,t,t          1      4
>> .c,,g,^!.        1      4
>> .g,ggg.^!,       2      2
>> .$,,,,,.,        2      6
>> a,g,,t,          0      4
>> ,,,,,.,^!.       1      7
>> ,$,,,,.,.        1      7
>>
>> --
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
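
The gregexpr() approach discussed in this thread can be wrapped in a small helper so that each pattern is written only once. The following is a minimal sketch under stated assumptions, not a definitive implementation: count_matches is a hypothetical name (not a base R function), the example strings are copied from the thread, and the character classes [Aa] and [.,] are swapped in for the "A|a" and "\\,|\\." patterns used in the messages. It also avoids the strsplit() quirk seen in the earlier tables: strsplit() drops an empty fragment after a match at the very end of a string, so fragment counts come out one short when a string ends in the character being counted.

```r
# Pileup-style read-base strings, copied from the example in this thread.
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t",
            ".c,,g,^!.", ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,",
            ",,,,,.,^!.", ",$,,,,.,.")

# count_matches() is a hypothetical helper name, not part of base R.
# gregexpr() returns -1 for a string with no match, so that case maps to 0.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x),
         function(m) if (m[1] == -1L) 0L else length(m),
         integer(1))
}

counts <- data.frame(
  A = count_matches("[Aa]", txtvec),
  C = count_matches("[Cc]", txtvec),
  G = count_matches("[Gg]", txtvec),
  T = count_matches("[Tt]", txtvec),
  N = count_matches("[.,]", txtvec)   # '.' and ',' are literal inside [ ]
)
print(counts)
```

With real data, txtvec could be taken directly from column 9 of the data frame returned by read.table(), as in the first message of the thread.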