Dear all,
I am doing a project on variant calling using R. I am working on a
pileup file. There are 10 columns in my data frame, and I want to
count the number of A, C, G and T in each row of column 9. An example of
column 9 is given below:
.a,g,,
.t,t,,
.,c,c,
.,a,,,
.,t,t,t
.c,,g,^!.
.g,ggg.^!,
.$,,,,,.,
a,g,,t,
,,,,,.,^!.
,$,,,,.,.
This is a bit confusing for me because these characters are all in one
column. How can we scan each row and print the number of A, C, G and T
for that row?
Seems a bit clunky but this does the job (first the data):
txt <- ".a,g,,
.t,t,,
.,c,c,
.,a,,,
.,t,t,t
.c,,g,^!.
.g,ggg.^!,
.$,,,,,.,
a,g,,t,
,,,,,.,^!.
,$,,,,.,."
txtvec <- readLines(textConnection(txt))
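Now the clunky solution. Basically, it subtracts 1 from the count of
"fragments" that result from splitting on each letter in turn. Note
that strsplit() drops an empty fragment at the end of a string, so a
letter in the very last position goes uncounted (see the T count for
.,t,t,t below). Could be made prettier with a function that did the job.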
data.frame(
  A = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "a"),
                           length), "-", 1)),
  C = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "c"),
                           length), "-", 1)),
  G = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "g"),
                           length), "-", 1)),
  T = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "t"),
                           length), "-", 1))
)
           A C G T
.a,g,,     1 0 1 0
.t,t,,     0 0 0 2
.,c,c,     0 2 0 0
.,a,,,     1 0 0 0
.,t,t,t    0 0 0 2
.c,,g,^!.  0 1 1 0
.g,ggg.^!, 0 0 4 0
.$,,,,,.,  0 0 0 0
a,g,,t,    1 0 1 1
,,,,,.,^!. 0 0 0 0
,$,,,,.,.  0 0 0 0
This has the advantage that the input data ends up as row names, which
was a surprise.
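
The "function that did the job" idea could be sketched roughly as
follows. This is only one possible approach and not the code from the
post: it swaps strsplit() for gregexpr(), which reports the match
positions directly, so a letter at the very end of a string is counted
as well. The names count_letter and count_bases are made up for
illustration.

```r
# Hypothetical helper: count occurrences of one literal letter in each string.
count_letter <- function(x, letter) {
  # gregexpr() gives the match positions per string, or -1 when nothing matches
  hits <- gregexpr(letter, x, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Hypothetical wrapper: one column of counts per base, input strings as row names.
count_bases <- function(x, bases = c("a", "c", "g", "t")) {
  counts <- sapply(bases, function(b) count_letter(x, b))
  # Force a matrix so that a single input string also works
  counts <- matrix(counts, nrow = length(x),
                   dimnames = list(NULL, toupper(bases)))
  data.frame(counts, row.names = x)   # row names must be unique
}

count_bases(c(".a,g,,", ".,t,t,t", ".g,ggg.^!,"))
```

With this sketch the string .,t,t,t gets T = 3. Using the strings as row
names only works while they are all distinct, which is unlikely for a
full pileup file; in that case dropping the row.names argument and
binding the counts to the original data frame with cbind() would be the
safer choice.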
If you wanted to count "A" and "a" as equivalent, then the split
argument should be "a|A".
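
As a small illustration of treating upper and lower case alike, the
sketch below counts matches of the "a|A" pattern and, as an alternative
that is not from the post, lower-cases the strings first with tolower().
The helper name n_matches and the example strings are invented for the
illustration, and gregexpr() is used here instead of strsplit().

```r
# Invented example strings with mixed-case base calls
reads <- c(".A,a,,", ".,T,t,t", ",,,,")

# Made-up helper: number of matches of a regular expression in each string
n_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(h) sum(h > 0), integer(1))
}

n_matches("a|A", reads)           # the pattern suggested in the post
n_matches("a", tolower(reads))    # same counts after lower-casing
n_matches("t", tolower(reads))    # works the same way for any base
```

Lower-casing once up front keeps the patterns simple when all four bases
are counted, at the cost of losing the upper/lower case distinction in
the copy that is being counted.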