Dear listserv,

I don't know if this question is more appropriate for the Bioconductor listserv 
or the general R listserv. I am asking it here because I believe this problem 
can be solved using regular R commands in the base package. I suspect you all 
will be very helpful.

I have genetic sequence data in the following form. Each letter represents a 
nucleotide.

ref.sequence <- "ATAGCCGCA"
sequence1 <- "AT[G][C][C]AGCCG[T]CA"
sequence2 <- "ATAGCCGC[C][A][C]A"
sequence3 <- "AT[GCC]AGCCGCA"

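The brackets mark nucleotide "insertions" relative to the reference 
sequence ("ref.sequence"). An insertion may be written one nucleotide per 
bracket (as in sequence1) or as a run inside a single bracket (as in 
sequence3). Any given sequence may carry some, all, or none of the insertions.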

What I want is for all of the positions to "align" (line up) properly. 
Therefore, a sequence lacking a particular insertion should be padded with 
one dash per missing nucleotide at that position.

I want to end up with this:

ref.sequence should look like this: "AT---AGCCG-C---A"
sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"
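
To make the target concrete, here is a rough base-R sketch of one way the 
padding could be done. It is only an illustration under several assumptions: 
every sequence contains the same reference bases, insertions at the same 
reference position are simply left-justified against each other (no attempt 
is made to align the inserted bases themselves), and the helper names 
parse_seq and align_insertions are made up for this example.

```r
seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1    = "AT[G][C][C]AGCCG[T]CA",
          sequence2    = "ATAGCCGC[C][A][C]A",
          sequence3    = "AT[GCC]AGCCGCA")

# Hypothetical helper: split one sequence into its reference bases and the
# inserted bases that follow each reference position (slot 1 = before base 1).
parse_seq <- function(s) {
  tokens <- regmatches(s, gregexpr("\\[[A-Za-z]+\\]|[A-Za-z]", s))[[1]]
  is_ins <- substr(tokens, 1, 1) == "["
  pos    <- cumsum(!is_ins)        # reference position each token follows
  n      <- sum(!is_ins)
  ins <- vapply(0:n, function(k) {
    paste(gsub("\\[|\\]", "", tokens[is_ins & pos == k]), collapse = "")
  }, character(1))
  list(base = tokens[!is_ins], ins = ins)
}

# Hypothetical helper: pad every insertion slot to the widest insertion seen
# at that slot across all sequences, one dash per missing nucleotide.
align_insertions <- function(seqs) {
  parsed  <- lapply(seqs, parse_seq)
  n_bases <- vapply(parsed, function(p) length(p$base), integer(1))
  stopifnot(length(unique(n_bases)) == 1)   # assumption: same reference bases
  ins_len <- vapply(parsed, function(p) nchar(p$ins),
                    integer(n_bases[[1]] + 1))
  width   <- apply(ins_len, 1, max)
  vapply(parsed, function(p) {
    shown <- gsub("([A-Za-z])", "[\\1]", p$ins)   # one bracket per nucleotide
    slots <- paste0(shown, strrep("-", width - nchar(p$ins)))
    # interleave: slot before base 1, then base, slot, base, slot, ...
    paste0(c(slots[1], rbind(p$base, slots[-1])), collapse = "")
  }, character(1))
}

aligned <- align_insertions(seqs)
print(aligned)
```

On the four example sequences this yields the strings listed above, but it 
loops over every slot of every sequence, so it may not scale well to many 
long sequences.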

So how can I make this happen efficiently?

Thanks very much in advance,
-----------------------------------
Josh Banta, Ph.D.
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
