Gabor, Thank you very very much! Your code consists of so many commands I never played with, both helpful and educational.
Thanks!
Tal

----------------Contact Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

On Tue, Mar 16, 2010 at 4:24 PM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:

> We show how to use the gsubfn package to parse this.
>
> The rules are not entirely clear so we will assume the following:
>
> - there is a fixed template for the output which is the same as your
>   output but possibly with different character strings filled in. This
>   implies, for example, that there are exactly Stem0, Stem1, Stem2 and
>   Stem3 and no fewer or more stems.
>
> - the sequence always starts with the open of Stem0, at least one dot
>   and the open of Stem1. There are no dots prior to the open of Stem0.
>   This seems to be implicit in your sample output since there is no zero
>   length string in your sample output corresponding to dots prior to
>   Stem0.
>
> - Stem0 closes with the same number of < as there are > to open it
>
> You can modify this yourself to take into account the actual rules
> whatever they are.
>
> We first calculate, k, the number of leading >'s using strapply.
>
> Then we replace the leading k >'s with }'s and the trailing k <'s with
> {'s giving us Str3:
>
> "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
>
> We again use strapply, this time to get the lengths of the runs. Note that
> zero length runs are possible so we cannot, for example, use rle for this.
> For example there is a zero length run of dots between the last < and the
> first {. read.fwf is used to actually parse out the strings using the
> lengths we just calculated.
>
> Finally we fill in the template using relist.
>
> # inputs
>
> Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> template <-
>   list(
>     "Stem 0 opening" = "",
>     "before Stem 1" = "",
>     "Stem 1" = list(opening = "",
>                     inside = "",
>                     closing = ""
>     ),
>     "between Stem 1 and 2" = "",
>     "Stem 2" = list(opening = "",
>                     inside = "",
>                     closing = ""
>     ),
>     "between Stem 2 and 3" = "",
>     "Stem 3" = list(opening = "",
>                     inside = "",
>                     closing = ""
>     ),
>     "After Stem 3" = "",
>     "Stem 0 closing" = ""
>   )
>
> # processing
>
> # create string made by repeating string s k times followed by more
> reps <- function(s, k, more = "") {
>   paste(paste(rep(s, k), collapse = ""), more, sep = "")
> }
>
> library(gsubfn)
> k <- nchar(strapply(Str, "^>+", c)[[1]])
> Str2 <- sub("^>+", reps("}", k), Str)
> Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
>
> pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
> lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
> tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
> closeAllConnections()
> tokens[is.na(tokens)] <- ""
> out <- relist(tokens, template)
> out
>
> Here is the str of the output for your sample input:
>
> > str(out)
> List of 9
>  $ Stem 0 opening      : chr "GCCTCGA"
>  $ before Stem 1       : chr "TA"
>  $ Stem 1              :List of 3
>   ..$ opening: chr "GCTC"
>   ..$ inside : chr "AGTTGGGA"
>   ..$ closing: chr "GAGC"
>  $ between Stem 1 and 2: chr "G"
>  $ Stem 2              :List of 3
>   ..$ opening: chr "TACGA"
>   ..$ inside : chr "CTGAAGA"
>   ..$ closing: chr "TCGTA"
>  $ between Stem 2 and 3: chr "AGGtC"
>  $ Stem 3              :List of 3
>   ..$ opening: chr "ACCAG"
>   ..$ inside : chr "TTCGATC"
>   ..$ closing: chr "CTGGT"
>  $ After Stem 3        : chr ""
>  $ Stem 0 closing      : chr "TCGGGGC"
>
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.gal...@gmail.com> wrote:
> > Hello all,
> >
> > For some work I am doing on RNA, I want to use R
> > to do string parsing that (I think) is like a simplistic HTML parsing.
> >
> > For example, let's say we have the following two variables:
> >
> > Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> > Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> >
> > Say that I want to parse "Seq" according to "Str", using the legend here:
> >
> > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
> >      |     |  |              | |               |     |               ||     |
> >      +-----+  +--------------+ +---------------+     +---------------++-----+
> >         |          Stem 1           Stem 2                Stem 3         |
> >         |                                                                |
> >         +----------------------------------------------------------------+
> >                                      Stem 0
> >
> > Assume that we always have 4 stems (0 to 3), but that the number of
> > letters before and after each of them can vary.
> >
> > The output should be something like the following list structure:
> >
> > list(
> >   "Stem 0 opening" = "GCCTCGA",
> >   "before Stem 1" = "TA",
> >   "Stem 1" = list(opening = "GCTC",
> >                   inside = "AGTTGGGA",
> >                   closing = "GAGC"
> >   ),
> >   "between Stem 1 and 2" = "G",
> >   "Stem 2" = list(opening = "TACGA",
> >                   inside = "CTGAAGA",
> >                   closing = "TCGTA"
> >   ),
> >   "between Stem 2 and 3" = "AGGtC",
> >   "Stem 3" = list(opening = "ACCAG",
> >                   inside = "TTCGATC",
> >                   closing = "CTGGT"
> >   ),
> >   "After Stem 3" = "",
> >   "Stem 0 closing" = "TCGGGGC"
> > )
> >
> > I don't have any experience with programming a parser, and would like
> > advice on what strategy to use when programming something like this
> > (and any recommended R commands to use).
> >
> > What I was thinking of is to first get rid of "Stem 0", then go through
> > the inner string with a recursive function (let's call it
> > "seperate.stem") that each time will split the string into:
> > 1. before stem
> > 2. opening stem
> > 3. inside stem
> > 4. closing stem
> > 5. after stem
> >
> > The "after stem" part will then be passed recursively into the same
> > function ("seperate.stem").
> >
> > The thing is that I am not sure how to do this coding without using
> > a loop.
> >
> > Any advice will be most welcome.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
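
As an aside to the thread above, here is a minimal base-R sketch of the same idea (one capture group per field, then cutting Seq by the group lengths) that avoids the gsubfn dependency. It is not Gabor's code: it swaps strapply and read.fwf for regexec, regmatches and substring, and it assumes the same rules stated in the reply (exactly four stems, Stem 0 closing with as many < as it opened with >). The flat field names, including "trailing" for the final run of dots, are illustrative choices and not part of the original template.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# k = number of leading ">" characters, i.e. the width of Stem 0
k <- attr(regexpr("^>+", Str), "match.length")

# One inner stem: opening, inside, closing, and the dots that follow it
stem <- "(>*)([.]*)(<*)([.]*)"

# 16 capture groups in total; "{k}" pins Stem 0's opening and closing to
# exactly k characters, so the closing "<" run of Stem 3 cannot absorb them
pat <- sprintf("^(>{%d})([.]*)%s(<{%d})([.]*)$", k, strrep(stem, 3), k)

# Element 1 is the whole match; the remaining elements are the groups,
# including zero-length ones (which rle() would silently skip)
m <- regmatches(Str, regexec(pat, Str, perl = TRUE))[[1]]
lens <- nchar(m[-1])

# Convert run lengths to start/end positions and cut Seq accordingly;
# substring() returns "" when start > end, which covers zero-length fields
ends <- cumsum(lens)
starts <- ends - lens + 1
tokens <- substring(Seq, starts, ends)

# Illustrative flat names (assumption: not taken from the original template)
names(tokens) <- c(
  "Stem 0 opening", "before Stem 1",
  "Stem 1 opening", "Stem 1 inside", "Stem 1 closing", "between Stem 1 and 2",
  "Stem 2 opening", "Stem 2 inside", "Stem 2 closing", "between Stem 2 and 3",
  "Stem 3 opening", "Stem 3 inside", "Stem 3 closing", "After Stem 3",
  "Stem 0 closing", "trailing"
)
tokens
```

The flat named vector can be reshaped into the nested list from the original question in whatever way suits the downstream code; the sketch stops at the tokens so that it only relies on the regular expression and the length arithmetic.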