Re: [R] Pattern match

neetika nath Fri, 22 Apr 2011 06:33:05 -0700

Thank you so much.

On Fri, Apr 22, 2011 at 1:29 PM, David Winsemius <dwinsem...@comcast.net> wrote:


>
> On Apr 22, 2011, at 6:42 AM, neetika nath wrote:
>
>
> Thank you for your message. Please see the attached file for the template/test
> dataset from my file.
>
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius
> <dwinsem...@comcast.net> wrote:
>
>>
>> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>>
>>  Thank you Dennis,
>>>
>>> Yes, the problem is the input file. I have an .rdf file, and its format is
>>> the same as what I posted earlier. If I open the file in Notepad++, the
>>> lines are broken with CR+LF characters. Is there any way to retrieve the
>>> SpeciesScientific information without changing the input file?
>>>
>>
>> You might consider attaching the original file named with an extension of
>> `.txt`, since your verbal description does not match your included example.
>> What I see after the various servers have passed this around and inserted
>> line-ends is the string `SpeciesScientific` in the first line, rather than
>> in the third.
>>
>  lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
>  lines <- readLines(lcon)
>  lines
> #-----don't paste---
>  [1] "--"
>  [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
>  [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
>  [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
>  [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
>  [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
>  [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
>  [8] ""
>  [9] "--"
> [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
> [11] "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
> # end don't paste-------------
>
>
> # So the first goal is to collapse the broken lines, but only within the
> # boundaries of "--"
> # Find the line numbers with "--"
>  startidx <- grep("\\-\\-", lines)
>  startidx
> #[1]  1  9 17
>  endidx <- c(startidx[-1]-1, length(lines))
>  endidx
> #[1]  8 16 25
> # Now collapse within those ranges
>  unplus <- sapply(1:length(startidx), function(x) {
>      gsub("\\+", "", paste(lines[startidx[x]:endidx[x]], collapse = ""))
>  })
> # break on what appears to be the correct delimiter, ";"
>  lapply(unplus, function(longline)
>      grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
> #[[1]]
> #[1] 7
>
> #[[2]]
> #[1] 5
>
> #[[3]]
> #[1] 6
> # Seems to succeed (admittedly after some errors that were elided), so save it
>
>  lidx <- lapply(unplus, function(longline)
>      grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
> #Create a properly split list to work with
>  breaklist <- strsplit(unplus, ";")
> # And extract the desired elements
>  sapply(1:length(startidx), function(idx) breaklist[[idx]][ lidx[[idx]] ] )
> #[1] "SpeciesScientific=(Homo sapiens)"               "SpeciesScientific=(Achromobacter cycloclastes)"
> #[3] "SpeciesScientific=(Triticum aestivum)"
> # Pulling the species from this simple list is left as a reader's exercise
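
One way to finish the exercise above is a single `sub()` call that keeps only the text between the outer parentheses. This is a minimal sketch, not necessarily what David had in mind; the vector `hits` is a hypothetical name holding the three strings copied from the output above, and the pattern assumes every element has exactly the form `SpeciesScientific=(...)`.

```r
# Strings copied by hand from the output shown above; `hits` is a
# hypothetical name used only for this illustration.
hits <- c("SpeciesScientific=(Homo sapiens)",
          "SpeciesScientific=(Achromobacter cycloclastes)",
          "SpeciesScientific=(Triticum aestivum)")

# Capture everything between the opening "(" and the final ")" and
# replace the whole string with that captured group.
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", hits)
species
```

Elements that do not match the pattern are returned unchanged by `sub()`, so it may be worth checking the result with `grepl()` on real data.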
>
> --
>> David
>>
>
>>
>> --
>>
>>>
>>> Thank you
>>>
>>> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmu...@gmail.com>
>>> wrote:
>>>
>>>  Hi:
>>>>
>>>> This is a bit of a roundabout approach; I'm sure that folks with regex
>>>> expertise will trump this in a heartbeat. I modified the last piece of
>>>> the string a bit to accommodate the approach below. Depending on where
>>>> the strings have line breaks, you may have some odd '\n' characters
>>>> inserted.
>>>>
>>>> # Step 1: read the input as a single character string
>>>> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
>>>> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>>>>
>>>> # Step 2: Split input lines by the ';' delimiter and then use lapply()
>>>> # to split variable names from values.
>>>> # This results in a nested list for ulist2.
>>>> ulist <- strsplit(u, ';')
>>>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>>>
>>>> # Step 3: Break out the results into a matrix whose first column is
>>>> # the variable name
>>>> # and whose second column is the value (with parens included)
>>>> # This avoids dealing with nested lists
>>>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>>>
>>>> # Step 4: Strip off the parens
>>>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>>>> colnames(w) <- c('Name', 'Value')
>>>> w
>>>>       Name                 Value
>>>>  [1,] "SpeciesCommon"      "Human"
>>>>  [2,] "SpeciesScientific"  "Homo sapiens"
>>>>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>>>>  [4,] "BondInvolved"       "C-H"
>>>>  [5,] "EzCatDBID"          "S00343"
>>>>  [6,] "BondFormed"         "O-H,O-H"
>>>>  [7,] "Bond"               "255B"
>>>>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>>>>  [9,] "CatalyticSwissProt" "P25006"
>>>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>>>> [11,] "SpeciesCommon"      "Bacteria"
>>>> [12,] "Reactive"           "Ce+"
>>>>
>>>> # Step 5: Subset out the values of the SpeciesScientific variables
>>>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>>>                       Value
>>>> 2                 Homo sapiens
>>>> 10 Achromobacter\ncycloclastes
>>>>
>>>>
>>>> One possible 'advantage' of this approach is that if you have a number
>>>> of string records of this type, you can create nested lists for each
>>>> string and then manipulate the lists to get what you need. Hopefully
>>>> you can use some of these ideas for other purposes as well.
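
The stray `\n` in the second species value above can be tidied after the fact. The following is a minimal sketch, assuming the values have already been extracted into a character vector; `value` and `clean` are hypothetical names, and the two strings are copied by hand from the Step 5 output.

```r
# Values as they appear in the Step 5 output (copied by hand).
value <- c("Homo sapiens", "Achromobacter\ncycloclastes")

# Replace any run of whitespace, including "\n" or "\r\n", with one space.
clean <- gsub("[[:space:]]+", " ", value)
clean
```

This treats every line break inside a value as a word separator, which seems right for species names but may not hold for other fields.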
>>>>
>>>> Dennis
>>>>
>>>>
>>>>
>>>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkiha...@gmail.com> wrote:
>>>>
>>>>> Hi ALL,
>>>>>
>>>>> I have a very simple question regarding pattern matching. Could anyone
>>>>> tell me how I can use R to retrieve a string pattern from a text file? For
>>>>> example, my file contains the following information:
>>>>>
>>>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>>>
>>>>> and I want to extract the SpeciesScientific = (?) information from this file.
>>>>> The problem is in the 3rd line, where the word SpeciesScientific is split by a +.
>>>>>
>>>>> Could anyone help me please?
>>>>> Thank you
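
A compact way to handle the question as posed is to strip the trailing `+` from each line, paste the lines back together, and then match the field directly. This is only a sketch under stated assumptions: that a `+` at the end of a line is purely a continuation marker, that the lines arrive as they would from `readLines()`, and that species names never contain a `)`. The names `lines`, `joined`, `fields`, and `species` are illustrative.

```r
# The four lines from the question, as readLines() would return them.
lines <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Drop only the trailing "+" on each line, then glue the pieces together.
joined <- paste(sub("\\+$", "", lines), collapse = "")

# Find every SpeciesScientific=(...) field; [^)]* stops at the first ")".
fields <- regmatches(joined, gregexpr("SpeciesScientific=\\([^)]*\\)", joined))[[1]]

# Keep just the text inside the parentheses.
species <- sub("^SpeciesScientific=\\(([^)]*)\\)$", "\\1", fields)
species
```

Anchoring the `+` to the end of the line with `\\+$` leaves any `+` that occurs elsewhere in a line untouched, unlike a global `gsub("\\+", "", ...)`.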
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
> <temp_test.txt>
>
>
>  David Winsemius, MD
> West Hartford, CT
>
>

