On Apr 22, 2011, at 6:42 AM, neetika nath wrote:

>
> Thank you for your message. Please see the attached file for the
> template/test dataset of my file.
>
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>
> Thank you Dennis,
>
> Yes, the problem is the input file. I have an .rdf file in the same
> format I posted earlier. If I open that file in Notepad++, the lines
> are broken with CR+LF characters. Is there any way to retrieve the
> SpeciesScientific information without changing the input file?
>
> You might consider attaching the original file named with an  
> extension of `.txt`, since your verbal description does not match  
> your included example. What I see after the various servers have  
> passed this around and inserted line-ends is the string  
> `SpeciesScientific` in the first line, rather than in the third.
>
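On the CR+LF concern: readLines() accepts LF, CRLF, or CR as the end-of-line marker, so the input file should not need to be changed. A minimal sketch to check this, assuming a throwaway temporary file whose contents are made up for illustration (the names tmp, con and crlf_lines are not from the thread):

```r
# Write a small file with CRLF line endings (illustrative contents only)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")   # binary mode so "\r\n" is written verbatim
writeLines(c("--", "SpeciesCommon=(Human);Sp+", "eciesScientific=(Homo sapiens);"),
           con, sep = "\r\n")
close(con)

# Read it back the usual way and check for leftover carriage returns
crlf_lines <- readLines(tmp)
any(grepl("\r", crlf_lines, fixed = TRUE))
unlink(tmp)
```

If the check reports FALSE, the line terminators were consumed by readLines() and only the trailing "+" continuation markers remain to be dealt with.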
  lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
  lines <- readLines(lcon)
  lines
#-----don't paste---
  [1] "--"
  [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
  [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
  [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
  [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
  [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
  [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
  [8] ""
  [9] "--"
 [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
 [11] "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
# end don't paste-------------


# So the first goal is to collapse the broken lines, but only within the
# boundaries of "--"
# Find the line numbers with "--"
  startidx <- grep("\\-\\-", lines)
  startidx
#[1]  1  9 17
  endidx <- c(startidx[-1]-1, length(lines))
  endidx
#[1]  8 16 25
# Now collapse within those ranges
  unplus <- sapply(seq_along(startidx), function(x) {
      gsub("\\+", "", paste(lines[startidx[x]:endidx[x]], collapse = ""))
  })
# break on what appears to be the correct delimiter, ";"
  lapply(unplus, function(longline)
      grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
#[[1]]
#[1] 7

#[[2]]
#[1] 5

#[[3]]
#[1] 6
# Seems to succeed (admittedly after some errors that were elided), so
# save it

  lidx <- lapply(unplus, function(longline)
      grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
#Create a properly split list to work with
  breaklist <- strsplit(unplus, ";")
# And extract the desired elements
  sapply(seq_along(startidx), function(idx) breaklist[[idx]][lidx[[idx]]])
#[1] "SpeciesScientific=(Homo sapiens)"
#[2] "SpeciesScientific=(Achromobacter cycloclastes)"
#[3] "SpeciesScientific=(Triticum aestivum)"
# Pulling the species from this simple list is left as a reader's exercise
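
For readers who want a starting point on that exercise, here is one possible sketch, not the only way to do it: strip the fixed prefix and the closing parenthesis with sub(). The helper name extract_species is hypothetical, and the input vector just repeats the three strings printed above so that the snippet runs on its own.

```r
# Hypothetical helper (not from the thread): keep only the text between
# "SpeciesScientific=(" and the final ")" of each element.
extract_species <- function(x) {
  sub("^SpeciesScientific=\\((.*)\\)$", "\\1", x)
}

# The three strings shown in the output above
found <- c("SpeciesScientific=(Homo sapiens)",
           "SpeciesScientific=(Achromobacter cycloclastes)",
           "SpeciesScientific=(Triticum aestivum)")

species <- extract_species(found)
species
#[1] "Homo sapiens"               "Achromobacter cycloclastes"
#[3] "Triticum aestivum"
```

Elements that do not match the pattern are returned unchanged by sub(), so it may be worth checking the result before relying on it.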

-- 
David

>
> -- 
>
> Thank you
>
> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmu...@gmail.com> wrote:
>
> Hi:
>
> This is a bit of a roundabout approach; I'm sure that folks with regex
> expertise will trump this in a heartbeat. I modified the last piece of
> the string a bit to accommodate the approach below. Depending on where
> the strings have line breaks, you may have some odd '\n' characters
> inserted.
>
> # Step 1: read the input as a single character string
> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>
> # Step 2: Split input lines by the ';' delimiter and then use lapply()
> # to split variable names from values.
> # This results in a nested list for ulist2.
> ulist <- strsplit(u, ';')
> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>
> # Step 3: Break out the results into a matrix whose first column is
> # the variable name and whose second column is the value (with parens
> # included). This avoids dealing with nested lists.
> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>
> # Step 4: Strip off the parens
> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
> colnames(w) <- c('Name', 'Value')
> w
>       Name                 Value
>  [1,] "SpeciesCommon"      "Human"
>  [2,] "SpeciesScientific"  "Homo sapiens"
>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>  [4,] "BondInvolved"       "C-H"
>  [5,] "EzCatDBID"          "S00343"
>  [6,] "BondFormed"         "O-H,O-H"
>  [7,] "Bond"               "255B"
>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>  [9,] "CatalyticSwissProt" "P25006"
> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
> [11,] "SpeciesCommon"      "Bacteria"
> [12,] "Reactive"           "Ce+"
>
> # Step 5: Subset out the values of the SpeciesScientific variables
> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>                          Value
> 2                 Homo sapiens
> 10 Achromobacter\ncycloclastes
>
>
> One possible 'advantage' of this approach is that if you have a number
> of string records of this type, you can create nested lists for each
> string and then manipulate the lists to get what you need. Hopefully
> you can use some of these ideas for other purposes as well.
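
Two details in the output above may matter in practice: the embedded "\n" in the second species name, and the loss of inner parentheses (Cu(II) came out as CuII). The following is a sketch under assumptions, not part of the approach as posted: it normalizes whitespace first, takes the name as everything before the first '=', and strips only the outermost parentheses from the value. The names u2, fields, keys, vals and w2 are illustrative, and the input is a shortened version of the string used above.

```r
# Shortened illustrative input with an embedded newline and nested parentheses
u2 <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);Cofactors=(Cu(II),CU,501,A);SpeciesScientific=(Achromobacter\ncycloclastes)"

# Replace any run of whitespace (including "\n") with a single space
u2 <- gsub("[[:space:]]+", " ", u2)

# One element per "name=(value)" field
fields <- strsplit(u2, ";", fixed = TRUE)[[1]]

# Name: everything before the first "="
keys <- sub("=.*$", "", fields)
# Value: text inside the outermost pair of parentheses, inner ones kept
vals <- sub("^[^=]*=\\((.*)\\)$", "\\1", fields)

w2 <- data.frame(Name = keys, Value = vals, stringsAsFactors = FALSE)
w2$Value[w2$Name == "SpeciesScientific"]
#[1] "Homo sapiens"               "Achromobacter cycloclastes"
```

Replacing the newline with a space assumes the break fell between two words; if a break can fall in the middle of a word, the continuation lines need to be joined without a separator instead.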
>
> Dennis
>
>
>
> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkiha...@gmail.com> wrote:
> Hi ALL,
>
> I have a very simple question regarding pattern matching. Could anyone
> tell me how I can use R to retrieve a string pattern from a text file?
> For example, my file contains the following information:
>
> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>
> and I want to extract the “SpeciesScientific = (?)” information from this
> file. The problem is in the 3rd line, where the word SpeciesScientific is
> split by a +.
>
> Could anyone help me please?
> Thank you
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
> David Winsemius, MD
> West Hartford, CT
>
>
> <temp_test.txt>

David Winsemius, MD
West Hartford, CT

