On Apr 22, 2011, at 6:42 AM, neetika nath wrote:

> Thank you for your message. Please see the attached file for the
> template/test dataset of my file.
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius
> <dwinsem...@comcast.net> wrote:
>
> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>
> Thank you Dennis,
>
> Yes, the problem is the input file. I have an .rdf file, and its format
> is the same as what I posted earlier. If I open that file in Notepad++,
> the lines are broken with CR+LF characters. Any suggestion for retrieving
> the SpeciesScientific information without changing the input file?
>
> You might consider attaching the original file named with an
> extension of `.txt`, since your verbal description does not match
> your included example. What I see after the various servers have
> passed this around and inserted line-ends is the string
> `SpeciesScientific` in the first line, rather than in the third.

lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
lines <- readLines(lcon)
lines
#-----don't paste---
 [1] "--"
 [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
 [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
 [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
 [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
 [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
 [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
 [8] ""
 [9] "--"
[10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
[11] "$DATUM CatalyticCATH =(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
# end don't paste-------------

# So the first goal is to collapse the broken lines, but only within
# the boundaries of "--"

# Find the line numbers with "--"
startidx <- grep("\\-\\-", lines)
startidx
#[1]  1  9 17
endidx <- c(startidx[-1]-1, length(lines))
endidx
#[1]  8 16 25

# Now collapse within those ranges (note: gsub drops every "+", not
# only the ones at line ends)
unplus <- sapply(1:length(startidx), function(x){
    gsub("\\+", "",
         paste(lines[startidx[x]:endidx[x]], collapse="") ) } )

# Break on what appears to be the correct delimiter, ";"
lapply(unplus, function(longline)
    grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]] ) )
#[[1]]
#[1] 7
#[[2]]
#[1] 5
#[[3]]
#[1] 6

# Seems to succeed (admittedly after some errors that were elided). So save it
lidx <- lapply(unplus, function(longline)
    grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]] ) )

# Create a properly split list to work with
breaklist <- strsplit(unplus, ";")

# And extract the desired elements
sapply(1:length(startidx), function(idx) breaklist[[idx]][ lidx[[idx]] ] )
#[1] "SpeciesScientific=(Homo sapiens)" "SpeciesScientific=(Achromobacter cycloclastes)"
#[3] "SpeciesScientific=(Triticum aestivum)"

# Pulling the species from this simple list is left as a reader's exercise

-- David

> --
>
> Thank you
>
> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmu...@gmail.com> wrote:
>
> Hi:
>
> This is a bit of a roundabout approach; I'm sure that folks with regex
> expertise will trump this in a heartbeat. I modified the last piece of
> the string a bit to accommodate the approach below. Depending on where
> the strings have line breaks, you may have some odd '\n' characters
> inserted.
>
> # Step 1: read the input as a single character string
> # (the one embedded "\n" reproduces the line break seen in the output below)
> u <- paste("SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);",
>            "ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);",
>            "EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);",
>            "Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);",
>            "CatalyticSwissProt=(P25006);",
>            "SpeciesScientific=(Achromobacter\ncycloclastes);",
>            "SpeciesCommon=(Bacteria);Reactive=(Ce+)", sep = "")
>
> # Step 2: Split input lines by the ';' delimiter and then use lapply()
> # to split variable names from values.
> # This results in a nested list for ulist2.
> ulist <- strsplit(u, ';')
> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>
> # Step 3: Break out the results into a matrix whose first column is
> # the variable name and whose second column is the value (with parens
> # included). This avoids dealing with nested lists.
> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>
> # Step 4: Strip off the parens
> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
> colnames(w) <- c('Name', 'Value')
> w
>       Name                 Value
>  [1,] "SpeciesCommon"      "Human"
>  [2,] "SpeciesScientific"  "Homo sapiens"
>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>  [4,] "BondInvolved"       "C-H"
>  [5,] "EzCatDBID"          "S00343"
>  [6,] "BondFormed"         "O-H,O-H"
>  [7,] "Bond"               "255B"
>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>  [9,] "CatalyticSwissProt" "P25006"
> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
> [11,] "SpeciesCommon"      "Bacteria"
> [12,] "Reactive"           "Ce+"
>
> # Step 5: Subset out the values of the SpeciesScientific variables
> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>                          Value
> 2                 Homo sapiens
> 10 Achromobacter\ncycloclastes
>
> One possible 'advantage' of this approach is that if you have a number
> of string records of this type, you can create nested lists for each
> string and then manipulate the lists to get what you need. Hopefully
> you can use some of these ideas for other purposes as well.
>
> Dennis
>
> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkiha...@gmail.com> wrote:
>
> Hi ALL,
>
> I have a very simple question regarding pattern matching. Could anyone
> tell me how I can use R to retrieve a string pattern from a text file?
> For example, my file contains the following information:
>
> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>
> and I want to extract the SpeciesScientific = (?) information from this
> file. The problem is in the 3rd line, where the word SpeciesScientific
> is divided by a +.
>
> Could anyone help me, please?
> Thank you
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> <temp_test.txt>

David Winsemius, MD
West Hartford, CT
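
One possible way to finish the exercise left to the reader above, offered as a sketch rather than as the method anyone in the thread used. The `lines` vector here is a made-up stand-in for the `readLines()` result, and it assumes that records start at a line that is exactly "--" and that a trailing "+" is the only continuation marker. The names `rec_id`, `records`, `pat` and `species` are illustrative.

```r
# Hypothetical stand-in for readLines() on the real file: two records
# delimited by "--", with a trailing "+" marking a wrapped line (assumed).
lines <- c(
  "--",
  "$DATUM CatalyticSwissProt=(P15559);SpeciesCommon=(Human);Sp+",
  "eciesScientific=(Homo sapiens);EzCatDBID=(S00343)",
  "--",
  "$DATUM Cofactors=(Cu(II),CU,501,A);SpeciesScientific=(Achromobacter cyclo+",
  "clastes);SpeciesCommon=(Bacteria)"
)

# Assign each line to a record: the counter goes up at every "--" line.
rec_id <- cumsum(lines == "--")

# Drop only the trailing "+" of each line, then glue each record together.
records <- unname(vapply(split(sub("\\+$", "", lines), rec_id),
                         paste, character(1), collapse = ""))

# Capture the text between "SpeciesScientific=(" and the next ")".
pat <- "SpeciesScientific=\\(([^)]*)\\)"
m <- regmatches(records, regexpr(pat, records))
species <- sub(pat, "\\1", m)
species
```

Stripping the "+" only at the end of a line, instead of everywhere, keeps a "+" that belongs to the data (such as "Ce+") intact. Records without a SpeciesScientific field are silently dropped by `regmatches()`, so `species` can be shorter than `records`.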
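
A note on the name/value splitting approach quoted above: values in the real file can themselves contain "=" (for example BondOrderChanged=(C-N,1,C=N,2,...)), so splitting every field on "=" would yield more than two pieces. The following is a minimal sketch of one alternative, splitting at the first "=" only. The record string `rec` is invented for illustration, every field is assumed to contain at least one "=", and the names `fields`, `eq` and `kv` are hypothetical.

```r
# One collapsed record (made up); the BondOrderChanged value contains "=".
rec <- paste("BondOrderChanged=(C-N,1,C=N,2);SpeciesCommon=(Human);",
             "SpeciesScientific=(Homo sapiens)", sep = "")

fields <- strsplit(rec, ";", fixed = TRUE)[[1]]

# Split each field at its FIRST "=" only, so "=" inside a value survives.
eq <- regexpr("=", fields, fixed = TRUE)
name <- substr(fields, 1, eq - 1)
value <- substr(fields, eq + 1, nchar(fields))

# Remove only the outer parentheses, keeping inner ones such as "Cu(II)".
value <- sub("^\\((.*)\\)$", "\\1", value)

# Named character vector: look values up by field name.
kv <- setNames(value, name)
kv[["SpeciesScientific"]]
```

Removing only the outer parentheses is a deliberate difference from stripping all of them, which would turn a value like "Cu(II)" into "CuII".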