Thank you so much.

On Fri, Apr 22, 2011 at 1:29 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Apr 22, 2011, at 6:42 AM, neetika nath wrote:
>
> Thank you for your message. Please see the attached file for the
> template/test dataset of my file.
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius
> <dwinsem...@comcast.net> wrote:
>
>> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>>
>>> Thank you Dennis,
>>>
>>> Yes, the problem is the input file. I have an .rdf file, and its format
>>> is the same as what I posted earlier. If I open that file in Notepad++,
>>> the lines are broken with CR+LF characters. Any suggestion for how to
>>> retrieve the SpeciesScientific information without changing the input
>>> file?
>>
>> You might consider attaching the original file named with an extension of
>> `.txt`, since your verbal description does not match your included example.
>> What I see after the various servers have passed this around and inserted
>> line-ends is the string `SpeciesScientific` in the first line, rather than
>> in the third.
>
> lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
> lines <- readLines(lcon)
> lines
> #-----don't paste---
> [1] "--"
> [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
> [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
> [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
> [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
> [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
> [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
> [8] ""
> [9] "--"
> [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
> [11] "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
> # end don't paste-------------
>
> # So the first goal is to collapse the broken lines, but only within the
> # boundaries of "--"
> # Find the line numbers with "--"
> startidx <- grep("\\-\\-", lines)
> startidx
> #[1] 1 9 17
> endidx <- c(startidx[-1]-1, length(lines))
> endidx
> #[1] 8 16 25
> # Now collapse within those ranges
> unplus <- sapply(1:length(startidx), function(x){
>     gsub("\\+", "", paste(lines[startidx[x]:endidx[x]], collapse="") )
>     } )
> # break on what appears to be the correct delimiter, ";"
> lapply(unplus, function(longline)
>     grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]] ) )
> #[[1]]
> #[1] 7
>
> #[[2]]
> #[1] 5
>
> #[[3]]
> #[1] 6
> # Seems to succeed (admittedly after some errors that were elided). So save it
>
> lidx <- lapply(unplus, function(longline)
>     grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]] ) )
> # Create a properly split list to work with
> breaklist <- strsplit(unplus, ";")
> # And extract the desired elements
> sapply(1:length(startidx), function(idx) breaklist[[idx]][ lidx[[idx]] ] )
> #[1] "SpeciesScientific=(Homo sapiens)" "SpeciesScientific=(Achromobacter cycloclastes)"
> #[3] "SpeciesScientific=(Triticum aestivum)"
> # Pulling the species from this simple list is left as a reader's exercise
>
> --
> David
>
>>> Thank you
>>>
>>> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmu...@gmail.com> wrote:
>>>
>>>> Hi:
>>>>
>>>> This is a bit of a roundabout approach; I'm sure that folks with regex
>>>> expertise will trump this in a heartbeat. I modified the last piece of
>>>> the string a bit to accommodate the approach below. Depending on where
>>>> the strings have line breaks, you may have some odd '\n' characters
>>>> inserted.
>>>>
>>>> # Step 1: read the input as a single character string
>>>> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
>>>> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>>>>
>>>> # Step 2: Split input lines by the ';' delimiter and then use lapply()
>>>> # to split variable names from values.
>>>> # This results in a nested list for ulist2.
>>>> ulist <- strsplit(u, ';')
>>>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>>>
>>>> # Step 3: Break out the results into a matrix whose first column is
>>>> # the variable name and whose second column is the value (with parens
>>>> # included). This avoids dealing with nested lists.
>>>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>>>
>>>> # Step 4: Strip off the parens
>>>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>>>> colnames(w) <- c('Name', 'Value')
>>>> w
>>>>       Name                 Value
>>>>  [1,] "SpeciesCommon"      "Human"
>>>>  [2,] "SpeciesScientific"  "Homo sapiens"
>>>>  [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>>>>  [4,] "BondInvolved"       "C-H"
>>>>  [5,] "EzCatDBID"          "S00343"
>>>>  [6,] "BondFormed"         "O-H,O-H"
>>>>  [7,] "Bond"               "255B"
>>>>  [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>>>>  [9,] "CatalyticSwissProt" "P25006"
>>>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>>>> [11,] "SpeciesCommon"      "Bacteria"
>>>> [12,] "Reactive"           "Ce+"
>>>>
>>>> # Step 5: Subset out the values of the SpeciesScientific variables
>>>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>>>                          Value
>>>> 2                 Homo sapiens
>>>> 10 Achromobacter\ncycloclastes
>>>>
>>>> One possible 'advantage' of this approach is that if you have a number
>>>> of string records of this type, you can create nested lists for each
>>>> string and then
>>>> manipulate the lists to get what you need. Hopefully
>>>> you can use some of these ideas for other purposes as well.
>>>>
>>>> Dennis
>>>>
>>>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkiha...@gmail.com> wrote:
>>>>
>>>>> Hi ALL,
>>>>>
>>>>> I have a very simple question regarding pattern matching. Could anyone
>>>>> tell me how I can use R to retrieve a string pattern from a text file?
>>>>> For example, my file contains the following information:
>>>>>
>>>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>>>
>>>>> and I want to extract the SpeciesScientific = (?) information from this
>>>>> file. The problem is in the 3rd line, where the word SpeciesScientific
>>>>> is split by a +.
>>>>>
>>>>> Could anyone help me please?
>>>>> Thank you
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>> [[alternative HTML version deleted]]
>>
>> David Winsemius, MD
>> West Hartford, CT
>
> <temp_test.txt>
>
> David Winsemius, MD
> West Hartford, CT

[[alternative HTML version deleted]]
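
For anyone reading this thread later, the collapse-and-extract steps quoted
above, plus the final step of pulling out the bare species name, might be
condensed into one function along the following lines. This is only a minimal
sketch: `extract_species` is a hypothetical name, and `demo` is made-up sample
input that imitates the layout shown in the thread, not the contents of the
attached file. It assumes that records are separated by lines consisting
solely of "--" and that a "+" at the very end of a line marks a continuation.
Unlike the quoted gsub() call, it strips only a trailing "+", so a "+" inside
a value is left alone, and it returns only the first SpeciesScientific entry
found in each record.

```r
# Hypothetical helper, not taken from the thread: return the
# SpeciesScientific value of each "--"-delimited record.
extract_species <- function(lines) {
  start <- grep("^--$", lines)                 # assumed record separator lines
  end   <- c(start[-1] - 1L, length(lines))    # last line of each record
  vapply(seq_along(start), function(i) {
    rec <- lines[start[i]:end[i]]
    rec <- sub("\\+$", "", rec)                # drop the trailing continuation "+"
    rec <- paste(rec, collapse = "")           # rejoin the wrapped record
    # First "SpeciesScientific=(...)" field in the record, if any
    m <- regmatches(rec, regexpr("SpeciesScientific=\\([^)]*\\)", rec))
    if (length(m) == 0L) return(NA_character_)
    sub("^SpeciesScientific=\\((.*)\\)$", "\\1", m)  # keep only the text in parens
  }, character(1))
}

# Made-up sample input (assumption), wrapped the way the thread describes,
# with the keyword itself split across two lines in the first record.
demo <- c(
  "--",
  "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION",
  "$DATUM CatalyticSwissProt=(P15559);SpeciesCommon=(Human);Sp+",
  "eciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H)",
  "--",
  "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION",
  "$DATUM CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter cycloclastes)"
)

species <- extract_species(demo)
print(species)
```

With a real file, something like extract_species(readLines("temp_test.txt"))
should behave the same way, provided the file follows the layout assumed
above; readLines() already treats CR+LF line endings as ordinary line breaks.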