Hi:
This is a bit of a roundabout approach; I'm sure that folks with regex
expertise will trump this in a heartbeat. I modified the last piece of
the string a bit to accommodate the approach below. Depending on where
the strings have line breaks, you may have some odd '\n' characters
inserted.
# Step 1: read the input as a single character string
u <- paste0(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);",
  "ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);",
  "EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);",
  "Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);",
  "CatalyticSwissProt=(P25006);",
  # the "\n" below stands for a line break carried over from the source text
  "SpeciesScientific=(Achromobacter\ncycloclastes);",
  "SpeciesCommon=(Bacteria);Reactive=(Ce+)"
)
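If the text comes from a file wrapped like the one in the original question, where each physical line ends in a '+' marker, one way to get a single string is to drop the trailing '+' and paste the lines together. This is only a sketch: raw_lines is a shortened, made-up stand-in for what readLines() would return, and treating every trailing '+' as a wrap marker rather than data is an assumption (a value that really ends in '+', such as "Ce+", would lose it).

```r
# Hypothetical stand-in for readLines() output; each line ends in '+'
raw_lines <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria)+"
)

# Drop one trailing '+' per line, then glue the lines into one string
joined <- paste(sub("\\+$", "", raw_lines), collapse = "")
joined
```

The resulting string can then be fed to strsplit() in the same way as u.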
# Step 2: Split input lines by the ';' delimiter and then use lapply()
# to split variable names from values.
# This results in a nested list for ulist2.
ulist <- strsplit(u, ';')
ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
# Step 3: Break out the results into a matrix whose first column is
# the variable name
# and whose second column is the value (with parens included)
# This avoids dealing with nested lists
v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
# Step 4: Strip off the parens (all of them, so Cu(II) becomes CuII)
w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
colnames(w) <- c('Name', 'Value')
w
      Name                 Value
 [1,] "SpeciesCommon"      "Human"
 [2,] "SpeciesScientific"  "Homo sapiens"
 [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
 [4,] "BondInvolved"       "C-H"
 [5,] "EzCatDBID"          "S00343"
 [6,] "BondFormed"         "O-H,O-H"
 [7,] "Bond"               "255B"
 [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
 [9,] "CatalyticSwissProt" "P25006"
[10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
[11,] "SpeciesCommon"      "Bacteria"
[12,] "Reactive"           "Ce+"
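Since the gsub() in Step 4 removes every parenthesis, a value such as Cu(II) comes out as CuII. If the inner parentheses matter, a variant is to remove only one leading '(' and one trailing ')' and to replace any embedded line break with a space. This is a minimal sketch on a few sample values; the vector vals is made up for illustration, and it assumes every value is wrapped in exactly one outer pair of parentheses.

```r
# Sample values as they would look in column 2 of v, before Step 4
vals <- c("(Human)", "(Cu(II),CU,501,A)", "(Achromobacter\ncycloclastes)")

# Remove one leading '(' and one trailing ')'; inner parens survive
stripped <- sub("\\)$", "", sub("^\\(", "", vals))

# Replace any embedded line break with a single space
cleaned <- gsub("\n", " ", stripped, fixed = TRUE)
cleaned
```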
# Step 5: Subset out the values of the SpeciesScientific variables
subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
                         Value
2                 Homo sapiens
10 Achromobacter\ncycloclastes
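For comparison, here is a more direct regex route that skips the matrix step and pulls out the SpeciesScientific values in one pass with gregexpr() and regmatches(). It is a sketch rather than a tested solution for the real file: the string u2 is a shortened example, and the pattern assumes that no value contains a ';'.

```r
# Shortened example string in the same 'Name=(Value);...' layout
u2 <- paste0("SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);",
             "Cofactors=(Cu(II),CU);",
             "SpeciesScientific=(Achromobacter cycloclastes)")

# Find every 'SpeciesScientific=(...)' chunk; [^;]* cannot cross a ';'
m <- regmatches(u2, gregexpr("SpeciesScientific=\\([^;]*\\)", u2))[[1]]

# Keep only what sits between the outer parentheses
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", m)
species
```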
One possible 'advantage' of this approach is that if you have a number
of string records of this type, you can create nested lists for each
string and then manipulate the lists to get what you need. Hopefully
you can use some of these ideas for other purposes as well.
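To make the multi-record idea concrete, here is a rough sketch that parses each record into a named character vector and then pulls the same field from all of them. The records vector and the helper parse_record() are invented for this illustration; the helper assumes each field has the form Name=(Value), and if a name occurs twice in one record only the first value is returned.

```r
# Two made-up records in the same 'Name=(Value);...' layout
records <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens)",
  "SpeciesCommon=(Bacteria);SpeciesScientific=(Achromobacter cycloclastes)"
)

# Hypothetical helper: parse one record into a named character vector
parse_record <- function(rec) {
  fields <- strsplit(rec, ";", fixed = TRUE)[[1]]
  parts  <- strsplit(fields, "=", fixed = TRUE)
  vals   <- vapply(parts, function(p) gsub("[()]", "", p[2]), character(1))
  names(vals) <- vapply(parts, function(p) p[1], character(1))
  vals
}

# One parsed vector per record
parsed <- lapply(records, parse_record)

# Pull the same field out of every record
sci <- vapply(parsed, function(r) r[["SpeciesScientific"]], character(1))
sci
```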
Dennis
On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkiha...@gmail.com> wrote:
Hi ALL,
I have a very simple question regarding pattern matching. Could anyone
tell me how I can use R to retrieve a string pattern from a text file?
For example, my file contains the following information:
SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
and I want to extract the SpeciesScientific = (?) information from this
file. The problem is in the 3rd line, where the word SpeciesScientific
is split by a '+'. Could anyone help me please?
Thank you