Thank you David Wolfskill, David Winsemius, and Gabor! All very helpful and interesting fixes for the problem (compiled below)! Now I will see which one works best on the 944 rows that each have a cell of smooshed attributes...the attribute names should be the same in all the rows, if there is any mercy :)
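For the full column, one thing that may help: gsub() and read.table(text = ) both accept a character vector, so Gabor's steps (quoted below) can in principle be run on every row at once. The following is only a sketch under assumptions. The vector `rows` is a made-up stand-in for the real column, its second element is invented purely for illustration, and `quote = ""` / `comment.char = ""` are my own additions to guard against stray quotes or "#" in the values. It also assumes no value contains a comma.

```r
# Attribute names as listed in the thread
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# Hypothetical stand-in for the real column: element 1 is the example string,
# element 2 is invented here only to show several rows handled together
rows <- c(
  paste("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:",
        "Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:",
        "Manmade:no Permanence:permanent: Max water depth: <3:",
        "Primary substrate: Silt/Mud: Evidence of cattle grazing: none:",
        "Shoreline Emergent Veg(%): 1-25: Fish present: yes:",
        "Fish species: unkwn: no amphibians observed"),
  paste("Water temp:10: F Waterbody type:Stream: Water pH:7:",
        "Conductivity:Unkwn: Water color: Brown: Water turbidity: cloudy:",
        "Manmade:yes Permanence:temporary: Max water depth: >3:",
        "Primary substrate: Gravel: Evidence of cattle grazing: heavy:",
        "Shoreline Emergent Veg(%): 26-50: Fish present: no:",
        "Fish species: none: tadpoles observed")
)

# Colons and newlines become spaces (vectorised over all rows)
flat <- gsub("[:\n]", " ", rows)

# Punctuation in the names becomes "." so that "(" and ")" are not read as regex syntax
pat <- paste(gsub("[[:punct:]]", ".", attributes), collapse = "|")

# Each attribute name becomes a comma, leaving one comma-separated line per row
csv <- gsub(pat, ",", flat)

# One element of 'csv' is read as one line; the empty first field is dropped
dd <- read.table(text = csv, sep = ",", strip.white = TRUE,
                 quote = "", comment.char = "", stringsAsFactors = FALSE,
                 col.names = c("", attributes))[-1]
```

Note that read.table() runs the column names through make.names(), so "Water temp" ends up as `Water.temp`, and any trailing free-text comment stays attached to the last column.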
Joe Ceradini
University of Wyoming

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/14/16, David Wolfskill <da...@catwhisker.org> wrote:
> Happy Friday, indeed.
>
> It seems to me that the data need a bit of cleanup before attempting to
> parse -- for example, that "F" looks to be improperly delimited by ':'
> on either side. I can't tell from a single example if that's typical
> (either for that field, or for random fields throughout the complete
> dataset). On the off-chance it's the former, here's a bit of exercise
> that may lead you a bit closer to a solution:
>
> First, starting with "ugly":
>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
>> Manmade:no  Permanence:permanent:  Max water depth: <3: Primary substrate:
>> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
>> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
>> ugly
> [1] "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
> Manmade:no  Permanence:permanent:  Max water depth: <3: Primary substrate:
> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed"
>
> # First, see what a naive strsplit() does:
>
>> strsplit(ugly, ":")
> [[1]]
>  [1] "Water temp"                  "14"
>  [3] " F Waterbody type"           "Permanent Lake/Pond"
>  [5] " Water pH"                   "Unkwn"
>  [7] " Conductivity"               "Unkwn"
>  [9] " Water color"                " Clear"
> [11] " Water turbidity"            " clear"
> [13] " Manmade"                    "no  Permanence"
> [15] "permanent"                   "  Max water depth"
> [17] " <3"                         " Primary substrate"
> [19] " Silt/Mud"                   " Evidence of cattle grazing"
> [21] " none"                       " Shoreline Emergent Veg(%)"
> [23] " 1-25"                       " Fish present"
> [25] " yes"                        " Fish species"
> [27] " unkwn"                      " no amphibians observed"
>
> # OK; let's fix the "F":
>
>> ugly1 <- sub(": F ", "F: ", ugly)
>> ugly1
> [1] "Water temp:14F: Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
> Manmade:no  Permanence:permanent:  Max water depth: <3: Primary substrate:
> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed"
>
> # Now, that substring "Manmade:no  Permanence:permanent:" is problematic;
> # the "  " in there should apparently be ": " -- but we can't just do that
> # to all "  " substrings, because that would also affect
> # "Permanence:permanent:  Max water depth: <3:" -- the difference, though,
> # is that the one we don't want to change contains ":  ", so let's change
> # those. I'm assuming(!) that we don't really care about leading or
> # trailing spaces in the fields:
>
>> ugly2 <- gsub(" *: *", ":", ugly1)
>> ugly2
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no  Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Now that "  " shows up like a sore thumb. Just to make the point even
> # clearer, try the "naive" strsplit on what we have:
>
>> strsplit(ugly2, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no  Permanence"
> [15] "permanent"                  "Max water depth"
> [17] "<3"                         "Primary substrate"
> [19] "Silt/Mud"                   "Evidence of cattle grazing"
> [21] "none"                       "Shoreline Emergent Veg(%)"
> [23] "1-25"                       "Fish present"
> [25] "yes"                        "Fish species"
> [27] "unkwn"                      "no amphibians observed"
>
> # Note element [14]: that's the one we need to fix. I'll assume(!)
> # that that sort of thing may occur just about anywhere, so let's just
> # whack 'em all:
>
>> ugly3 <- gsub("  ", ":", ugly2)
>> ugly3
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no:Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Again, check a naive strsplit():
>
>> strsplit(ugly3, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no"
> [15] "Permanence"                 "permanent"
> [17] "Max water depth"            "<3"
> [19] "Primary substrate"          "Silt/Mud"
> [21] "Evidence of cattle grazing" "none"
> [23] "Shoreline Emergent Veg(%)"  "1-25"
> [25] "Fish present"               "yes"
> [27] "Fish species"               "unkwn"
> [29] "no amphibians observed"
>
> # OK; not what we want, but it's a lot closer. Now, watch this:
>
>> ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
>> strsplit(ugly4, "\001")
> [[1]]
>  [1] "Water temp:14F"                  "Waterbody type:Permanent Lake/Pond"
>  [3] "Water pH:Unkwn"                  "Conductivity:Unkwn"
>  [5] "Water color:Clear"               "Water turbidity:clear"
>  [7] "Manmade:no"                      "Permanence:permanent"
>  [9] "Max water depth:<3"              "Primary substrate:Silt/Mud"
> [11] "Evidence of cattle grazing:none" "Shoreline Emergent Veg(%):1-25"
> [13] "Fish present:yes"                "Fish species:unkwn"
> [15] "no amphibians observed"
>
> # At this point, at least elements [1] - [14] are each of the form
> # "tag:value", and thus, readily parsable. Element [15] appears to be
> # a somewhat-random comment; I suppose you could check for elements that
> # lack a (single) ':' and treat them "specially"....
>
> I hope that helps. Good luck!
>
> Peace,
> david
> --
> David H. Wolfskill da...@catwhisker.org
> Those who would murder in the name of God or prophet are blasphemous
> cowards.
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:
> Replace newlines and colons with a space since they seem to be junk,
> generate a pattern to replace the attributes with a comma, do the
> replacement, and finally read what is left into a data frame using
> the attributes as column names.
>
> (I have indented each line of code below by 2 spaces so if any line
> starts before that then it's been wrapped around by the email and
> needs to be adjusted.)
>
>   attributes <-
>     c("Water temp", "Waterbody type", "Water pH", "Conductivity",
>       "Water color", "Water turbidity", "Manmade", "Permanence",
>       "Max water depth", "Primary substrate",
>       "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
>       "Fish present", "Fish species")
>
>   ugly2 <- gsub("[:\n]", " ", ugly)
>
>   pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
>   ugly3 <- gsub(pat, ",", ugly2)
>
>   dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
>     col.names = c("", attributes))[-1]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, David Winsemius <dwinsem...@comcast.net> wrote:
>
>> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joecerad...@gmail.com> wrote:
>>
>> Hopefully this looks better. I did not realize gmail default was html.
>>
>> I have a dataframe with a column that has many fields smashed together.
>> I need to split the strings in the column into separate columns based
>> on patterns.
>>
>> Example of a string that needs to be split:
>>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
>> clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary
>> substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
>> Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
>> amphibians observed")
>> ugly
>>
>> As far as I can tell, there is not a single pattern that would work for
>> splitting. Splitting on ":" is close, but not quite right. Each of the
>> below attributes should be in a separate column, and are present in
>> the string (above) that needs to be split:
>>
>> attributes <- c("Water temp", "Waterbody type", "Water pH",
>> "Conductivity", "Water color", "Water turbidity", "Manmade",
>> "Permanence", "Max water depth", "Primary substrate", "Evidence of
>> cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish
>> species")
>>
>> Conceptually, I want to use the vector of attributes to split the
>> string. However, strsplit only uses the 1st value of the attributes
>> object:
>>
>> strsplit(ugly, attributes).
>
> I tried this:
>
> strsplit( ugly, split=paste0(attributes, collapse="|") )
>
> And noticed some of the attributes were not actually splitting so went back
> and did the data entry after making sure that there were no "\n"'s in the
> middle of attribute names:
>
> dput(attributes)
> c("Water temp", "Waterbody type", "Water pH", "Conductivity",
> "Water color", "Water turbidity", "Manmade", "Permanence",
> "Max water depth", "Primary substrate", "Evidence of cattle grazing",
> "Shoreline Emergent Veg(%)", "Fish present", "Fish species")
>
> strsplit( ugly, split=paste0(attributes, collapse="|") )
> [[1]]
>  [1] ""
>  [2] ":14: F "
>  [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "
>  [4] ":Unkwn: "
>  [5] ": Clear: "
>  [6] ":\nclear: "
>  [7] ":no "
>  [8] ":permanent: "
>  [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
> [10] ": yes: Fish species: unkwn: no\namphibians observed"
>
>>
>> Should I loop through the values of "attributes"?
>> Is there an argument in strsplit I'm missing that will do what I want?
>
> I don't think strsplit has such an argument. There may be packages that will
> support this. Perhaps the gsubfn package?
>
>> Different approach altogether?
>>
>> Thanks! Happy Friday.
>> Joe
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
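
A note on why some names may not split when the attributes are pasted together with "|": the "(" and ")" in "Shoreline Emergent Veg(%)" are regex metacharacters, and a line break inside the string interrupts names such as "Water pH". Below is a minimal sketch that works around both by collapsing whitespace first and quoting each name literally with \Q...\E (which needs perl = TRUE). `split_by_attributes` is a hypothetical helper name, not a function from any package, and the sketch assumes each attribute occurs at most once per string.

```r
# Attribute names as listed in the thread
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# The example string, with "\n" where the email wrapped it
ugly <- paste("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water",
              "pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:",
              "clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary",
              "substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline",
              "Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no",
              "amphibians observed", sep = "\n")

# Hypothetical helper: returns a character vector named by 'attributes'
split_by_attributes <- function(x, attributes) {
  # Collapse newlines and runs of spaces so names broken by "\n" match again
  x <- gsub("[[:space:]]+", " ", x)
  # \Q...\E makes PCRE treat each name literally, including "(" and ")"
  pat <- paste0("\\Q", attributes, "\\E", collapse = "|")
  m <- gregexpr(pat, x, perl = TRUE)
  keys <- regmatches(x, m)[[1]]                     # names found, in order
  vals <- regmatches(x, m, invert = TRUE)[[1]][-1]  # text following each name
  vals <- gsub("^[: ]+|[: ]+$", "", vals)           # trim colons and spaces
  out <- setNames(vals, keys)[attributes]           # NA for any name not found
  names(out) <- attributes
  out
}

res <- split_by_attributes(ugly, attributes)

# For a whole column, the per-string results can be stacked into a matrix
tab <- do.call(rbind, lapply(c(ugly, ugly), split_by_attributes,
                             attributes = attributes))
```

Values keep any interior colon (for example "14: F" for the temperature), and the trailing free-text comment stays attached to the last attribute, so both would still need a cleanup pass.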