Re: [R] How to remove square brackets, etc. from address strings?

Rui Barradas Sun, 27 May 2012 16:06:48 -0700

Hello,

On 27-05-2012 22:12, Sabina Arndt wrote:

Hello r-help members,
thank you very much for your reply, Rui Barradas.
Unfortunately, I'm not sure if I understand it correctly: I don't know how to create the vector's second element y that way. The pattern you used has to be extracted from the address strings first. This is more complex, as I'd tried to explain in my previous posts. It finally seems to work now.

Your data file has more than one line. I've called it "sabrina.txt" and then processed it with:


x <- readLines("sabrina.txt")

s <- strsplit(x, ";[[:space:]]\\[")
r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))

length(r)
[1] 21

So a vector 'y' and 19 others would have been needed.

Do you happen to have any idea on how I could put the country names back into their original lines / order, though?


r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
# drop NULL elements; unlike -which(...), this keeps everything when none is NULL
country.list <- r1[ !vapply(r1, is.null, logical(1)) ]
# clean up
rm(s, r, r1)

# See what we have
country.list

As far as I can tell they're in the original order. But what do you mean by "back into their original lines"?
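
If "back into their original lines" means keeping track of which input line each country came from, one possible sketch is shown below. The `country.list` here is a small made-up stand-in with the same shape as the real one (one character vector per input line), and the names `countries` and `per.line` are only illustrative. `lengths()` needs R 3.2.0 or later; `sapply(country.list, length)` is the older equivalent.

```r
# Hypothetical stand-in for country.list: one character vector per input line.
country.list <- list(c("Germany", "Germany", "Germany"),
                     "Portugal",
                     c("USA", "Germany"))

# Long format: one row per address, with the index of the line it came from.
countries <- data.frame(
  record  = rep(seq_along(country.list), times = lengths(country.list)),
  country = unlist(country.list),
  stringsAsFactors = FALSE
)

# Or one string per input line, convenient for putting next to the original data.
per.line <- vapply(country.list, paste, character(1), collapse = "; ")
```

Either form keeps the list order, so row or element i still corresponds to line i of the input as long as no lines were dropped beforehand.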

Thank you very much in advance!

Any time, glad to help.

Rui Barradas

Faithfully yours,

Sabina Arndt


On 27.05.2012 19:04, Rui Barradas wrote:
Hello,
Though I've not been following this thread, it seems like a regular expressions problem. In the code below, I've created a 'testdata' variable based on your post.
# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)
# 's' is a list of character vectors, each element's final word is a country
s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
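
One caveat with taking the final word: it truncates country names that contain a space. The sketch below compares the last-word approach with taking everything after the last comma. The `chunks` vector is made-up example data, not taken from the real file, and `trimws()` needs R 3.2.0 or later.

```r
# Hypothetical address chunks, one of them ending in a two-word country name.
chunks <- c("Univ Leipzig, Fac Med, Leipzig, Germany",
            "Univ Auckland, Auckland, New Zealand")

# Last whitespace-separated word, as in the approach above.
last.word <- sapply(strsplit(chunks, "[[:blank:]]"), tail, 1)

# Everything after the last comma, trimmed of surrounding whitespace.
after.comma <- trimws(sub("^.*,", "", chunks))
```

Here `last.word` ends up with "Zealand" for the second chunk, while `after.comma` keeps "New Zealand".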


If this isn't it, sorry for the intrusion.

Rui Barradas

On 27-05-2012 17:29, Sabina Arndt wrote:
Hello r-help members,
I'm very grateful for the reply which Sarah Goslee sent to me in such a prompt and helpful manner. It took me some time, but with a few amendments her suggestion now works not only for an example but for my entire data file as well:
> results
  [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
  [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
...

Thank you very much for that, dear Sarah!
All these names actually belong to the very first record, though, which contains eight addresses instead of only one:
> testdata[1]
[1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
> results[1]
  [1] "GERMANY"

How can I put the country names back into their original lines / order?
This is an example of the correct result I'd like to receive:

> results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
How can I achieve this result?
I think counting the semicolons outside square brackets - i.e. the ones before a "[" but behind a "]" - would be helpful in this regard, but I'm not sure how to do that, unfortunately. These semicolons directly follow the country names, like this, e.g.: "... Germany; [..." If I add "+ 1" to their number it results in the number of addresses for each record / line.
Thank you very much in advance!

Faithfully yours,

Sabina Arndt
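
The semicolon-counting idea described in this message can be sketched roughly as follows. The `records` vector is made-up example data, and the names `no_authors`, `n_addr` and `record_id` are illustrative only. The sketch assumes that every record is non-empty and does not end in a trailing semicolon, otherwise the counts and the split pieces would not line up.

```r
# Two hypothetical records: the first has three addresses, the second has one.
records <- c(
  "[A, B; C, D] Univ X, City, Germany; [E, F] Univ Y, City, Portugal; [G, H] Inst Z, City, Germany",
  "[I, J; K, L] Univ W, City, France"
)

# Drop the bracketed author lists so only address-separating semicolons remain.
no_authors <- gsub("\\[.*?\\]", "", records)

# Count the remaining semicolons per record; addresses = semicolons + 1.
semis  <- gregexpr(";", no_authors, fixed = TRUE)
n_addr <- sapply(semis, function(m) sum(m > 0)) + 1L

# A flat vector of countries, as produced by unlist(strsplit(...)).
pieces    <- unlist(strsplit(no_authors, ";", fixed = TRUE))
countries <- toupper(sub("^.*, ([A-Za-z ]*)$", "\\1", pieces))

# Regroup the flat vector by record, using the counts as run lengths.
record_id <- rep(seq_along(records), times = n_addr)
results   <- unname(split(countries, record_id))
```

With this layout `results[[1]]` holds all countries of the first record and `results[[2]]` those of the second, so the grouping by line is restored.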


Am 26.05.2012 00:19, schrieb Sarah Goslee:
Part of your problem is that your regexes have spaces in them, so
that's what you're matching.
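
To illustrate the point about spaces: the patterns from the linked script are not shown in this thread, so the first pattern below is only a hypothetical stand-in for a regex that contains literal spaces. The input vector `x` is also made up.

```r
# Hypothetical input: the same kind of field, without and with spaces inside the brackets.
x <- c("[Engel, Kathrin] Univ Leipzig, Germany",
       "[ Engel, Kathrin ] Univ Leipzig, Germany")

# A pattern containing literal spaces only matches brackets padded with spaces.
with_spaces <- gsub("\\[ .*? \\]", "", x)

# Without the spaces, both forms of bracketed text are removed.
without_spaces <- gsub("\\[.*?\\]", "", x)
```

The first element of `with_spaces` is left untouched because there is no space after its opening bracket, whereas `without_spaces` strips the bracketed part from both elements.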

A small reproducible example would be more useful. I'm not feeling
inclined to wade through all your linked files on Friday evening, but
see if this helps:
testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
results <- gsub("\\[.*?\\]", "", testdata)
results <- unlist(strsplit(results, ";"))
results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
names(results) <- NULL
results
[1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
    "Germany"     "Germany"     "Germany"


Sarah
On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.ar...@hotmail.de> wrote:
Hello r-help members,
the solutions which Sarah Goslee and arun sent to me in such a prompt and helpful manner work well with the examples I cut from the data.frame I'm analyzing. Thank you very much for that!
I incorporated them into my R-script and discovered that it still doesn't work properly, unfortunately. I have no idea why that's the case.
You see, I want to extract country names from the contents of tab-delimited text files. This is an example of the data I'm using:
http://pastebin.com/mYZNDXg6
This is the script I'm using to import the data:
http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder which doesn't contain any other .txt files.)
This is the script I'm using to extract the country names:
http://pastebin.com/G37fuPba
This is the string that's in the relevant field of the first record I'm working on:
[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany
This is the incorrect result my extraction script gives me for the first record:
C1s[1]
 [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
 [4] "GERMANY"                "DANIEL"                 "LESCAMIRIAM"
 [7] "GERMANY"                "ANKE"                   "MATTHIAS"
[10] "MATTHIAS"               "GERMANY"                "KERSTIN"
[13] "GERMANY"                "GERMANY"                "[SCHEIDT,HOLGER A."
[16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
[19] "GERMANY"
For some reason the first and sixth pair of the eight square brackets are not removed ... Do you understand why?
Instead I'd like to get this result, though:
C1s[1]
  [1] "GERMANY"        "GERMANY"        "GERMANY"
  [4] "GERMANY"        "GERMANY"        "GERMANY"
  [7] "GERMANY"        "GERMANY"

What am I doing wrong? What are the errors in my R-script?
Would anybody be so kind as to take a look and help me out, please?
Thank you very much in advance!

Faithfully yours,

Sabina Arndt


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
