Re: [R] Relist strings? [Was: How to remove square brackets, etc. from address strings?]

Sabina Arndt Tue, 29 May 2012 08:36:05 -0700

Hello,

thank you very much for your reply, Rui Barradas.
OK, I did what you said:


> x <- readLines("sabina.txt")
> s <- strsplit(x, ";[[:space:]]\\[")
> r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
> length(r)
[1] 20

I don't know why your result here was 21 since the file consists of only 20 
lines.

> r[[21]] <- NULL 
> r[[20]] <- r[[20]][ -length(r[[20]]) ] 
> r1 <- lapply(r, function(x) x[nchar(x) > 0]) 
> country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ] 
> rm(s, r, r1) 
> country.list 
list()

I also tried this:

> r[[20]] <- NULL
> r[[19]] <- r[[19]][ -length(r[[19]]) ] 
> r1 <- lapply(r, function(x) x[nchar(x) > 0]) 
> country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ] 
> rm(s, r, r1) 
> country.list 
list()
But the result was the same. For some reason this seems to be empty.
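
A likely explanation for the empty list, offered as a sketch and not as a confirmed diagnosis: if no element of r1 is NULL, which() returns integer(0), and a negated empty index still selects nothing, so r1[ -integer(0) ] is list(). Filtering with a logical vector avoids this. The toy list r1 below is made up for illustration.

```r
# Toy stand-in for r1 (hypothetical data): no element is NULL.
r1 <- list(c("Germany", "Germany"), "England", character(0))

# which() finds no NULL elements, so the index is integer(0) ...
idx <- which(sapply(r1, is.null))

# ... and a negated empty index selects nothing at all.
broken <- r1[-idx]

# A logical filter keeps every element that is not NULL.
keep_non_null <- r1[!sapply(r1, is.null)]

# To drop records that ended up with no country, filter on length instead.
keep_non_empty <- r1[sapply(r1, length) > 0]
```

Note that x[nchar(x) > 0] yields character(0), never NULL, so the length-based filter is probably the one that matches the intent here.
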
But if I try this before "country.list <- r1[ -which(sapply(r1, function(x) 
is.null(x))) ]":

> r[[18]]
[1] "England"             "Scotland"            "Germany"             "Germany" 
            "England"             "WOS:000296579800006"

This is almost correct, but the last country name of this record is missing
and has been replaced with the value of the very last column / field of the
record. Do you know how to correct this?
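
The stray "WOS:..." value suggests that the whole tab-delimited line is being split, not only the address field: "[[:blank:]]" matches tabs as well as spaces, so the last token of the final address is the record's last field. A minimal sketch, assuming the address field can be isolated by splitting on tabs first. The sample line, its WOS id, and the field position (2) are hypothetical and would have to be adapted to the real file.

```r
# Hypothetical record: three tab-separated fields, addresses in field 2.
line <- paste("Some title",
              "[A, B] Univ X, London, England; [C, D] Univ Y, Leipzig, Germany",
              "WOS:000000000000000", sep = "\t")

# Splitting the whole line: the last token of the last address is the WOS id.
s_all <- strsplit(line, ";[[:space:]]+\\[")[[1]]
bad <- sapply(strsplit(s_all, "[[:blank:]]"), tail, 1)

# Isolate the address field first, then split only that field.
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
addr <- fields[2]  # assumed position of the address field
s_addr <- strsplit(addr, ";[[:space:]]+\\[")[[1]]
good <- sapply(strsplit(s_addr, "[[:blank:]]"), tail, 1)
```

Here bad ends in the WOS id, while good contains only the two country names.
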
In addition, I need to apply some further adjustments to the country names
before output, since there are many different versions of US addresses (see,
e.g., record 000296579800006). I'm not sure I understand your function
correctly; do you think the edits I mentioned could be fitted in there as well?
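
One way such adjustments could be fitted in is a small clean-up function applied to each extracted name. The sketch below is only an assumption about what the US variants look like (a two-letter state code, an optional ZIP code, then "USA"); normalize_country is a hypothetical helper, and its pattern would need checking against the real records.

```r
# Hypothetical helper: collapse assumed variants of US address endings
# ("CA 94305 USA", "NY USA") to plain "USA". Other names pass through.
normalize_country <- function(x) {
  # Trim leading and trailing whitespace.
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  # Two-letter state code, optional ZIP (5 or 5+4 digits), then "USA".
  x <- sub("^[A-Z]{2}( [0-9]{5}(-[0-9]{4})?)? USA$", "USA", x)
  x
}

cleaned <- normalize_country(c("Germany", "CA 94305 USA", " NY USA", "USA"))
```

Applied to a per-record list, this would be lapply(country.list, normalize_country).
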

Thank you very much for bearing with me! I swear I usually am not that dumb!

Faithfully yours,

Sabina Arndt


> Date: Tue, 29 May 2012 13:00:36 +0100
> From: ruipbarra...@sapo.pt
> To: sabina.ar...@hotmail.de
> CC: r-help@r-project.org
> Subject: Re: [R] Relist strings? [Was: How to remove square brackets, etc. 
> from address strings?]
> 
> Hello,
> 
> The error message means that 'x' is not a character vector. Can't you 
> try it only with the text in the link you've posted, 
> http://pastebin.com/mYZNDXg6 ?
> 
> I'm asking this because I've just checked it and it doesn't give any error.
> 
> On 29-05-2012 12:39, Sabina Arndt wrote:
> > Hello r-help members,
> >
> > thank you very much for your reply, Rui Barradas.
> >
> >> Your data file has more than one line.
> > Yes, each line is a new record and I read several such data files into one 
> > data.frame.
> 
> 
> This is probably why it gives you that error. Process just one file,
> like I've said, then say something.
> (Moreover, it makes sense to solve the problems with a smaller set, then
> move on to the larger one.)
> 
> Rui Barradas
> 
> 
> >
> >> I've called it "sabrina.txt" and then processed with:
> >>
> >> x<- readLines("sabrina.txt")
> >>
> >> s<- strsplit(x, ";[[:space:]]\\[")
> > Thank you; but this gives me an error message:
> >
> > Error in strsplit(x, ";[[:space:]]\\[") : non-character argument
> >
> > So I cannot check the rest of your suggestion, unfortunately.
> >
> >>> Do you happen to have any idea on how I could put the country names
> >>> back into their original lines / order, though?
> > ...
> >> As far as I can tell they're in the original order. But what do you mean
> >> by "back into their original lines"?
> > Each line of my data.frame represents a record - except for the first one 
> > which is the header. Each record has different addresses in the field / 
> > column I'm analyzing. In fact, the records vary in the number of addresses 
> > they feature (The first has eight, the second only one, etc.). I don't want 
> > a simple list of all the country names but a new field in my data.frame 
> > which contains for each record the country name(s) extracted from the 
> > addresses of that very same record.
> > I'd like to measure the number of elements after applying strsplit() to 
> > each string. I tried:
> >
> > ...
> > results<- strsplit(results, ";")
> > numbers<- sapply(results, length)
> > results<- unlist(results)
> > ...
> >
> > But this doesn't seem to work, because:
> >
> >> numbers
> >    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
> > 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
> > 1 1
> > ...
> >
> > Does anybody know how I would achieve these results instead:
> >
> >> numbers[1]
> >   [1] 8
> >> numbers[2]
> >   [1] 1
> >> results[1]
> >   [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" 
> > "GERMANY"
> >> results[2]
> >   [1] "GERMANY"
> >
> > Thank you very much in advance!
> >
> > Faithfully yours,
> >
> > Sabina Arndt
> >
> > PS: I updated the subject of my message to reflect the progress I've made 
> > thanks to your replies. I hope this is appropriate and clearer this way.
> >
> >
> >>> On 27.05.2012 19:04, Rui Barradas wrote:
> >>>> Hello,
> >>>>
> >>>> Though I've not been following this thread, it seems like a regular
> >>>> expressions problem.
> >>>> In the code below, I've created a 'testdata' variable based on your
> >>>> post.
> >>>>
> >>>> # create a vector with two elements.
> >>>> x<- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
> >>>> y<- gsub("Germany", "Portugal", x)
> >>>> testdata<- c(x, y)
> >>>>
> >>>> # 's' is a list of character vectors, each element's final word is a
> >>>> country
> >>>> s<- strsplit(testdata, ";[[:space:]]+\\[")
> >>>> lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
> >>>>
> >>>>
> >>>> If this isn't it, sorry for the intrusion.
> >>>>
> >>>> Rui Barradas
> >>>>
> >>>> On 27-05-2012 17:29, Sabina Arndt wrote:
> >>>>> Hello r-help members,
> >>>>>
> >>>>> I'm very grateful for the reply which Sarah Goslee sent to me in
> >>>>> such a prompt and helpful manner.
> >>>>> It took me some time, but with a few amendments her suggestion now
> >>>>> works not only for an example but for my entire data file as well:
> >>>>>
> >>>>>> results
> >>>>>    [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
> >>>>>    [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
> >>>>> ...
> >>>>>
> >>>>> Thank you very much for that, dear Sarah!
> >>>>>
> >>>>> All these names actually belong to the very first record, though,
> >>>>> which contains eight addresses instead of only one:
> >>>>>
> >>>>>> testdata[1]
> >>>>>    [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
> >>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
> >>>>> Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery,
> >>>>> Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol
> >>>>> Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher,
> >>>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal
> >>>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
> >>>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
> >>>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol &
> >>>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen;
> >>>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys,
> >>>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim
> >>>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut
> >>>>> AG, Martinsried, Germany"
> >>>>>> results[1]
> >>>>>    [1] "GERMANY"
> >>>>>
> >>>>> How can I put the country names back into their original lines / order?
> >>>>> This is an example of the correct result I'd like to receive:
> >>>>>
> >>>>>> results[1]
> >>>>>    [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
> >>>>> "GERMANY" "GERMANY"
> >>>>>
> >>>>> How can I achieve this result?
> >>>>>
> >>>>> I think counting the semicolons outside square brackets - i.e. the
> >>>>> ones before a "[" but behind a "]" would be helpful in this regard,
> >>>>> but I'm not sure how to do that, unfortunately. These semicolons
> >>>>> directly follow the country names, like this, e.g.: "... Germany; [..."
> >>>>> If I add "+ 1" to their number it results in the number of addresses
> >>>>> for each record / line.
> >>>>>
> >>>>> Thank you very much in advance!
> >>>>>
> >>>>> Faithfully yours,
> >>>>>
> >>>>> Sabina Arndt
> >>>>>
> >>>>>
> >>>>> On 26.05.2012 00:19, Sarah Goslee wrote:
> >>>>>> Part of your problem is that your regexes have spaces in them, so
> >>>>>> that's what you're matching.
> >>>>>>
> >>>>>> A small reproducible example would be more useful. I'm not feeling
> >>>>>> inclined to wade through all your linked files on Friday evening, but
> >>>>>> see if this helps:
> >>>>>>
> >>>>>>> testdata<- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
> >>>>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
> >>>>>>> Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam;
> >>>>>>> Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem &
> >>>>>>> Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias;
> >>>>>>> Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
> >>>>>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
> >>>>>>> Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig,
> >>>>>>> Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
> >>>>>>> Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.;
> >>>>>>> Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med
> >>>>>>> Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt
> >>>>>>> Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin]
> >>>>>>> Ingenium Pharmaceut AG, Martinsried, Germany"
> >>>>>>> results<- gsub("\\[.*?\\]", "", testdata)
> >>>>>>> results<- unlist(strsplit(results, ";"))
> >>>>>>> results<- sapply(results, function(x)sub("^.*, ([A-Za-z ]*)$",
> >>>>>>> "\\1", x))
> >>>>>>> names(results)<- NULL
> >>>>>>> results
> >>>>>> [1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
> >>>>>>      "Germany"     "Germany"     "Germany"
> >>>>>>
> >>>>>>
> >>>>>> Sarah
> >>>>>>
> >>>>>> On Fri, May 25, 2012 at 4:31 PM, Sabina
> >>>>>> Arndt <sabina.ar...@hotmail.de> wrote:
> >>>>>>> Hello r-help members,
> >>>>>>>
> >>>>>>> the solutions which Sarah Goslee and arun sent to me in such a
> >>>>>>> prompt and
> >>>>>>> helpful manner work well with the examples I cut from the
> >>>>>>> data.frame I'm
> >>>>>>> analyzing. Thank you very much for that!
> >>>>>>> I incorporated them into my R-script and discovered that it still
> >>>>>>> doesn't
> >>>>>>> work properly, unfortunately. I have no idea why that's the case.
> >>>>>>> You see, I want to extract country names from the contents of
> >>>>>>> tab-delimited
> >>>>>>> text files. This is an example of the data I'm using:
> >>>>>>> http://pastebin.com/mYZNDXg6
> >>>>>>> This is the script I'm using to import the data:
> >>>>>>> http://pastebin.com/Z10UUH3z (It requires the text files to be in
> >>>>>>> a folder
> >>>>>>> which doesn't contain any other .txt files.)
> >>>>>>> This is the script I'm using to extract the country names:
> >>>>>>> http://pastebin.com/G37fuPba
> >>>>>>> This is the string that's in the relevant field of the first
> >>>>>>> record I'm
> >>>>>>> working on:
> >>>>>>>
> >>>>>>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten;
> >>>>>>> Schulz,
> >>>>>>> Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany;
> >>>>>>> [Teupser,
> >>>>>>> Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac
> >>>>>>> Med, Inst
> >>>>>>> Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes,
> >>>>>>> Anke; Kern,
> >>>>>>> Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac
> >>>>>>> Med, Dept
> >>>>>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
> >>>>>>> Univ
> >>>>>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
> >>>>>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
> >>>>>>> Pharmacol & Toxicol,
> >>>>>>> Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster,
> >>>>>>> Daniel]
> >>>>>>> Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
> >>>>>>> [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
> >>>>>>> Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried,
> >>>>>>> Germany
> >>>>>>>
> >>>>>>> This is the incorrect result my extraction script gives me for the
> >>>>>>> first
> >>>>>>> record:
> >>>>>>>
> >>>>>>>> C1s[1]
> >>>>>>>    [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
> >>>>>>>    [4] "GERMANY"                "DANIEL"                 "LESCA
> >>>>>>> MIRIAM"
> >>>>>>>    [7] "GERMANY"                "ANKE"                   "MATTHIAS"
> >>>>>>> [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
> >>>>>>> [13] "GERMANY"                "GERMANY"                "[SCHEIDT,
> >>>>>>> HOLGER
> >>>>>>> A."
> >>>>>>> [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
> >>>>>>> [19] "GERMANY"
> >>>>>>>
> >>>>>>> For some reason the first and sixth pair of the eight square
> >>>>>>> brackets are
> >>>>>>> not removed ... Do you understand why?
> >>>>>>> Instead I'd like to get this result, though:
> >>>>>>>
> >>>>>>>> C1s[1]
> >>>>>>>    [1] "GERMANY"        "GERMANY"        "GERMANY"
> >>>>>>>    [4] "GERMANY"        "GERMANY"        "GERMANY"
> >>>>>>>    [7] "HUMBOLDT"        "GERMANY"
> >>>>>>>
> >>>>>>> What am I doing wrong? What are the errors in my R-script?
> >>>>>>> Would anybody be so kind as to take a look and help me out, please?
> >>>>>>> Thank you very much in advance!
> >>>>>>>
> >>>>>>> Faithfully yours,
> >>>>>>>
> >>>>>>> Sabina Arndt
                                          
        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.