Hello, thank you very much for your reply, Rui Barradas. OK, I did what you said:

> x <- readLines("sabina.txt")
> s <- strsplit(x, ";[[:space:]]\\[")
> r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
> length(r)
[1] 20

I don't know why your result here was 21, since the file consists of only 20 lines.

> r[[21]] <- NULL
> r[[20]] <- r[[20]][ -length(r[[20]]) ]
> r1 <- lapply(r, function(x) x[nchar(x) > 0])
> country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ]
> rm(s, r, r1)
> country.list
list()

I also tried this:

> r[[20]] <- NULL
> r[[19]] <- r[[19]][ -length(r[[19]]) ]
> r1 <- lapply(r, function(x) x[nchar(x) > 0])
> country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ]
> rm(s, r, r1)
> country.list
list()

But the result was the same. For some reason the list is empty. But if I try this before "country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ]":

> r[[18]]
[1] "England" "Scotland" "Germany" "Germany" "England" "WOS:000296579800006"

This is almost correct. But the last country name is missing from this record and has been replaced with the value of the very last column / field of this record. Do you know how to correct this?
In addition to that, there are some further adjustments I need to apply to the country names before output, since there are many different versions of US addresses, e.g. (See 000296579800006.). I'm not sure I understand your function correctly; do you think the edits I mentioned could be fit in there as well?

Thank you very much for bearing with me! I swear I usually am not that dumb!

Faithfully yours,

Sabina Arndt

> Date: Tue, 29 May 2012 13:00:36 +0100
> From: ruipbarra...@sapo.pt
> To: sabina.ar...@hotmail.de
> CC: r-help@r-project.org
> Subject: Re: [R] Relist strings? [Was: How to remove square brackets, etc. from address strings?]
>
> Hello,
>
> The error message means that 'x' is not a character vector. Can't you
> try it only with the text in the link you've posted,
> http://pastebin.com/mYZNDXg6 ?
>
> I'm asking this because I've just checked it and it doesn't give any error.
>
> On 29-05-2012 12:39, Sabina Arndt wrote:
> > Hello r-help members,
> >
> > thank you very much for your reply, Rui Barradas.
> >
> >> Your data file has more than one line.
> > Yes, each line is a new record and I read several such data files into one
> > data.frame.
> >
> This is probably why it gives you that error. Process just one file,
> like I've said, then say something.
> (Moreover, it makes sense to solve the problems with a smaller set, then
> move on to the larger one.)
>
> Rui Barradas
>
> >
> >> I've called it "sabrina.txt" and then processed with:
> >>
> >> x <- readLines("sabrina.txt")
> >>
> >> s <- strsplit(x, ";[[:space:]]\\[")
> > Thank you; but this gives me an error message:
> >
> > Error in strsplit(x, ";[[:space:]]\\[") : non-character argument
> >
> > So I cannot check the rest of your suggestion, unfortunately.
> >
> >>> Do you happen to have any idea on how I could put the country names
> >>> back into their original lines / order, though?
> > ...
> >> As far as I can tell they're in the original order. But what do you mean
> >> by "back into their original lines"?
> > Each line of my data.frame represents a record - except for the first one
> > which is the header. Each record has different addresses in the field /
> > column I'm analyzing. In fact, the records vary in the number of addresses
> > they feature (The first has eight, the second only one, etc.). I don't want
> > a simple list of all the country names but a new field in my data.frame
> > which contains for each record the country name(s) extracted from the
> > addresses of that very same record.
> > I'd like to measure the number of elements after applying strsplit() to
> > each string. I tried:
> >
> > ...
> > results <- strsplit(results, ";")
> > numbers <- sapply(results, length)
> > results <- unlist(results)
> > ...
> >
> > But this doesn't seem to work, because:
> >
> >> numbers
> > [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> > 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> > 1 1
> > ...
> >
> > Does anybody know how I would achieve these results instead:
> >
> >> numbers[1]
> > [1] 8
> >> numbers[2]
> > [1] 1
> >> results[1]
> > [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
> >> results[2]
> > [1] "GERMANY"
> >
> > Thank you very much in advance!
> >
> > Faithfully yours,
> >
> > Sabina Arndt
> >
> > PS: I updated the subject of my message to reflect the progress I've made
> > thanks to your replies. I hope this is appropriate and clearer this way.
> >
> >
> >>> On 27.05.2012 19:04, Rui Barradas wrote:
> >>>> Hello,
> >>>>
> >>>> Though I've not been following this thread, it seems like a regular
> >>>> expressions problem.
> >>>> In the code below, I've created a 'testdata' variable based on your
> >>>> post.
> >>>>
> >>>> # create a vector with two elements.
> >>>> x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
> >>>> y <- gsub("Germany", "Portugal", x)
> >>>> testdata <- c(x, y)
> >>>>
> >>>> # 's' is a list of character vectors, each element's final word is a
> >>>> # country
> >>>> s <- strsplit(testdata, ";[[:space:]]+\\[")
> >>>> lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
> >>>>
> >>>>
> >>>> If this isn't it, sorry for the intrusion.
> >>>>
> >>>> Rui Barradas
> >>>>
> >>>> On 27-05-2012 17:29, Sabina Arndt wrote:
> >>>>> Hello r-help members,
> >>>>>
> >>>>> I'm very grateful for the reply which Sarah Goslee sent to me in
> >>>>> such a prompt and helpful manner.
> >>>>> It took me some time, but with a few amendments her suggestion now
> >>>>> works not only for an example but for my entire data file as well:
> >>>>>
> >>>>>> results
> >>>>> [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
> >>>>> [5] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
> >>>>> ...
> >>>>>
> >>>>> Thank you very much for that, dear Sarah!
> >>>>>
> >>>>> All these names actually belong to the very first record, though,
> >>>>> which contains eight addresses instead of only one:
> >>>>>
> >>>>>> testdata[1]
> >>>>> [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
> >>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
> >>>>> Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery,
> >>>>> Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol
> >>>>> Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher,
> >>>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal
> >>>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
> >>>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
> >>>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol &
> >>>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen;
> >>>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys,
> >>>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim
> >>>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut
> >>>>> AG, Martinsried, Germany"
> >>>>>> results[1]
> >>>>> [1] "GERMANY"
> >>>>>
> >>>>> How can I put the country names back into their original lines / order?
> >>>>> This is an example of the correct result I'd like to receive:
> >>>>>
> >>>>>> results[1]
> >>>>> [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
> >>>>>
> >>>>> How can I achieve this result?
> >>>>>
> >>>>> I think counting the semicolons outside square brackets - i.e. the
> >>>>> ones before a "[" but behind a "]" - would be helpful in this regard,
> >>>>> but I'm not sure how to do that, unfortunately. These semicolons
> >>>>> directly follow the country names, like this, e.g.: "... Germany; [..."
> >>>>> If I add "+ 1" to their number it results in the number of addresses
> >>>>> for each record / line.
> >>>>>
> >>>>> Thank you very much in advance!
> >>>>>
> >>>>> Faithfully yours,
> >>>>>
> >>>>> Sabina Arndt
> >>>>>
> >>>>>
> >>>>> On 26.05.2012 00:19, Sarah Goslee wrote:
> >>>>>> Part of your problem is that your regexes have spaces in them, so
> >>>>>> that's what you're matching.
> >>>>>>
> >>>>>> A small reproducible example would be more useful. I'm not feeling
> >>>>>> inclined to wade through all your linked files on Friday evening, but
> >>>>>> see if this helps:
> >>>>>>
> >>>>>>> testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
> >>>>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
> >>>>>>> Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam;
> >>>>>>> Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem &
> >>>>>>> Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias;
> >>>>>>> Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
> >>>>>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
> >>>>>>> Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig,
> >>>>>>> Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
> >>>>>>> Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.;
> >>>>>>> Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med
> >>>>>>> Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt
> >>>>>>> Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin]
> >>>>>>> Ingenium Pharmaceut AG, Martinsried, Germany"
> >>>>>>> results <- gsub("\\[.*?\\]", "", testdata)
> >>>>>>> results <- unlist(strsplit(results, ";"))
> >>>>>>> results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$",
> >>>>>>> "\\1", x))
> >>>>>>> names(results) <- NULL
> >>>>>>> results
> >>>>>> [1] "New Zealand" "USA" "Germany" "Germany" "Germany"
> >>>>>> "Germany" "Germany" "Germany"
> >>>>>>
> >>>>>>
> >>>>>> Sarah
> >>>>>>
> >>>>>> On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.ar...@hotmail.de> wrote:
> >>>>>>> Hello r-help members,
> >>>>>>>
> >>>>>>> the solutions which Sarah Goslee and arun sent to me in such a
> >>>>>>> prompt and helpful manner work well with the examples I cut from
> >>>>>>> the data.frame I'm analyzing. Thank you very much for that!
> >>>>>>> I incorporated them into my R-script and discovered that it still
> >>>>>>> doesn't work properly, unfortunately. I have no idea why that's
> >>>>>>> the case.
> >>>>>>> You see, I want to extract country names from the contents of
> >>>>>>> tab-delimited text files. This is an example of the data I'm using:
> >>>>>>> http://pastebin.com/mYZNDXg6
> >>>>>>> This is the script I'm using to import the data:
> >>>>>>> http://pastebin.com/Z10UUH3z (It requires the text files to be in
> >>>>>>> a folder which doesn't contain any other .txt files.)
> >>>>>>> This is the script I'm using to extract the country names:
> >>>>>>> http://pastebin.com/G37fuPba
> >>>>>>> This is the string that's in the relevant field of the first
> >>>>>>> record I'm working on:
> >>>>>>>
> >>>>>>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten;
> >>>>>>> Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig,
> >>>>>>> Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim]
> >>>>>>> Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost,
> >>>>>>> Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher,
> >>>>>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal
> >>>>>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
> >>>>>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
> >>>>>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol &
> >>>>>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen;
> >>>>>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys,
> >>>>>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim
> >>>>>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium
> >>>>>>> Pharmaceut AG, Martinsried, Germany
> >>>>>>>
> >>>>>>> This is the incorrect result my extraction script gives me for the
> >>>>>>> first record:
> >>>>>>>
> >>>>>>>> C1s[1]
> >>>>>>>  [1] "[ENGEL, KATHRIN M. Y." "KRISTIN"  "TORSTEN"
> >>>>>>>  [4] "GERMANY"               "DANIEL"   "LESCA MIRIAM"
> >>>>>>>  [7] "GERMANY"               "ANKE"     "MATTHIAS"
> >>>>>>> [10] "MATTHIAS"              "GERMANY"  "KERSTIN"
> >>>>>>> [13] "GERMANY"               "GERMANY"  "[SCHEIDT, HOLGER A."
> >>>>>>> [16] "JUERGEN"               "GERMANY"  "HUMBOLDT"
> >>>>>>> [19] "GERMANY"
> >>>>>>>
> >>>>>>> For some reason the first and sixth pair of the eight square
> >>>>>>> brackets are not removed ... Do you understand why?
> >>>>>>> Instead I'd like to get this result, though:
> >>>>>>>
> >>>>>>>> C1s[1]
> >>>>>>> [1] "GERMANY" "GERMANY" "GERMANY"
> >>>>>>> [4] "GERMANY" "GERMANY" "GERMANY"
> >>>>>>> [7] "HUMBOLDT" "GERMANY"
> >>>>>>>
> >>>>>>> What am I doing wrong? What are the errors in my R-script?
> >>>>>>> Would anybody be so kind as to take a look and help me out, please?
> >>>>>>> Thank you very much in advance!
> >>>>>>>
> >>>>>>> Faithfully yours,
> >>>>>>>
> >>>>>>> Sabina Arndt
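
A note on the two open questions at the top of this message, offered as a sketch rather than a confirmed diagnosis. The empty country.list is most likely the usual -which() pitfall: if no element of r1 is NULL, which() returns integer(0), and indexing with -integer(0) selects nothing. The stray "WOS:..." entry probably appears because readLines() returns the whole tab-delimited line, so the last blank-separated token after the final address is the record's last field rather than a country. The example below works on the address field only and keeps one vector of countries per record. The two records are shortened, made-up strings, extract_countries is a hypothetical helper name, and the assumption that US addresses end in " USA" is mine, not something the thread confirms.

```r
# 1) The -which() pitfall: nothing is NULL, so which() returns integer(0)
r1 <- list("Germany", c("England", "Scotland"))
is_null <- sapply(r1, is.null)
dropped <- r1[-which(is_null)]  # list(): -integer(0) selects nothing
kept    <- r1[!is_null]         # logical indexing keeps both elements

# 2) Hypothetical helper: one vector of countries per record
extract_countries <- function(address_field) {
  # remove the bracketed author lists, e.g. "[Engel, Kathrin M. Y.; ...]"
  no_authors <- gsub("\\[[^]]*\\]", "", address_field)
  # one list element per record, one string per address
  addresses <- strsplit(no_authors, ";", fixed = TRUE)
  lapply(addresses, function(a) {
    a <- trimws(a)
    a <- a[nzchar(a)]
    # keep only the text after the last comma
    country <- trimws(sub("^.*,", "", a))
    # assumption: US addresses look like "MA 02138 USA"
    country[grepl(" USA$", country)] <- "USA"
    country
  })
}

# made-up, shortened records standing in for the address column
records <- c(
  "[Engel, Kathrin M. Y.; Schroeck, Kristin] Univ Leipzig, Fac Med, Leipzig, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany",
  "[Doe, Jane] Example Univ, Dept Biol, Cambridge, MA 02138 USA"
)

countries     <- extract_countries(records)
n_addresses   <- sapply(countries, length)             # 2 1
country_field <- sapply(countries, paste, collapse = "; ")
print(country_field)  # "Germany; Germany" "USA"
```

Because country_field has one string per record, it could be stored as a new column of the data.frame. If the address column is called C1 (a guess based on the C1s variable quoted above), extract_countries(df$C1) would be the call to try.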