If we can assume that the abstract is always the 4th paragraph of each <description>, then we can try something like this:

library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
   isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
out <- cbind(
   Author = unlist(xpathApply(doc, "//author", xmlValue)),
   PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
   Abstract = unlist(xpathApply(doc, "//description",
      function(x) {
         on.exit(free(doc2))
         doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
            useInternalNodes = TRUE, trim = TRUE)
         xpathApply(doc2, "//p[4]", xmlValue)
      }
   )))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field

The last line produces (it may look messed up in this email):

> substring(out, 1, 25) # display it
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"

On Dec 15, 2007 10:13 PM, David Winsemius <[EMAIL PROTECTED]> wrote:
> David Winsemius <[EMAIL PROTECTED]> wrote in
> news:[EMAIL PROTECTED]:
>
> > "Farrel Buchinsky" <[EMAIL PROTECTED]> wrote in
> > news:[EMAIL PROTECTED]:
> >
> >> On Dec 13, 2007 11:35 PM, Robert Gentleman <[EMAIL PROTECTED]>
> >> wrote:
> >>> or just try looking in the annotate package from Bioconductor
> >>>
> >>
> >> Yip. annotate seems to be the most streamlined way to do this.
> >> 1) How does one turn the list that is created into a dataframe whose
> >> column names are along the lines of date, title, journal, authors etc
> >
> > Gabor's example already did that task.
> >
>
> Actually the object returned by Gabor's method was a list of lists. Here
> is one way (probably very inefficient) of getting "doc" into a
> data.frame:
>
> colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
> doc = doc, fun = xmlValue)
>
> titles=as.vector(unlist(colvals[1])[3:17])
>
> # needed to drop extraneous titles for search name and an NCBI header
> #>str(colvals)
> #List of 3
> # $ //title :List of 17
> # ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
> # ..$ : chr "NCBI PubMed"
>
> authors=colvals[[2]]
> jrnls=colvals[[3]]
>
> # not sure why, but trying to do it in one step failed:
> # cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
> # authors=colvals[[2]],jnrls=colvals[[3]])
> # Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
> # authors = colvals[[2]], :
> # arguments imply differing number of rows: 15, 1
> # but the following worked
>
> cites<-data.frame(titles=as.vector(titles))
> cites$author<-authors
> cites$jrnls<-jrnls
> cites
>
> I am still wondering how to extract material that does not have an XML
> tag. Each item looks like:
>
> <item>
> <title>Gastroesophageal reflux in patients with recurrent laryngeal
> papillomatosis.</title>
> <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> tmpl=NoSidebarfile&db=PubMed&cmd=Retrieve&list_uids=17589729
> &dopt=Abstract</link>
> <description>
> <![CDATA[
> <table border="0" width="100%"><tr><td align="left"><a
> href="http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0034-
> 72992007000200011&lng=en&nrm=iso&tlng=en"><img
> src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
> img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
> href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> db=PubMed&cmd=Display&dopt=PubMed_PubMed&from_uid=17589729">
> Related Articles</a></td></tr></table>
> <p><b>Gastroesophageal reflux in patients with recurrent
> laryngeal papillomatosis.</b></p>
> <p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
> </p>
> <p>Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR,
> Fujita R, De Marco EK</p>
> <p>Evidence of a relation between gastroesophaeal reflux and
> pediatric respiratory disorders increases every year. Many respiratory
> symptoms and clinical conditions such as stridor, chronic cough, and
> recurrent pneumonia and bronchitis appear to be related to
> gastroesophageal reflux. Some studies have also suggested that
> gastroesophageal reflux may be associated with recurrent laryngeal
> papillomatosis, contributing to its recurrence and severity. AIM: the aim
> of this study was to verify the frequency and intensity of
> gastroesophageal reflux in children with recurrent laryngeal
> papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
> between 3 and 12 years, presenting laryngeal papillomatosis, were
> included in this study. The children underwent 24-hour double-probe pH-
> metry. RESULTS: fifty percent of the patients had evidence of
> gastroesophageal reflux at the distal sphincter; 90% presented reflux at
> the proximal sphincter. CONCLUSION: the frequency of proximal
> gastroesophageal reflux is significantly increased in patients with
> recurrent laryngeal papillomatosis.</p>
> <p>PMID: 17589729 [PubMed - in process]</p> ]]>
> </description>
> <author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
> Marco EK</author>
> <category>Rev Bras Otorrinolaringol (Engl Ed)</category>
> <guid isPermaLink="false">PubMed:17589729</guid>
> </item>
>
> I would like to access, for instance, the PMID or the abstract within the
> <description> element, but I do not think that they have names in the
> same way that <author> or <category> have xml named nodes. I suspect that
> getting the output in a different format, say as MEDLINE, might produce
> output that was tagged more completely.
>
>
> --
> David Winsemius
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
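
On the quoted question about reaching the PMID or the abstract inside <description>: if the position of the abstract turns out to vary between items, one alternative to a fixed paragraph number is to locate the paragraphs by their labels. Here is a minimal sketch in base R. It assumes the paragraph texts of one description have already been extracted into a character vector (for example with xpathApply(doc2, "//p", xmlValue)), and that each item has an "Authors:" paragraph followed later by a "PMID:" paragraph, as in the quoted item. The function name extract_fields and the abridged sample text are my own, for illustration only.

```r
# Hypothetical helper (not part of the XML package): pick fields out of the
# <p> texts of one <description> by label rather than by position.
extract_fields <- function(paras) {
  paras <- trimws(paras)
  i_auth <- grep("^Authors:", paras)[1]  # NA if there is no such paragraph
  i_pmid <- grep("^PMID:", paras)[1]

  # PMID: the run of digits that follows the "PMID:" label
  pmid <- NA_character_
  if (!is.na(i_pmid)) {
    pmid <- sub("^PMID:[[:space:]]*([0-9]+).*$", "\\1", paras[i_pmid])
  }

  # Abstract: assumed to be whatever lies between "Authors:" and "PMID:"
  abstract <- NA_character_
  if (!is.na(i_auth) && !is.na(i_pmid) && i_pmid - i_auth > 1) {
    abstract <- paste(paras[(i_auth + 1):(i_pmid - 1)], collapse = " ")
  }

  c(PMID = pmid, Abstract = abstract)
}

# Paragraph texts abridged from the item quoted above
paras <- c(
  "Gastroesophageal reflux in patients with recurrent laryngeal papillomatosis.",
  "Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4",
  "Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De Marco EK",
  "Evidence of a relation between gastroesophaeal reflux and pediatric respiratory disorders increases every year.",
  "PMID: 17589729 [PubMed - in process]"
)
res <- extract_fields(paras)
print(res[["PMID"]])
```

With this approach an item that has nothing between the two labels gets NA for Abstract, rather than picking up whichever paragraph happens to sit in 4th place.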