Hello! I am not experienced enough to know whether I have found a bug or whether I am just ignorant.
I have been trying to use the tm package to read in material from RSS 2.0 feeds, which has required writing a reader for that flavour of XML. I get an error, which I think I've tracked down:

    Error : 1: EntityRef: expecting ';'

The feed being processed is from WordPress: http://scottbw.wordpress.com/feed/

Note that it contains a number of entity references in various places. The trouble-makers seem to be the &amp; references that stand for the "&" in a URL query string:

    <media:content url="http://0.gravatar.com/avatar/a1033a3e5956f5db65e0cc20f5ea167f?s=96&amp;d=identicon&amp;r=G" medium="image">

AFAIK, this is a correct encoding.

Parsing the feed with the following two lines and then inspecting "t" shows that the &amp; references have been translated to "&" while other entity refs have not:

    a <- readLines(url(as.character(feeds[2,2])))
    t <- XML::xmlTreeParse(a, replaceEntities=FALSE, asText=TRUE)

I'm guessing this is what breaks things when I try to do things with tm:

    rss2Reader <- readXML(
        spec = list(
            Author = list("node", "/item/creator"),
            Content = list("node", "/item/description"),
            DateTimeStamp = list("function", function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
            Heading = list("node", "/item/title"),
            ID = list("function", function(x) tempfile()),
            Origin = list("node", "/item/link")),
        doc = PlainTextDocument())

    rss2Source <- function(x, encoding = "UTF-8")
        XMLSource(x,
                  function(tree) XML::getNodeSet(XML::xmlRoot(tree), "/rss/channel/item"),
                  rss2Reader, encoding)

    feed.rss2 <- rss2Source(url("http://scottbw.wordpress.com/feed/"))
    corp1 <- Corpus(feed.rss2, readerControl=list(language="en"))

I've googled around for this problem but got nowhere. Have I missed something? Any help will be received gratefully; this was supposed to be the easy part!
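In case it helps anyone check my guess, here is a minimal sketch in base R only (no tm or XML calls) of how one might detect and re-escape a bare "&" in text before it is handed to a parser. The names bare_amp, has_bare_amp and escape_bare_amp are my own inventions for illustration, and I am assuming that a bare "&" in the attribute value is really what the parser objects to; I have not confirmed that this cures the tm error.

```r
# Matches "&" that does NOT start a well-formed entity or character reference
# (named like &amp;, decimal like &#8217;, or hex like &#x2019;).
# Hypothetical helper pattern, not part of tm or XML.
bare_amp <- "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"

# TRUE for each element of x that contains at least one bare "&".
has_bare_amp <- function(x) grepl(bare_amp, x, perl = TRUE)

# Re-escape every bare "&" as "&amp;"; existing references are left untouched.
escape_bare_amp <- function(x) gsub(bare_amp, "&amp;", x, perl = TRUE)

# The query string as it looks once "&amp;" has been decoded to "&".
raw <- "http://0.gravatar.com/avatar/a1033a3e5956f5db65e0cc20f5ea167f?s=96&d=identicon&r=G"
fixed <- escape_bare_amp(raw)

print(has_bare_amp(raw))    # does the decoded form contain a bare "&"?
print(has_bare_amp(fixed))  # and after re-escaping?
print(fixed)
```

The same function could be applied to the character vector from readLines() before parsing, e.g. a <- escape_bare_amp(a). One caveat I can see: it would also rewrite ampersands inside CDATA sections, where a bare "&" is legal, so it is a blunt instrument rather than a proper fix.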
Cheers,
Adam