Hello!

I am not experienced enough to know whether I have found a bug or
whether I am just ignorant.

I have been trying to use the tm package to read in material from RSS
2.0 feeds, which has required grappling with writing a reader for that
flavour of XML. I get an error - "Error : 1: EntityRef: expecting ';'" -
which I think I've tracked down.

The feed being processed is from Wordpress:
http://scottbw.wordpress.com/feed/

Note that it contains a number of entity references in various places.
The trouble-makers seem to be &#38; references that are the "&" in a URL
query string:
<media:content
url="http://0.gravatar.com/avatar/a1033a3e5956f5db65e0cc20f5ea167f?s=96&#38;d=identicon&#38;r=G"
 medium="image">

AFAIK, this is a correct encoding.

Parsing this with the following two lines followed by inspecting "t"
shows that the &#38; references have been translated to "&" while other
entity refs have not.

a <- readLines(url(as.character(feeds[2,2])))
t <- XML::xmlTreeParse(a, replaceEntities = FALSE, asText = TRUE)


I'm guessing this is what breaks things when I try to do things with tm:
rss2Reader <- readXML(
        spec = list(
                Author = list("node", "/item/creator"), 
                Content = list("node", "/item/description"),
                DateTimeStamp = list("function",
                        function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
                Heading = list("node", "/item/title"),
                ID = list("function", function(x) tempfile()),
                Origin = list("node", "/item/link")),
        doc = PlainTextDocument())

rss2Source <- function(x, encoding = "UTF-8")
  XMLSource(x,
            function(tree) XML::getNodeSet(XML::xmlRoot(tree), "/rss/channel/item"),
            rss2Reader,
            encoding)

feed.rss2 <- rss2Source(url("http://scottbw.wordpress.com/feed/"))
corp1 <- Corpus(feed.rss2, readerControl = list(language = "en"))
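
To illustrate why I suspect the translated references: once "&#38;" has
become a bare "&", the query string is no longer well-formed XML, and
"expecting ';'" is what a parser reports for an "&" that is not followed
by a complete reference. Below is a rough sketch in base R only. The
helper find_bare_ampersands is something I made up for illustration (it
is not part of tm or XML), the URL is a shortened stand-in, and the name
pattern only approximates XML names with ASCII characters.

```r
# Hypothetical helper: count ampersands that do NOT start a complete
# reference of the form &name; or &#123; or &#x1F;
# (ASCII-only approximation of XML names).
find_bare_ampersands <- function(x) {
  pattern <- "&(?!(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][A-Za-z0-9_.:-]*);)"
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positions > 0
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Stand-in for the attribute as it appears in the feed
escaped <- 'url="http://example.org/avatar?s=96&#38;d=identicon&#38;r=G"'

# What the attribute looks like after &#38; has been translated to "&"
unescaped <- gsub("&#38;", "&", escaped, fixed = TRUE)

find_bare_ampersands(escaped)    # 0: every "&" starts a complete reference
find_bare_ampersands(unescaped)  # 2: "&d=" and "&r=" have no closing ";"
```

If the text that tm ends up re-parsing looks like the second string, that
would explain the error, but I have not confirmed that this is what
happens inside XMLSource.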


I've googled around for this problem but got nowhere. Have I missed
something?

Any help will be received gratefully; this was supposed to be the easy
part!

Cheers, Adam

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
