Mike wrote:

> Hi, I am using Python to scrape web pages and I do not have problem
> unless I run into a site that is utf-8. It seems & is changed to
> &amp; when the site is utf-8.
>
> [...]
> Any ideas?

How about using the universal feedparser from feedparser.org to fetch
and parse the RSS from Reuters?  That's what I do and it works like a
charm.

#v+

>>> import feedparser
>>> rss = feedparser.parse('http://today.reuters.com/rss/topNews')
>>> for what in ('link', 'title', 'summary'):
...     print rss.entries[0][what]
...     print
...
http://today.reuters.com/news/newsarticle.aspx?type=topNews&storyid=2005-10-05T193846Z_01_DIT561620_RTRUKOC_0_US-COURT-SUICIDE.xml

Top court seems closely divided on suicide law

During arguments, the justices sharply questioned both sides on whether then-Attorney General John Ashcroft had the power under federal law in 2001 to bar distribution of controlled drugs to assist suicides, regardless of state law.

>>>

#v-

Cheers,

-- 
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/
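
P.S. If parsing the feed is not an option and the scraped text really
does arrive with ampersands escaped as entity references, they can
usually be decoded after the fact. What follows is only a minimal
sketch, assuming a current Python 3 (the standard `html` module is not
available in the Python 2 session above). The helper name
`decode_entities` and the sample string are made up for illustration
and are not part of feedparser.

```python
import html


def decode_entities(text):
    """Turn entity references such as &amp; back into plain characters."""
    return html.unescape(text)


# Made-up sample resembling an escaped query string in a scraped link.
escaped = "newsarticle.aspx?type=topNews&amp;storyid=123"
print(decode_entities(escaped))
```

Decoding should happen exactly once, on text that is known to be
escaped. Running it over text that is already plain risks mangling
any literal entity-like sequences it contains.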