Re: Replacing utf-8 characters

Klaus Alexander Seistrup Wed, 05 Oct 2005 13:15:46 -0700

Mike wrote:

> Hi, I am using Python to scrape web pages and I do not have any problems 
> unless I run into a site that uses UTF-8.  It seems & is changed to 
> &amp; when the site is UTF-8.
>
>       [...]


> Any ideas?

How about using the Universal Feed Parser from feedparser.org to fetch 
and parse the RSS from Reuters?  That's what I do, and it works like a 
charm.

#v+

>>> import feedparser
>>> rss = feedparser.parse('http://today.reuters.com/rss/topNews')
>>> for what in ('link', 'title', 'summary'):
...     print(rss.entries[0][what])
...     print()
...
http://today.reuters.com/news/newsarticle.aspx?type=topNews&storyid=2005-10-05T193846Z_01_DIT561620_RTRUKOC_0_US-COURT-SUICIDE.xml

Top court seems closely divided on suicide law

During arguments, the justices sharply questioned both sides on whether 
then-Attorney General John Ashcroft had the power under federal law in 2001 to 
bar distribution of controlled drugs to assist suicides, regardless of state 
law.
>>> 

#v-
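
If you would rather keep your own scraper, here is a minimal sketch of 
undoing the escaping by hand.  It assumes the only problem is that 
entities such as &amp; show up in the text you extract (I can't tell 
from the elided code), and that you are on Python 3.4 or later, where 
the standard library has html.unescape.  The helper name and the sample 
URL are made up for illustration.

```python
import html


def unescape_entities(text):
    """Turn HTML entities such as &amp; back into literal characters."""
    return html.unescape(text)


# Hypothetical sample: a link as it might appear inside RSS or HTML source.
escaped = "http://example.com/news?type=topNews&amp;storyid=123"
print(unescape_entities(escaped))
# prints: http://example.com/news?type=topNews&storyid=123
```

Text without entities passes through unchanged, so it should be safe to 
apply to every string you pull out of a page.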

Cheers,

-- 
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/