Re: Using lxml to screen scrap a site, problem with charset

Tim Arnold Mon, 02 Feb 2009 21:10:50 -0800

"?????? ???????????" <[email protected]> wrote in message 
news:[email protected]...
> So, I'm using lxml to screen scrape a site that uses the Cyrillic
> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
> ..content-type.. charset=..> header, but does have an HTTP header that
> specifies the charset... so they are standards compliant enough.
>
> Now when I run this code:
>
> from lxml import html
> doc = html.parse('http://a1.com.mk/')
> root = doc.getroot()
> title = root.cssselect('head title')[0]
> print title.text
>
> the title.text is a unicode string, but it has been wrongly decoded as
> latin1 -> unicode
>
> So, is this a deficiency/bug in lxml, or am I doing something wrong?
> Also, what are my other options here?
>
>
> I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.
>
> -- 
> ?????? ( http://softver.org.mk/damjan/ )
>
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." - Brian W. Kernighan
>


The way I do that is to open the file with the codecs module 
(encoding='cp1251'), read it into a variable, and feed that to the parser.
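
Roughly, that looks like the sketch below. It assumes the page has already 
been saved to a local file, and the helper names read_cp1251 and parse_title 
are made up for illustration. The lxml import is kept inside parse_title so 
the decoding step can be tried on its own.

```python
import codecs


def read_cp1251(path):
    """Return the contents of a windows-1251 (cp1251) file as unicode text."""
    # codecs.open decodes while reading, so the parser never has to guess
    # the charset from the markup.
    with codecs.open(path, "r", encoding="cp1251") as f:
        return f.read()


def parse_title(text):
    """Parse already-decoded HTML text with lxml and return the <title> text."""
    # lxml is a third-party package; imported here so that read_cp1251
    # stays usable without it.
    from lxml import html

    root = html.fromstring(text)
    return root.findtext(".//title")
```

With a saved copy of the page (the file name is only an example), 
parse_title(read_cp1251('page.html')) should give the title as properly 
decoded unicode rather than latin1 mojibake.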

--Tim


--
http://mail.python.org/mailman/listinfo/python-list
