"?????? ???????????" <gdam...@gmail.com> wrote in message news:ciqh56-ses....@archaeopteryx.softver.org.mk... > So, I'm using lxml to screen scrap a site that uses the cyrillic > alphabet (windows-1251 encoding). The sites HTML doesn't have the <META > ..content-type.. charset=..> header, but does have a HTTP header that > specifies the charset... so they are standards compliant enough. > > Now when I run this code: > > from lxml import html > doc = html.parse('http://a1.com.mk/') > root = doc.getroot() > title = root.cssselect(('head title'))[0] > print title.text > > the title.text is ? unicode string, but it has been wrongly decoded as > latin1 -> unicode > > So.. is this a deficiency/bug in lxml or I'm doing something wrong. > Also, what are my other options here? > > > I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux if matters. > > -- > ?????? ( http://softver.org.mk/damjan/ ) > > "Debugging is twice as hard as writing the code in the first place. > Therefore, if you write the code as cleverly as possible, you are, > by definition, not smart enough to debug it." - Brian W. Kernighan >

The way I do that is to open the file with the codecs module, passing
encoding='cp1251', read it into a variable, and feed the resulting unicode
string to the parser.

--Tim
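
The decode-first approach described in the reply might look roughly like the sketch below. It is only an illustration under a few assumptions: it targets Python 3 rather than the 2.6.1 mentioned in the question, it uses the standard library's html.parser as a stand-in for lxml so that it runs without extra packages, and TitleParser, read_title, demo and the sample page are hypothetical names and data, not anything from the original posts. With lxml, the decoded string would be handed to the lxml parser instead.

```python
import codecs
import os
import tempfile
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collect the text inside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_title(path, encoding="cp1251"):
    # Decode explicitly with a known charset instead of letting the
    # parser guess. In Python 3 the built-in open(path, encoding=...)
    # would work the same way.
    with codecs.open(path, "r", encoding=encoding) as fh:
        text = fh.read()
    parser = TitleParser()
    parser.feed(text)  # the parser only ever sees decoded text
    parser.close()
    return parser.title


# Made-up sample page; the title is the Cyrillic word "Вести".
SAMPLE = (
    "<html><head><title>\u0412\u0435\u0441\u0442\u0438</title></head>"
    "<body></body></html>"
)


def demo():
    # Write the sample as windows-1251 bytes, then read it back twice:
    # once with the right codec and once with latin-1.
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(SAMPLE.encode("cp1251"))
        right = read_title(path)
        wrong = read_title(path, encoding="latin-1")
    finally:
        os.remove(path)
    return right, wrong
```

Calling demo() returns the title decoded both ways. The cp1251 result is the original Cyrillic text, while the latin-1 result is mojibake, which matches the symptom described in the question.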