Hello,

I'm using Debian Linux, Python 2.4.4, and uTidylib (http://utidylib.berlios.de/). I wrote a few simple functions that fetch a web page and convert it from windows-1251 to UTF-8, and I'd then like to clean up the HTML with uTidylib.
Here are the two pages I use to test my program:

http://www.ya.ru/ (in this case everything works OK)
http://www.yellow-pages.ru/rus/nd2/qu5/ru15632 (in this case tidy returns nothing, just an empty string)

Code:
--------------
# coding: utf-8
import urllib, urllib2, tidy

def get_page(url):
    # Fetch the raw page bytes, sending a browser-like User-Agent.
    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
    headers = { 'User-Agent' : user_agent }
    data = {}
    req = urllib2.Request(url, data, headers)
    responce = urllib2.urlopen(req)
    page = responce.read()
    return page

def convert_1251(page):
    # Re-encode the page from windows-1251 to UTF-8.
    p = page.decode('windows-1251')
    u = p.encode('utf-8')
    return u

def clean_html(page):
    # Run the page through tidy and return the cleaned document.
    tidy_options = {
        'output_xhtml' : 1,
        'add_xml_decl' : 1,
        'indent' : 1,
        'input-encoding' : 'utf8',
        'output-encoding' : 'utf8',
        'tidy_mark' : 1,
    }
    cleaned_page = tidy.parseString(page, **tidy_options)
    return cleaned_page

test_url = 'http://www.yellow-pages.ru/rus/nd2/qu5/ru15632'
#test_url = 'http://www.ya.ru/'
#f = open('yp.html', 'r')
#p = f.read()
print clean_html(convert_1251(get_page(test_url)))
--------------

What am I doing wrong? Can anyone help, please?
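
P.S. To check whether the conversion step could be the culprit, the same decode/encode logic as convert_1251 can be run on a hard-coded sample instead of a downloaded page. This is only a minimal sketch: the helper name cp1251_to_utf8 and the sample bytes are made up for illustration, and it assumes an interpreter that accepts b'' literals (Python 2.6 or newer), so it will not run unchanged on 2.4.4.

```python
# Hypothetical helper mirroring convert_1251 from the post above.
def cp1251_to_utf8(raw):
    """Decode windows-1251 bytes and re-encode them as UTF-8 bytes."""
    text = raw.decode('windows-1251')
    return text.encode('utf-8')

# Made-up sample: a six-letter Russian word encoded in windows-1251.
sample = b'\xcf\xf0\xe8\xe2\xe5\xf2'
converted = cp1251_to_utf8(sample)

# The converted bytes, read as UTF-8, should be the same text
# as the original bytes read as windows-1251.
round_trip_ok = converted.decode('utf-8') == sample.decode('windows-1251')

print("round trip ok: %s" % round_trip_ok)
print("%d bytes in, %d bytes out" % (len(sample), len(converted)))
```

If the round trip holds, the conversion is probably not at fault, and the empty result more likely comes from how tidy handles the markup of the second page.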