Pak (or Andrei, whichever is your first name),

My proposal below:
----- Original Message -----
From: <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Sunday, July 30, 2006 8:52 PM
Subject: Re: Html character entity conversion

> danielx wrote:
> > [EMAIL PROTECTED] wrote:
> > > Here is my script:
> > >
> > > from mechanize import *
> > > from BeautifulSoup import *
> > > import StringIO
> > > b = Browser()
> > > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > > b.select_form(nr=0)
> > > b["source"] = "hello python"
> > > html = b.submit().get_data()
> > > soup = BeautifulSoup(html)
> > > print soup.find("span", id = "r_text").string
> > >
> > > OUTPUT:
> > > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
> > > &#1087;&#1080;&#1090;&#1086;&#1085;
> > > ----------
> > > In russian it looks like:
> > > "привет питон"
> > >
> > > How can I translate this using standard Python libraries??
> > >
> > > --
> > > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

I've been proposing solutions of late using a stream editor I recently wrote, realizing each time how well it works in a variety of different situations. I can only hope I am not beginning to get on people's nerves (Here he comes again with his damn thing!).

I base the following on proposals others have made so far, because I haven't used Unicode strings and know little about them. If nothing else, I do think this is a rather elegant way to translate the ampersand codes to Unicode strings. Having to read them through an 'eval', though, doesn't seem to be the ultimate solution. I couldn't assign a unicode string to a variable so that it would print text as Claudio proposed.

Here is my htm example:

>>> htm = StringIO.StringIO ('''
<htm>
<!-- Examen -->
<head><title>Deuxième question</title></head>
<body bgcolor="#beb4a0" text="#000082" etc.
>
<b>L´élève doit lire et traduire:</b>
&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
</body>
</htm>
''')

And here is my SE hack:

>>> import SE  # Available at the Cheese Shop
>>> Ampersand_Filter = SE.SE (' <EAT> "~&#[0-9]+;~==(10)" ')
>>> for line in htm:
        line = line [:-1]   # Strip the trailing newline
        # A list of the ampersand codes found in the current line
        ampersand_codes = Ampersand_Filter (line)
        if ampersand_codes:
            # From it we edit the substitution definitions for the current line
            substitutions = ''
            for code in ampersand_codes.split ('\n')[:-1]:
                substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
            # And make a custom Editor just for the current line
            Line_Unicoder = SE.SE (substitutions)
            unicode_line = Line_Unicoder (line)
            print eval ('u"%s"' % unicode_line)
        else:
            print line

<htm>
<!-- Examen -->
<head><title>Deuxième question</title></head>
<body bgcolor="#beb4a0" text="#000082" etc.
>
<b>L´élève doit lire et traduire:</b>
привет питон<br>
</body>
</htm>

This is a textbook example of dynamic substitutions. Typically SE compiles static substitution lists. But with 2**16 (?) unicodes, building a static list would be absurd, if possible at all. So we dynamically make custom substitutions for each line after extracting the ampersand escapes that may be there.

Next we would like to fix the regular ASCII ampersand escapes and also strip the tags. That is a simple question of preprocessing the file.

>>> Legibilizer = SE.SE ('htm2iso.se "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')

'htm2iso.se' is a substitution definition file that maps the standard ASCII ampersand escapes to characters. It is included in the SE package. You can name as many definition files as you want. In a definition string the name of a file is equivalent to its contents.

>>> htm.seek (0)
>>> htm_no_tags = Legibilizer (htm.read ())
>>> for line in htm_no_tags.split ('\n'):
        if line.strip () == '': continue
        ampersand_codes = Ampersand_Filter (line)
        ...
        (same as above)

Deuxième question
L'élève doit lire et traduire:
привет питон

Whether this serves your purpose I don't really know. How you can use it other than to read it in the IDLE window, I don't know either. I tried to copy it out, but it doesn't survive the operation, and the paste has question marks or squares in place of the Russian letters.

Regards

Frederic

--
http://mail.python.org/mailman/listinfo/python-list
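
For comparison, the per-match substitution idea can also be sketched with nothing but the standard library. This is only a minimal sketch under assumptions: it targets Python 3 (where `chr` returns a Unicode character and `html.unescape` exists), it does not use SE, and the names `decode_numeric_entities` and `legibilize` are made up for illustration. It has not been tried against the translate.ru page itself.

```python
import html
import re

# Decimal character references such as &#1087; (the same pattern the
# Ampersand_Filter above looks for).
NUMERIC_ENTITY = re.compile(r"&#([0-9]+);")

# Comments first, then ordinary tags; DOTALL lets a tag span several lines.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<.*?>", re.DOTALL)


def _entity_to_char(match):
    """Return the character for one matched reference (hypothetical helper)."""
    code_point = int(match.group(1))
    if code_point > 0x10FFFF:
        # Not a valid Unicode code point: leave the text untouched.
        return match.group(0)
    return chr(code_point)


def decode_numeric_entities(text):
    """Replace every decimal reference in text with its character."""
    return NUMERIC_ENTITY.sub(_entity_to_char, text)


def legibilize(text):
    """Strip comments and tags, then decode named and numeric entities."""
    return html.unescape(TAG_OR_COMMENT.sub("", text))


sample = (
    "<b>traduire:</b> "
    "&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
    "&#1087;&#1080;&#1090;&#1086;&#1085;<br>"
)
decoded = decode_numeric_entities(sample)  # tags kept, references decoded
plain = legibilize(sample)                 # tags removed as well
```

Passing a function to `re.sub` plays the role of the custom editor built for each line: the replacement is computed per match, so no static table of code points is needed and no `eval` is involved. Whether `plain` displays correctly when printed still depends on the terminal being able to show Cyrillic.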