Rares Vernica wrote:
> Hi,
>
> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
> "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only this
> dictionary is the opposite of what I need, it does not have "&nbsp;".
>
> It has to be in python 2.4.
>
> Thanks a lot,
> Ray

One way is this:

>>> import SE   # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')   # HTM2ISO.se is included
'output_file_name'

For repeated translations the SE object would be assigned to a variable:

>>> HTM_Decoder = SE.SE ('HTM2ISO.se')

SE objects take and return strings as well as file names, which is useful
for translating string variables, doing line-by-line translations and for
interactive development or verification. A simple way to check a
substitution set is to use its definitions as test data. The following is
a section of the definition file HTM2ISO.se:

>>> test_string = '''
&oslash;=(xf8)   #  248  f8
&ugrave;=(xf9)   #  249  f9
&uacute;=(xfa)   #  250  fa
&ucirc;=(xfb)    #  251  fb
&uuml;=(xfc)     #  252  fc
&yacute;=(xfd)   #  253  fd
&thorn;=(xfe)    #  254  fe
&#233;=(xe9)
&#234;=(xea)
&#235;=(xeb)
&#236;=(xec)
&#237;=(xed)
&#238;=(xee)
&#239;=(xef)
'''

>>> print HTM_Decoder (test_string)

ø=(xf8)   #  248  f8
ù=(xf9)   #  249  f9
ú=(xfa)   #  250  fa
û=(xfb)   #  251  fb
ü=(xfc)   #  252  fc
ý=(xfd)   #  253  fd
þ=(xfe)   #  254  fe
é=(xe9)
ê=(xea)
ë=(xeb)
ì=(xec)
í=(xed)
î=(xee)
ï=(xef)

Another feature of SE is modularity.

>>> strip_tags = '''
   ~<(.|\x0a)*?>~=(9)                # one tag to one tab
   ~<!--(.|\x0a)*?-->~=(9)           # one comment to one tab
|  # run
   "~\x0a[ \x09\x0d\x0a]*~=(x0a)"    # delete empty lines
   ~\t+~=(32)                        # one or more tabs to one space
   ~\x20\t+~=(32)                    # one space and one or more tabs to one space
   ~\t+\x20~=(32)                    # one or more tabs and one space to one space
'''

>>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter

If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it
together with HTM2ISO.se:

>>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')   # Order doesn't matter

Or, if you have two SE objects, one for stripping tags and one for
decoding the ampersands, you can nest them like this:

>>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"

>>> print Tag_Stripper (HTM_Decoder (test_string))
René est un garçon qui paraît plus âgé.

Nesting works with file names too, because file names are returned:

>>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
'output_file_name'

Frederic
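
For comparison, here is a minimal sketch of the same entity-decoding idea using only the standard library, for cases where installing SE is not an option. It is not part of SE and not how HTM2ISO.se works internally. The function name unescape_entities and the regular expression are assumptions made for illustration, and the try/except import is there only so the sketch runs on Python 2.4 as well as Python 3. It relies on htmlentitydefs.name2codepoint, which maps entity names to code points (the direction the question asks for).

```python
import re

try:
    # Python 2 (name2codepoint is available from 2.3 on)
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:
    # Python 3 moved the table to html.entities
    from html.entities import name2codepoint
    to_char = chr

# Matches &name;, &#123; and &#x7b; style references (illustrative pattern).
_ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace named and numeric character references with their characters.

    Hypothetical helper: references that are unknown or out of range are
    left in the text unchanged.
    """
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return to_char(int(ref[2:], 16))   # hexadecimal reference
            if ref[0] == "#":
                return to_char(int(ref[1:]))       # decimal reference
            return to_char(name2codepoint[ref])    # named reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)
    return _ENTITY.sub(replace, text)
```

Under Python 2 the input should be a unicode string (or plain ASCII), since the replacements are unicode characters. Calling unescape_entities("Ren&eacute; &amp; Andr&#233;") returns the text "René & André", and "&nbsp;" becomes the no-break space character (code point 0xA0).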