Re: unescape HTML entities

Frederic Rentsch Sun, 29 Oct 2006 04:34:23 -0800

Rares Vernica wrote:
> Hi,
>
> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;", 
> "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only this 
> dictionary is the opposite of what I need, it does not have "&nbsp;".
>
> It has to be in python 2.4.
>
> Thanks a lot,
> Ray
>
One way is this:


 >>> import SE   # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
 >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')   # HTM2ISO.se is included
'output_file_name'

For repeated translations the SE object would be assigned to a variable:

 >>> HTM_Decoder = SE.SE ('HTM2ISO.se')

SE objects take and return strings as well as file names, which is 
useful for translating string variables, for line-by-line translations 
and for interactive development or verification. A simple way to check 
a substitution set is to use its definitions as test data. The 
following is a section of the definition file HTM2ISO.se:

test_string = '''
&oslash;=(xf8)   #  248  f8
&ugrave;=(xf9)   #  249  f9
&uacute;=(xfa)   #  250  fa
&ucirc;=(xfb)    #  251  fb
&uuml;=(xfc)     #  252  fc
&yacute;=(xfd)   #  253  fd
&thorn;=(xfe)    #  254  fe
&#233;=(xe9)
&#234;=(xea)
&#235;=(xeb)
&#236;=(xec)
&#237;=(xed)
&#238;=(xee)
&#239;=(xef)
'''

 >>> print HTM_Decoder (test_string)

ø=(xf8)   #  248  f8
ù=(xf9)   #  249  f9
ú=(xfa)   #  250  fa
û=(xfb)    #  251  fb
ü=(xfc)     #  252  fc
ý=(xfd)   #  253  fd
þ=(xfe)    #  254  fe
é=(xe9)
ê=(xea)
ë=(xeb)
ì=(xec)
í=(xed)
î=(xee)
ï=(xef)
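
For readers without SE at hand, the kind of substitution shown above can 
be approximated with the standard library. This is only a sketch, not how 
SE works internally: unescape_entities is a hypothetical helper name, and 
the code is written for Python 3, where the entity table lives in 
html.entities. Under Python 2.4 the same table should be importable from 
htmlentitydefs as name2codepoint, with unichr in place of chr.

```python
import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs.name2codepoint

# Matches decimal (&#233;), hexadecimal (&#xe9;) and named (&eacute;) entities.
_ENTITY = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')


def _replace(match):
    """Return the character for one matched entity, or the entity unchanged."""
    body = match.group(1)
    if body[0] == '#':
        if body[1] in 'xX':
            return chr(int(body[2:], 16))   # hexadecimal character reference
        return chr(int(body[1:]))           # decimal character reference
    if body in name2codepoint:
        return chr(name2codepoint[body])    # named entity such as nbsp
    return match.group(0)                   # unknown name: leave as-is


def unescape_entities(text):
    """Hypothetical helper: decode the HTML entities found in text."""
    return _ENTITY.sub(_replace, text)


decoded = unescape_entities('Ren&eacute; &#233; &#xe9; &nbsp; &bogus;')
```

Unknown names such as "&bogus;" pass through untouched, so the function 
can be run over text that is only partly HTML.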

Another feature of SE is modularity: definition sets can be combined.

 >>> strip_tags = '''
   ~<(.|\x0a)*?>~=(9)               # one tag to one tab
   ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
|                                   # run
   "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
   ~\t+~=(32)                       # one or more tabs to one space
   ~\x20\t+~=(32)                   # one space and one or more tabs to one space
   ~\t+\x20~=(32)                   # one or more tabs and one space to one space
'''

 >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter

If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it 
together with HTM2ISO.se:

 >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
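
The idea of combining definition sets can be illustrated without SE. The 
sketch below is a stand-in built on assumptions, not SE's implementation: 
make_translator and the two small tables are hypothetical, and the tables 
hold plain strings rather than SE definitions. It merges any number of 
substitution tables into one single-pass translator, so the order in which 
the tables are named does not change the result as long as their targets 
do not overlap.

```python
import re


def make_translator(*tables):
    """Hypothetical helper: merge substitution tables into one translator."""
    merged = {}
    for table in tables:
        merged.update(table)
    # Longest targets first, so a longer target wins over its own prefix.
    targets = sorted(merged, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(target) for target in targets))
    # All substitutions happen in a single left-to-right pass over the text.
    return lambda text: pattern.sub(lambda m: merged[m.group(0)], text)


# Two independent, illustrative tables (assumed data, not taken from HTM2ISO.se).
entities = {'&eacute;': '\xe9', '&ccedil;': '\xe7'}
markup = {'<i>': '', '</i>': ''}

decode_and_strip = make_translator(entities, markup)
same_translator = make_translator(markup, entities)   # tables named in reverse
```

Both translators give the same output for the same input, because the 
merged table is identical either way.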

Or, if you have two SE objects, one for stripping tags and one for 
decoding the ampersands, you can nest them like this:

 >>> test_string = ("<p class=MsoNormal style='line-height:110%'>"
 ...     "<i>Ren&eacute;</i> est un gar&ccedil;on qui "
 ...     "para&icirc;t plus &acirc;g&eacute;. </p>")

 >>> print Tag_Stripper (HTM_Decoder (test_string))
  René est un garçon qui paraît plus âgé.

Nesting works with file names too, because file names are returned:

 >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
'output_file_name'
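
The nesting of a decoder and a tag stripper on strings can also be 
sketched with the standard library alone. The code is an approximation 
under stated assumptions, not a replacement for the SE definitions above: 
decode_entities and strip_tags are hypothetical names, html.unescape 
needs Python 3.4 or later (it is not available in Python 2.4), and the 
regular expressions handle only simple markup.

```python
import html
import re

_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)   # comments may span lines
_TAG = re.compile(r'<[^>]*>')                     # any single tag
_WHITESPACE = re.compile(r'\s+')                  # runs of whitespace


def decode_entities(text):
    """Hypothetical decoder: turn HTML entities into characters."""
    return html.unescape(text)


def strip_tags(text):
    """Hypothetical stripper: drop comments and tags, tidy the whitespace."""
    text = _COMMENT.sub(' ', text)
    text = _TAG.sub(' ', text)
    return _WHITESPACE.sub(' ', text).strip()


sample = ("<p class=MsoNormal style='line-height:110%'>"
          "<i>Ren&eacute;</i> est un gar&ccedil;on qui "
          "para&icirc;t plus &acirc;g&eacute;. </p>")

# Nest the two steps: decode first, then strip, as in the session above.
result = strip_tags(decode_entities(sample))
```

Decoding first mirrors the order used above. Text that contains "&lt;" 
or "&gt;" would call for stripping the tags first, since a decoded "<" 
would otherwise be taken for the start of a tag.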


Frederic



