Re: unescape HTML entities

Rares Vernica Wed, 01 Nov 2006 16:53:46 -0800

Hi,

I downloaded 2.2 beta, just to be sure I have the same version you 
specified. (The file names are no longer funny.) Anyway, it does not seem 
to work as you described:


In [14]: import SE

In [15]: SE.version
-------> SE.version()
Out[15]: 'SE 2.2 beta - SEL 2.2 beta'

In [16]: HTM_Decoder = SE.SE ('HTM2ISO.se')

In [17]: test_string = '''
    ....: &oslash;=(xf8)   #  248  f8
    ....: &ugrave;=(xf9)   #  249  f9
    ....: &uacute;=(xfa)   #  250  fa
    ....: &ucirc;=(xfb)    #  251  fb
    ....: &uuml;=(xfc)     #  252  fc
    ....: &yacute;=(xfd)   #  253  fd
    ....: &thorn;=(xfe)    #  254  fe
    ....: &#233;=(xe9)
    ....: &#234;=(xea)
    ....: &#235;=(xeb)
    ....: &#236;=(xec)
    ....: &#237;=(xed)
    ....: &#238;=(xee)
    ....: &#239;=(xef)
    ....: '''

In [18]: print HTM_Decoder (test_string)

&oslash;=(xf8)   #  248  f8
&ugrave;=(xf9)   #  249  f9
&uacute;=(xfa)   #  250  fa
&ucirc;=(xfb)    #  251  fb
&uuml;=(xfc)     #  252  fc
&yacute;=(xfd)   #  253  fd
&thorn;=(xfe)    #  254  fe
&#233;=(xe9)
&#234;=(xea)
&#235;=(xeb)
&#236;=(xec)
&#237;=(xed)
&#238;=(xee)
&#239;=(xef)


In [19]:

Thanks,
Ray



Frederic Rentsch wrote:
> Rares Vernica wrote:
>> Hi,
>>
>> How can I unescape HTML entities like "&nbsp;"?
>>
>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;", 
>> "&lt;", and "&gt;".
>>
>> Also, I know about htmlentitydefs.entitydefs, but not only is this 
>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>
>> It has to be in python 2.4.
>>
>> Thanks a lot,
>> Ray
>>
> One way is this:
> 
>  >>> import SE    # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>  >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')    # HTM2ISO.se is included
> 'output_file_name'
> 
> For repeated translations the SE object would be assigned to a variable:
> 
>  >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
> 
> SE objects take and return strings as well as file names which is useful 
> for translating string variables, doing line-by-line translations and 
> for interactive development or verification. A simple way to check a 
> substitution set is to use its definitions as test data. The following 
> is a section of the definition file HTM2ISO.se:
> 
> test_string = '''
> &oslash;=(xf8)   #  248  f8
> &ugrave;=(xf9)   #  249  f9
> &uacute;=(xfa)   #  250  fa
> &ucirc;=(xfb)    #  251  fb
> &uuml;=(xfc)     #  252  fc
> &yacute;=(xfd)   #  253  fd
> &thorn;=(xfe)    #  254  fe
> &#233;=(xe9)
> &#234;=(xea)
> &#235;=(xeb)
> &#236;=(xec)
> &#237;=(xed)
> &#238;=(xee)
> &#239;=(xef)
> '''
> 
>  >>> print HTM_Decoder (test_string)
> 
> ø=(xf8)   #  248  f8
> ù=(xf9)   #  249  f9
> ú=(xfa)   #  250  fa
> û=(xfb)    #  251  fb
> ü=(xfc)     #  252  fc
> ý=(xfd)   #  253  fd
> þ=(xfe)    #  254  fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
> 
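The idea of using a definition file as its own test data can be automated.
The following is only a sketch and not part of SE: it assumes Python 3.4 or
later (html.unescape stands in for HTM_Decoder), and the names DEFINITION
and check_definitions are hypothetical. It parses lines of the form
&name;=(xHH), decodes the left-hand side, and collects every line whose
result differs from the expected character.

```python
import re
from html import unescape  # Python 3.4+; a stand-in for the SE decoder

# Matches definition lines such as "&oslash;=(xf8)" or "&#233;=(xe9)".
DEFINITION = re.compile(r'^(&[^;=\s]+;)=\(x([0-9a-fA-F]{2})\)')

def check_definitions(definitions, decode=unescape):
    """Return (entity, expected, actual) for each definition decoded wrongly."""
    failures = []
    for line in definitions.splitlines():
        match = DEFINITION.match(line.strip())
        if not match:
            continue  # blank lines and anything that is not a definition
        entity, hex_code = match.groups()
        expected = chr(int(hex_code, 16))
        actual = decode(entity)
        if actual != expected:
            failures.append((entity, expected, actual))
    return failures

sample = '''
&oslash;=(xf8)   #  248  f8
&uuml;=(xfc)     #  252  fc
&#233;=(xe9)
'''
print(check_definitions(sample))  # an empty list means every line decoded
```

Passing a decoder that returns its input unchanged (decode=lambda s: s)
reports all three sample lines as failures, which makes a silently
inactive substitution set easy to spot.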
> Another feature of SE is modularity.
> 
>  >>> strip_tags = '''
>    ~<(.|\x0a)*?>~=(9)               # one tag to one tab
>    ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
> |                                   # run
>    "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
>    ~\t+~=(32)                       # one or more tabs to one space
>    ~\x20\t+~=(32)                   # one space and one or more tabs to one space
>    ~\t+\x20~=(32)                   # one or more tabs and one space to one space
> '''
> 
>  >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
> 
> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it 
> together with HTM2ISO.se:
> 
>  >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
> 
> Or, if you have two SE objects, one for stripping tags and one for 
> decoding the ampersands, you can nest them like this:
> 
>  >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
> 
>  >>> print Tag_Stripper (HTM_Decoder (test_string))
>   René est un garçon qui paraît plus âgé.
> 
> Nesting works with file names too, because file names are returned:
> 
>  >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
> 'output_file_name'
> 
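The nesting pattern itself does not depend on SE: any two functions that
take and return a string can be chained the same way. A rough sketch,
assuming Python 3.4 or later; strip_tags and decode_entities are
hypothetical helpers built on re and html.unescape, and the regular
expressions are a simplification rather than a real HTML parser.

```python
import re
from html import unescape  # Python 3.4+

def strip_tags(text):
    """Replace comments and tags with a space, then collapse whitespace."""
    text = re.sub(r'<!--.*?-->', ' ', text, flags=re.DOTALL)  # comments first
    text = re.sub(r'<[^>]*>', ' ', text)                      # then tags
    return re.sub(r'[ \t]+', ' ', text).strip()

def decode_entities(text):
    """Turn named and numeric character references into characters."""
    return unescape(text)

sample = ("<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> "
          "est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>")

# Stripping before decoding keeps a decoded "&lt;" from being mistaken
# for the start of a tag.
print(decode_entities(strip_tags(sample)))
```

For this sample either nesting order gives the same text; the order only
matters when the input contains escaped angle brackets.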
> 
> Frederic
> 
> 
> 
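For comparison, the original question can also be approached with the
standard library alone. This is a minimal sketch, not the SE method:
htmlentitydefs.name2codepoint (html.entities.name2codepoint on Python 3)
maps entity names to code points and does include "nbsp". The function
name unescape_entities is hypothetical. The code is written to avoid
anything newer than Python 2.4, though it is best fed unicode strings
there, since the replacements are unicode characters.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2
except ImportError:
    from html.entities import name2codepoint    # Python 3
    unichr = chr

# Decimal (&#233;), hexadecimal (&#xe9;) and named (&eacute;) references.
_ENTITY = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

def unescape_entities(text):
    """Replace character references; leave unknown ones untouched."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == '#x':
                return unichr(int(body[2:], 16))
            if body[:1] == '#':
                return unichr(int(body[1:]))
            return unichr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            return match.group(0)  # unknown name or out-of-range number
    return _ENTITY.sub(replace, text)

print(repr(unescape_entities(u'a&nbsp;b &eacute; &#233; &#xe9; &bogus;')))
```

Unknown names such as "&bogus;" pass through unchanged, so the function
can be run over text that has already been partly decoded.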

-- 
http://mail.python.org/mailman/listinfo/python-list
