I am puzzled about how lxml and entities should be 'preserved'; the code below illustrates. To preserve them I change & --> &amp; in the source and add resolve_entities=False to the parser definition. The escaping means we only have one kind of entity, &amp;, which lxml will preserve. For whatever reason lxml won't preserve character entities, e.g. &#33;.

The simple parse from string and conversion with tostring shows that the parsing at least took notice of it.

However, I want to create a tuple tree, so I have to use tree.text, tree.getchildren() and tree.tail for access.
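
For reference, the kind of tuple tree I have in mind can be sketched roughly as follows. This is only an illustration: the function name to_tuple_tree and the (tag, attributes, content) layout are assumptions made up for the example, and it uses the standard library's xml.etree.ElementTree so that it runs without lxml. lxml elements expose the same tag, attrib, text and tail attributes.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def to_tuple_tree(elem):
    """Hypothetical helper: turn an element into (tag, attributes, content).

    content interleaves text strings and child tuples in document order.
    """
    content = []
    if elem.text:
        content.append(elem.text)
    # Iterating over the element yields its children; lxml documents
    # getchildren() as deprecated in favour of this.
    for child in elem:
        content.append(to_tuple_tree(child))
        if child.tail:
            # The tail is the text that follows the child's end tag.
            content.append(child.tail)
    return (elem.tag, dict(elem.attrib), content)


root = ET.fromstring('<a x="1">head<b>inner</b>tail of b</a>')
print(to_tuple_tree(root))
# ('a', {'x': '1'}, ['head', ('b', {}, ['inner']), 'tail of b'])
```

Note that the root's own tail is not included, since it belongs to whatever encloses the root.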

When I use those I expected to have to undo the escaping to get back the original entities, but it seems the unescaping has already been done.

Good for me, but if the tree knows how it was created (tostring shows that), why is it ignored with attribute access?
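
One possible reading, offered tentatively, is that the tree does not remember the escaping at all: the parser decodes &amp; to a plain & when it builds the tree, .text returns that decoded character data, and tostring escapes it again when it serialises. A minimal sketch of that round trip, using the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, although the run shown in this post suggests lxml gives the same result for this input); round_trip is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def round_trip(original):
    """Escape '&', parse, and return (decoded text, re-serialised XML)."""
    escaped = original.replace('&', '&amp;')  # same escaping step as in this post
    root = ET.fromstring('<a>' + escaped + '</a>')
    # root.text holds the decoded character data, not the source markup.
    return root.text, ET.tostring(root, encoding='unicode')


original = 'aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
text, serialised = round_trip(original)
print(text == original)  # True: the parser already undid the escaping
print(serialised)        # the serialiser escapes every '&' as '&amp;' again
```

If that reading is right, tostring is not showing how the tree was created; it is just applying the ordinary output escaping to the same decoded string that .text returns.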

if __name__=='__main__':
    from lxml import etree as ET
    #initial xml
    xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
    #escaped xml
    xxml = xml.replace(b'&',b'&amp;')

    myparser = ET.XMLParser(resolve_entities=False)
    tree = ET.fromstring(xxml,parser=myparser)

    #use tostring
    print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')

    #now access the items using text, children & tail
    print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')

When run I see this:

$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None
--
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list
