I have a puzzle over how lxml & entities should be 'preserved'; the code below illustrates. To preserve them I
change & --> &amp; in the source and add resolve_entities=False to the parser definition. The escaping means we
only have one kind of entity, &amp;, which means lxml will preserve it. For whatever reason lxml won't preserve
character entities, eg &#33;.
A simple parse with fromstring followed by a conversion with tostring shows that
the parser at least took notice of the escaping.
However, I want to create a tuple tree, so I have to use tree.text,
tree.getchildren() and tree.tail for access.
When I use those I expected to have to undo the escaping to get back the original entities, but it seems that
has already been done.
Good for me, but if the tree knows how it was created (tostring shows that), why
is that ignored with attribute access?

if __name__=='__main__':
    from lxml import etree as ET
    #initial xml
    xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
    #escaped xml: every & becomes &amp; so only one kind of entity remains
    xxml = xml.replace(b'&',b'&amp;')
    myparser = ET.XMLParser(resolve_entities=False)
    tree = ET.fromstring(xxml,parser=myparser)
    #use tostring
    print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')
    #now access the items using text & children & tail
    print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')

when run I see this
$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None
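
For comparison, here is a minimal sketch of the same round trip using the standard library's xml.etree.ElementTree rather than lxml. That swap, the shortened sample string and the variable names are my assumptions for illustration only. There too the .text attribute holds the character data with one level of escaping removed, while tostring puts the escaping back on output:

```python
# Stand-in for lxml: the stdlib ElementTree (an assumption for illustration).
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample document, text only.
original = b'<a>aaaaa &mysym; &lt; &amp; AAAAA</a>'

# Escape every ampersand so the parser only ever sees &amp;.
escaped = original.replace(b'&', b'&amp;')

tree = ET.fromstring(escaped)

# .text is the parsed character data: each &amp; has become a plain &.
text = tree.text

# tostring serialises the tree and escapes each & again.
serialized = ET.tostring(tree)

print(f'{text=!r}')
print(f'{serialized=!r}')
```

So in this sketch the escaping is not something the tree "remembers"; it is applied by the serialiser each time the text is written out.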
--
Robin Becker