[issue17741] event-driven XML parser

Stefan Behnel Sun, 25 Aug 2013 10:50:41 -0700

Stefan Behnel added the comment:

> Also, even if the new approach is implemented in the next release,
IncrementalParser can stay as a simple synonym to
XMLParser(target=EventBuilder(...)).


No it can't. According to your signature, it accepts a parser instance as 
input. So it can't just create a new one internally.


"""
This is an existing API within ET:

  xml.etree.ElementTree.iterparse(source, events=None, parser=None)

The parser argument is limited: it can't use a custom TreeBuilder. It also
must provide _setevents, so either inherit from XMLParser or just provide
that method in another way [I'll try to improve the documentation to make
all of this explicit]. Whatever we say in this issue, iterparse is here to
stay.
"""

I agree. However, I can't see why this should constrain us when adding a new 
API. Sure, the current implementation puts certain restrictions on the input. 
But I see that as an advantage, not as a constraint. It's currently broken in 
what it accepts as input, so we are free to change that aspect later on. And we 
can now try to get it right for the new API that we add.

Also, AFAICT, passing a parser into iterparse() is a very rare thing to do, if 
only because it doesn't really work, so the likelyhood of breaking code is very 
limited (I'm not saying we should, just thinking about the impact). Note that 
lxml.etree doesn't currently support this at all.


Just to throw some more ideas into the wild, what do you think about making 
this an inherent feature of the XMLParser class itself? That can be done now, 
without major changes to the underlying code. It could be enabled by a new 
keyword argument "collect_events=None" that you could pass into the constructor 
in order to define the set of events. The "events()" method would always be 
available, but would only yield events if there actually are any. Otherwise, it 
would just terminate and be done. So, unless you pass it a set of event names 
at init time, the parser wouldn't collect events and you wouldn't get any out 
of it. Sounds like a clean interface to me.

Collecting the events would work in the same way as it currently works, i.e. it 
would call the target's start() method, and whatever that returns will be added 
as value to the events as a tuple ("start", target_return_value). That means 
that it's up to the parser target to determine what the event iteration 
returns. Can be elements, can be None, can be something else.

As for backwards compatibility with iterparse(), I don't see a problem with 
officially documenting iterparse() as requiring an XMLParser if a parser is 
provided, and as reconfiguring that parser to its needs. It already does that 
anyway and it currently doesn't work for anything but an XMLParser.


Also, on a related note, we might (!) consider the addition of the _setevents() 
method a bug in CPython 3.3. It didn't exist in ElementTree before it was 
merged with cElementTree. Instead, the expected interface was (apparently) that 
the parser that is passed in has a "_parser" attribute that holds an Expat 
parser, and that iterparse() could reconfigure according for itself. Both are 
not public APIs and even differed between ET and cET, so I don't see any harm 
being done, but it would be good to have one way to do it. OTOH, if that way is 
_setevents(), then be it. The advantage would be that this interface is at 
least independent of Expat and could potentially be implemented by other types 
of parsers.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue17741>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue17741] event-driven XML parser

Reply via email to