[issue17741] event-driven XML parser

Eli Bendersky Sun, 25 Aug 2013 06:56:41 -0700

Eli Bendersky added the comment:

Looking at this more, the Parser/Builder abstraction in ET leaks badly, 
especially when it comes to incremental parsing. This manifests in (at least) 
two ways:


1. The parser accepted by iterparse has to use the ET-provided TreeBuilder, 
because the C implementation of TreeBuilder has undocumented fields and hooks 
used directly by the parser.
2. Even though iterparse & IncrementalParser accept a "parser" argument, it 
cannot be just any parser. This parser (in addition to using the ET 
TreeBuilder) must implement an undocumented and obscure _setevents method; this 
makes the original intent of the abstraction - hooking up arbitrary parsers to 
the interface, hard to follow.

To *really* fix this, all these capabilities have to be exposed as APIs. 
TreeBuilder has to support an explicit API for collecting and reporting events. 
XMLParser has to call into this API and either not have _setevents at all or 
have something public and documented. Note also that event hookup in the parser 
makes sense in a way, because we don't want events to be fired when not needed 
(for performance). So we don't want any parser to hook up all events for 
firing, and only incremental tree builders to collect them. So things aren't 
quite *so* simple. In addition, since iterparse (an existing, stable API of the 
module) expects a parser argument, it won't be easy creating an 
IncrementalTreeBuilder, because how would we let people using custom parsers 
with iterparse not change any code and yet keep things working?

--

Since the above presents considerable changes in the internals, to get *really* 
right - we do have a problem for the upcoming release because I'm not sure if 
anyone has the time or will to do it. However, the feature is interesting and 
useful, and the code is cleaner than it used to be. So I propose the following 
compromise that can serve us for the 3.4 release:

IncrementalParser stays, but its methods will be renamed to feed & close, which 
is consistent with the current XML parsing APIs (including 
xml.sax.xmlreader.IncrementalParser). In a future release, we may decide to 
create a IncrementalTreeBuilder anyway but then IncrementalParser can be 
layered on top of it and offered as a convenience.

--

To conclude, I don't think that when looked at as exposing an implementation 
detail of iterparse, IncrementalParser is so horrible. iterparse is already its 
own kind of thing somewhat alien to ET, but it's very useful and important. 
IncrementalParser is just replacing iterparse's "parse whole source" behavior 
with incremental feeds followed by a close. This is similar to the APIs 
xml.sax.xmlreader is exposing, so there is a consistency here. Solving "all the 
implementation problems" would certainly be great, but unfortunately no one 
seems to have time for this at the moment.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17741>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue17741] event-driven XML parser

Reply via email to