[issue43483] Loss of content in simple (but oversize) SAX parsing

Larry Trammell Sat, 13 Mar 2021 15:49:33 -0800


Larry Trammell <ridge...@nwi.net> added the comment:


Not a bug, strictly speaking... more like user abuse.

The parsers (expat as well as SAX) must be able to return content text as a 
sequence of pieces when necessary, for example when the text is interrupted 
by grouping or styling tags (like <span> or <i>).  Extensive text blocks 
might also need to be subdivided for efficient processing.  Users would expect 
hazards like these and be wary.  But how many users would suspect that a quoted 
string only 8 characters long could be returned in multiple pieces?  Or that an 
entity notation could be split down the middle?  Virtually all existing 
tutorial examples showing content extraction are WRONG -- because the ONLY 
content that can be trusted is content that has been passed through some kind 
of aggregator object.  How many users will know this instinctively?  
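
To make the aggregator idea concrete, here is a minimal sketch of one for 
xml.sax.  The class name TextAggregator and its results attribute are made up 
for illustration, not part of the library; the only assumption is that 
characters() may be called any number of times for one run of text, so the 
pieces are joined only when the element ends.

```python
import xml.sax


class TextAggregator(xml.sax.ContentHandler):
    """Hypothetical helper: join character-data pieces per element."""

    def __init__(self):
        super().__init__()
        self._stack = []    # one list of text pieces per open element
        self.results = []   # (element name, complete text), in end-tag order

    def startElement(self, name, attrs):
        # Start a fresh piece list for the newly opened element.
        self._stack.append([])

    def characters(self, content):
        # May fire several times for one text run; only store the piece.
        if self._stack:
            self._stack[-1].append(content)

    def endElement(self, name):
        # Only now is the element's own text known to be complete.
        pieces = self._stack.pop()
        self.results.append((name, "".join(pieces)))


handler = TextAggregator()
xml.sax.parseString(b"<doc><item>AT&amp;T rocks</item></doc>", handler)
print(handler.results)
```

Each element's text is read from results after its end tag has been seen, 
never from inside characters() itself.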

It would be very useful for the parser systems to provide some kind of support 
for text aggregation.  A guarantee that "small contiguous" text items will not 
be chopped might also be helpful.
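
For what it is worth, the low-level xml.parsers.expat interface has a 
buffer_text attribute that goes part of the way.  The sketch below compares 
the pieces delivered with and without it; collect_text is a hypothetical 
helper name.  As far as I can tell this reduces the chopping but is not a 
guarantee: buffered text is still flushed when the buffer fills up and at the 
end of each Parse() call, so data fed in chunks can still be split at chunk 
boundaries.

```python
import xml.parsers.expat


def collect_text(data, buffer_text):
    """Parse data in a single call; return the text pieces as delivered."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = buffer_text           # ask pyexpat to coalesce text
    parser.CharacterDataHandler = pieces.append
    parser.Parse(data, True)                   # True marks the final chunk
    return pieces


sample = b"<item>AT&amp;T rocks</item>"
unbuffered = collect_text(sample, buffer_text=False)
buffered = collect_text(sample, buffer_text=True)
print(unbuffered)   # possibly several pieces
print(buffered)     # one piece for this single-call parse
```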

----------
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43483>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
