[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-19 Thread Larry Trammell
Larry Trammell added the comment: Check out issues 43560 (an enhancement issue to improve handling of small XML content chunks) and 43561 (a documentation issue to warn users about the hazard in the interim, before the changes are implemented).

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-18 Thread Eric V. Smith
Eric V. Smith added the comment: I'd add a note to the docs about it, then open a feature request to change the behavior. You could turn this issue into a documentation fix. Unfortunately I don't know if there's a core dev who pays attention to the XML parsers. But I can probably find out.

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-18 Thread Larry Trammell
Larry Trammell added the comment: Eric, now that you know as much as I do about the nature and scope of the peculiar parsing behavior, do you have any suggestions about how to proceed from here?

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: If there were a decision NOT TO FIX... maybe then it would make sense to consider documentation patches at a higher priority. That way, SAX-Python (and expat-Python) tutorials across the Web could start patching their presentations accordingly.

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: I think the existing ContentHandler.characters(content) documentation DOES say that the text can come back in chunks... but it is subtle. It might be possible to say more explicitly that any content, no matter how small, is allowed to be returned as any number of chunks [...]
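A minimal sketch of the buffering pattern such documentation could point users to — accumulate inside characters() and treat the text as complete only at endElement(). (The handler and element names here are illustrative, not taken from the thread.)

```python
import xml.sax

class CellHandler(xml.sax.ContentHandler):
    """Buffers characters() chunks; text is only complete at endElement()."""
    def __init__(self):
        super().__init__()
        self._buf = []
        self.cells = []

    def startElement(self, name, attrs):
        self._buf = []                 # start a fresh buffer for each element

    def characters(self, content):
        self._buf.append(content)      # may be called any number of times

    def endElement(self, name):
        if name == "td":
            self.cells.append("".join(self._buf))

handler = CellHandler()
xml.sax.parseString(b"<tr><td>alpha</td><td>beta</td></tr>", handler)
print(handler.cells)  # ['alpha', 'beta']
```

The key point is that no code outside endElement() ever assumes a characters() call carries a whole text node.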

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Oh, and whether this affects only content text... I would presume so, but I don't know how to tell for sure. Unspecified behaviors can be very mysterious!

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Eric V. Smith
Eric V. Smith added the comment: I think that's good text, once the enhancement is made. But for existing versions of Python, shouldn't we just document that the text might come back in chunks? I don't have a feel for what the limit should be.

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Great minds think alike I guess... I was thinking of a much smaller carryover size... maybe 1K. With individual text blocks longer than that, the user will almost certainly be dealing with collecting and aggregating content text anyway, and in that case, the [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Eric V. Smith
Eric V. Smith added the comment: Thanks, that's very helpful. Does this only affect content text? This should definitely be documented. As far as changing it, I think the best thing to do is say that if the content text is less than some size (I don't know, maybe 1MB?) that it's guaranteed to [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Sure... I'll cut and paste some of the text I was organizing to go into a possible new issue page. The only relevant documentation I could find was in the "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (as it has been [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Eric V. Smith
Eric V. Smith added the comment: Could you give an example (using a list of callbacks and values or something) that shows how it's behaving that you think is problematic? That's the part I'm not understanding. This doesn't have to be a real example, just show what the user is getting that's [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-17 Thread Larry Trammell
Larry Trammell added the comment: Assuming that my understanding is completely correct, the situation is that the XML parser has an unspecified behavior. This is true in any text content handler, at any time, and applies to the expat parser as well as SAX. In some rare cases, the behavior of [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-15 Thread Eric V. Smith
Eric V. Smith added the comment: I think we could document where a "quoted string of length 8 characters would be returned in multiple pieces" occurs. Which API is that? If we change that, and if we call it an enhancement instead of a bug fix, then it can't be backported. It would be worth d[...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-15 Thread Larry Trammell
Larry Trammell added the comment: I can't find any real errors in documentation. There are subtle design and implementation decisions that result in unexpected rare side effects. After processing hundreds of thousands of lines one way, why would the parser suddenly decide to process the next [...]

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-15 Thread Eric V. Smith
Eric V. Smith added the comment: Perhaps you could open a documentation bug? I think specific examples of where the documentation is wrong, and how it could be improved, would be helpful. Thanks! -- nosy: +eric.smith

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-13 Thread Larry Trammell
Larry Trammell added the comment: Not a bug, strictly speaking... more like user abuse. The parsers (expat as well as SAX) must be able to return content text as a sequence of pieces when necessary. For example, as a text sequence interrupted by grouping or styling tags. Or, ext[...]
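As a toy reconstruction of the point above (the inline tag names in the original comment were lost to the tracker's HTML rendering, so `<b>` here is only illustrative): text interrupted by an inline element reaches characters() as separate pieces even in a tiny document.

```python
import xml.sax

class ChunkLogger(xml.sax.ContentHandler):
    """Records every characters() callback as a separate chunk."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

logger = ChunkLogger()
xml.sax.parseString(b"<p>one <b>two</b> three</p>", logger)
print(logger.chunks)  # the text of <p> arrives in at least three pieces
```

A handler that assumes one characters() call per text node silently drops all but the last piece, which is exactly the "user abuse" failure mode described in this issue.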

[issue43483] Loss of content in simple (but oversize) SAX parsing

2021-03-12 Thread Larry Trammell
New submission from Larry Trammell: == The Problem == I have observed a "loss of data" problem using the Python SAX parser, when processing an oversize but very simple machine-generated XHTML file. The file represents a single N x 11 data table. W3C "tidy" reports no XML errors. The table [...]