New submission from Larry Trammell <ridge...@nwi.net>:

== The Problem ==

I have observed a "loss of data" problem using the Python SAX parser, when 
processing an oversize but very simple machine-generated XHTML file. The file 
represents a single N x 11 data table.  W3C "tidy" reports no XML errors.  The 
table is constructed in an entirely plausible manner, using table, tr, and td 
tags to define the table structure, and p tags to bracket content, which 
consists of small chunks of quoted text.  There is nothing pathological, no 
extraneous whitespace characters, no empty data fields. 

Everything works perfectly in small test cases.  But when a very large number 
of rows is present, a few characters are occasionally lost from content 
strings.  I have observed losses of 2 or 6 characters.  But here's the strange 
part.  The pathological behavior disappears (or moves to another location) when 
one or more non-significant whitespace characters are inserted at an arbitrary 
location early in the file, e.g. an extra linefeed before the first tr tag. 

== Context ==

I have observed identical behavior on desktop systems using an Intel Xeon 
E5-1607 or a Core-2 processor, running 32-bit or 64-bit Linux operating 
systems, variously using Python 3.8.5, 3.8, 3.7.3, and 3.5.1.

== Observing the Problem == 

Sorry that the test data is so bulky (even at 0.5% of original size), but bulk 
appears to be a necessary condition to observe the problem. Run the following 
command line.  

python3  EnchXMLTest.py  EnchTestData.html 

The test script invokes the SAX parser and generates messages on stdout. Using 
the original test data as provided, the test should run correctly to 
completion.  Now modify the test data file, deleting the extraneous comment 
line (there is only one) found near the top of the file.  Repeat the test run, 
and this time look for missing content characters in parsed content fields of 
the last record.  
 
== Any guesses? ==

Beyond "user is oblivious," possibly something abnormal can occur at seams 
between large blocks of buffered text.  The presence or absence of an extra 
character early in the data stream results in a corresponding shift in content 
location at the end of the buffer.  Other clues: is it relevant that the 
problem appears in a string field that contains slash characters?
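
One way to probe the buffer-seam guess, independent of the attached test 
script: the xml.sax documentation notes that a parser may deliver one run of 
contiguous character data to characters() in several chunks.  A handler that 
keeps only the most recent chunk would then appear to lose text wherever a 
chunk boundary falls.  The sketch below is an illustration under assumptions, 
not the code from EnchSAXTest.zip: CellCollector, parse_in_pieces, and SAMPLE 
are hypothetical names, and feeding the parser in small pieces merely stands 
in for seams between large blocks of buffered text.

```python
import xml.sax


class CellCollector(xml.sax.ContentHandler):
    """Collect the complete text of every <p> element (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished text of each <p> element, in order
        self._parts = None   # chunks of the <p> currently open, else None

    def startElement(self, name, attrs):
        if name == "p":
            self._parts = []

    def characters(self, content):
        # May be called more than once for a single run of text,
        # so every chunk is kept rather than overwriting the last one.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == "p":
            self.cells.append("".join(self._parts))
            self._parts = None


def parse_in_pieces(data, piece_size):
    """Feed data to the parser piece_size characters at a time."""
    handler = CellCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for start in range(0, len(data), piece_size):
        parser.feed(data[start:start + piece_size])
    parser.close()
    return handler.cells


# Assumed miniature of the table layout described above (one row, two cells).
SAMPLE = (
    '<table><tr><td><p>"alpha/beta"</p></td>'
    '<td><p>"gamma"</p></td></tr></table>'
)

if __name__ == "__main__":
    for size in (1, 7, len(SAMPLE)):
        print(size, parse_in_pieces(SAMPLE, size))
```

If the handler in the failing test stores each characters() call directly 
instead of accumulating, that might account for the lost characters and for 
their moving when the input shifts by a byte.  This has not been checked 
against the attached script.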

----------
components: XML
files: EnchSAXTest.zip
messages: 388582
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Loss of content in simple (but oversize) SAX parsing
type: behavior
versions: Python 3.7, Python 3.8
Added file: https://bugs.python.org/file49872/EnchSAXTest.zip

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43483>
_______________________________________