On Wed, Jan 23, 2019 at 12:55 PM Nick Wellnhofer <wellnho...@aevum.de>
wrote:

> The commit obviously also affected documents that didn't need encoding
> conversion. I didn't realize that.


Aha! I noticed that the Chromium link you sent mentions a >32KB string
which gets converted to a >64KB string, which sounded suspiciously similar.
It looks like lxml's feed() function [1] is doing the same thing. I don't know
much about Python's C API, but [2] and [3] suggest that lxml is using a
deprecated macro and giving libxml2 a multibyte buffer even though the
input would fit into pure ASCII. That would explain why it behaved
differently from xmllint.

[1] https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi#L1242
[2]
https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python
[3] https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_DATA
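
To illustrate the size doubling mentioned above, here is a minimal sketch in
plain Python. It assumes a 2-byte-per-character buffer and uses UTF-16-LE as a
stand-in for it; I have not confirmed which width lxml actually hands to
libxml2 on a given build, so treat the encoding choice as an assumption.

```python
# Assumption: UTF-16-LE stands in for a 2-byte-per-character buffer.
# The exact in-memory width used by lxml is not confirmed here.
text = "a" * (32 * 1024 + 1)  # just over 32 KB of pure ASCII

narrow = text.encode("ascii")      # one byte per character
wide = text.encode("utf-16-le")    # two bytes per character, no BOM

print(len(narrow))  # 32769
print(len(wide))    # 65538
```

So an input that is slightly over 32KB as ASCII ends up slightly over 64KB
once every character takes two bytes, which matches the sizes in the Chromium
report.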

I also noticed that feed() is doing something special with the first 4
bytes, giving them to _htmlCtxtResetPush() instead of htmlParseChunk(). So
the discussion about buffer boundaries might be slightly incorrect.
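
A rough sketch of why that matters for the boundary discussion, again in plain
Python. The helper name split_for_push_parser is hypothetical and is not lxml
code; it only mimics "first 4 bytes go one way, the rest go another way", and
it again assumes a 2-byte-per-character buffer.

```python
def split_for_push_parser(data: bytes, head_len: int = 4):
    """Hypothetical helper: split a buffer into the part that would go to
    the context reset and the part that would go to the chunk parser."""
    return data[:head_len], data[head_len:]


# Assumption: 2 bytes per character, modelled with UTF-16-LE.
wide = ("<p>" + "a" * 10).encode("utf-16-le")
head, rest = split_for_push_parser(wide)

print(head.decode("utf-16-le"))  # <p
print(len(wide), len(rest))      # 26 22
```

With two bytes per character, the first 4 bytes cover only two characters, and
everything the chunk parser sees is shifted by 4 bytes relative to the
original buffer. Any offsets computed from the full input would be off by that
amount.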

> At least we know that the issue is isolated
> to 2.9.8. Thanks for your efforts!
>

Yes, thank you. Now it's clear that my immediate issue is solved and
version 2.9.9 works. So I probably won't look into this much further.

I guess it's up to you to decide what to do next, and if any libxml2
changes are needed. It would be good to add some tests to decrease the
likelihood that this issue or something similar happens again. For that,
you might still need to isolate the root cause further, and create a pure C
test case. (Maybe based on a test case from chromium instead of mine.) But
of course it's up to you to determine the priority of that. Thanks again
for your help, and good luck if you decide to continue.

Tomi
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
