Re: Regression with large XML data input

Michael Paquier Thu, 24 Jul 2025 17:08:37 -0700

On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> On 24.07.25 21:23, Tom Lane wrote:
>> However, when testing on RHEL8 with libxml2 2.9.7, indeed
>> I get "Huge input lookup" with our current code but no
>> failure with f68d6aabb7e2^.
>>
>> The way I interpret these results is that in older libxml2 versions,
>> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
>> that does exist in newer versions.  So even if we were to do some kind
>> of reversion, it would only prevent the error in libxml2 versions that
>> lack that check.  And in those versions we'd probably be exposing
>> ourselves to resource-exhaustion problems.


Linux distributions do not seem very eager to add this check, though?
The top of Debian sid uses a version of libxml2 where the difference
shows up, so it means that we have years ahead with the old code.

If we were discussing things from the perspective where this new code
was added after a major version bump of Postgres, I would not argue
much about that: breakages happen every year and users adapt their
applications to it.  Here, however, we are talking about a change in a
stable branch, across a minor version, which should be a bit more
flawless from a user perspective?  I may be influenced by the point of
seeing a customer impacted by that, of course, there is no denying
that.  The point is that this enforces one behavior that's part of
2.13 and onwards.  Versions of PG before f68d6aabb7e2 were still OK
with that and the new code of Postgres closes the door completely.
Even if the behavior that Postgres had when linking with libxml2 2.12
or older was kind of "accidental" because
xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
it was there, and users relied on that.

One possibility that I could see here for stable branches would be to
make the code a bit smarter depending on LIBXML_VERSION, where we
could keep the new code for 2.13 onwards, but keep a compatible
behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().
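
To make the idea above concrete, here is a rough sketch of what such a
version gate could look like.  It is not a patch: the macro
XML_NEW_PARSER_MIN_VERSION and both function names are made up for
illustration, and the fallback definition of LIBXML_VERSION exists only
so that the snippet compiles without the libxml2 headers.  The one fact
it relies on is that libxml2 encodes its version as
major * 10000 + minor * 100 + micro, so 2.13.0 is 21300 and 2.9.7 is
20907.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in only: a real build would get LIBXML_VERSION from
 * <libxml/xmlversion.h>.  20907 corresponds to libxml2 2.9.7.
 */
#ifndef LIBXML_VERSION
#define LIBXML_VERSION 20907
#endif

/* Hypothetical name: first libxml2 version that keeps the new code path. */
#define XML_NEW_PARSER_MIN_VERSION 21300	/* 2.13.0 */

/*
 * Run-time form of the check, mostly useful for testing the threshold:
 * true if the given libxml2 version should keep the older behavior
 * based on xmlParseBalancedChunkMemory().
 */
bool
use_legacy_chunk_parser(int libxml_version)
{
	assert(libxml_version > 0);
	return libxml_version < XML_NEW_PARSER_MIN_VERSION;
}

/*
 * Compile-time form, which is how a stable-branch change would likely
 * select the code path.  Returns a label instead of calling libxml2 so
 * that the sketch stays self-contained.
 */
const char *
content_parser_path(void)
{
#if LIBXML_VERSION >= XML_NEW_PARSER_MIN_VERSION
	return "code path added by f68d6aabb7e2";
#else
	return "xmlParseBalancedChunkMemory()";
#endif
}
```

With a gate of this shape, builds against 2.12 and older would keep
what they did before f68d6aabb7e2, and builds against 2.13 and newer
would keep the new code.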

>> On the whole I'm thinking more and more that we don't want to
>> touch this.  Our recommendation for processing multi-megabyte
>> chunks of XML should be "don't".  Unless we want to find or
>> write a replacement for libxml2 ... which we have discussed,
>> but so far nothing's happened.
> 
> I also believe that addressing this limitation may not be worth the
> associated risks. Moreover, a 10MB text node is rather large and
> probably exceeds the needs of most users.

Yeah, still some people use it, so while I am OK to accept this as a
conclusion at the end and send back to this thread that workarounds
are required in applications to split the inputs, that was really
surprising.  (Aka from the point of view of the customer whose
application suddenly fails after what should have been a "simple"
minor update.)
--
Michael
