On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> On 24.07.25 21:23, Tom Lane wrote:
>> However, when testing on RHEL8 with libxml2 2.9.7, indeed
>> I get "Huge input lookup" with our current code but no
>> failure with f68d6aabb7e2^.
>>
>> The way I interpret these results is that in older libxml2 versions,
>> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
>> that does exist in newer versions. So even if we were to do some kind
>> of reversion, it would only prevent the error in libxml2 versions that
>> lack that check. And in those versions we'd probably be exposing
>> ourselves to resource-exhaustion problems.

Linux distributions may not be very eager to add this check, though?
The latest Debian sid uses a version of libxml2 where the difference
shows up, which means that we have years ahead of us with the old code.

If we were discussing a case where this new code had been added in a
new major version of Postgres, I would not argue much about that:
breakages happen every year and users adapt their applications to
them. Here, however, we are talking about a change in a stable
branch, across a minor version, which should be smoother from a
user's perspective? I may be influenced by having seen a customer
impacted by this, of course; there is no denying that.

The point is that this enforces a behavior that is part of libxml2
2.13 and onwards. Versions of PG before f68d6aabb7e2 were still OK
with these inputs, and the new Postgres code closes the door
completely. Even if the behavior that Postgres had when linking with
libxml2 2.12 or older was kind of "accidental" because
xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
it was there, and users relied on it.

One possibility that I could see here for stable branches would be to
make the code a bit smarter depending on LIBXML_VERSION: we could keep
the new code for 2.13 onwards, but keep a compatible behavior with 2.12
and older, based on xmlParseBalancedChunkMemory().

>> On the whole I'm thinking more and more that we don't want to
>> touch this. Our recommendation for processing multi-megabyte
>> chunks of XML should be "don't". Unless we want to find or
>> write a replacement for libxml2 ... which we have discussed,
>> but so far nothing's happened.
>
> I also believe that addressing this limitation may not be worth the
> associated risks. Moreover, a 10MB text node is rather large and
> probably exceeds the needs of most users.

Yeah, still some people use it, so while I am OK to accept this as a
conclusion in the end and to point people back to this thread to say
that applications need workarounds that split the inputs, that was
really surprising. (That is, from the point of view of the customer
whose application suddenly fails after what should have been a
"simple" minor update.)
--
Michael
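
P.S. To make the LIBXML_VERSION idea above a bit more concrete, here is a rough, untested sketch of what such a gate could look like. The enum and function names are made up for illustration and do not come from the Postgres tree, and the two branches are only labels: the sketch does not call libxml2 at all. The only facts assumed are that LIBXML_VERSION is an integer macro from <libxml/xmlversion.h> encoded as major * 10000 + minor * 100 + micro, so 2.9.7 is 20907 and 2.13.0 is 21300.

```c
#include <assert.h>

/*
 * In a real build LIBXML_VERSION comes from <libxml/xmlversion.h>.
 * The fallback below exists only so that this sketch compiles without
 * libxml2 installed; 20907 corresponds to libxml2 2.9.7.
 */
#ifndef LIBXML_VERSION
#define LIBXML_VERSION 20907
#endif

/* First libxml2 version for which the new code would be kept (2.13.0). */
#define NEW_CODE_MIN_LIBXML_VERSION 21300

/* Hypothetical labels for the two code paths discussed in this thread. */
typedef enum
{
	PARSE_PATH_BALANCED_CHUNK,	/* pre-f68d6aabb7e2 behavior */
	PARSE_PATH_NEW_CODE			/* behavior after f68d6aabb7e2 */
} ParsePath;

/* Run-time form of the decision, handy for checking the threshold. */
static ParsePath
choose_parse_path(int libxml_version)
{
	assert(libxml_version > 0);

	if (libxml_version >= NEW_CODE_MIN_LIBXML_VERSION)
		return PARSE_PATH_NEW_CODE;
	return PARSE_PATH_BALANCED_CHUNK;
}

/* Compile-time form, which is what a stable-branch patch would likely use. */
static ParsePath
compiled_parse_path(void)
{
#if LIBXML_VERSION >= NEW_CODE_MIN_LIBXML_VERSION
	return PARSE_PATH_NEW_CODE;
#else
	return PARSE_PATH_BALANCED_CHUNK;
#endif
}
```

One caveat with this approach: LIBXML_VERSION reflects the headers used at compile time, so a binary built against 2.12 and later run against a newer shared library would still take the old path.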