On Wed, 18 Dec 2024 at 21:19, Jonathan Wakely <jwak...@redhat.com> wrote:
>
> std::regex builds a cache of equivalence classes by calling
> std::regex_traits<char>::transform_primary(c) for every char, which then
> calls std::collate<char>::transform which calls strxfrm. On several
> targets strxfrm fails for non-ASCII characters. Because strxfrm has no
> return value reserved to indicate an error, some implementations return
> INT_MAX or SIZE_MAX. This causes std::collate::transform to try to
> allocate a huge buffer, which is either very slow or throws
> std::bad_alloc. We should check errno after calling strxfrm to detect
> errors and then throw a more appropriate exception instead of trying to
> allocate a huge buffer.
>
> Unfortunately the std::collate<C>::_M_transform function has a
> non-throwing exception specifier, so we can't do the error handling
> there.
>
> As well as checking errno, this patch changes std::collate::do_transform
> to use __builtin_alloca for small inputs, and to use RAII to deallocate
> the buffers used for large inputs.
>
> This change isn't sufficient to fix the three std::regex bugs caused by
> the lack of error handling in std::collate::do_transform, we also need
> to make std::regex_traits::transform_primary handle exceptions. This
> change also attempts to make transform_primary closer to the effects
> described in the standard, by not even attempting to use std::collate
> if the locale's std::collate facet has been replaced (see PR 118105).
>
> Arguably, we should not even try to call transform_primary for any char
> values over 127, since they're never valid in locales that use UTF-8 or
> 7-bit ASCII, and probably for other charsets too. Handling 128
> exceptions for every std::regex compilation is very inefficient, but at
> least it now works instead of failing with std::bad_alloc, and no longer
> allocates 128 x 2GB. Maybe for C++26 we could check the locale's
> std::text_encoding and use that to decide whether to cache equivalence
> classes for char values over 127.
>
> I'm unsure if std::regex_traits<C>::transform_primary is supposed to
> convert the string to lower case or not. The general regex traits
> requirements ([re.req] p20) do say "when character case is not
> considered" but the specification for the std::regex_traits<char> and
> std::regex_traits<wchar_t> specializations ([re.traits] p7) don't say
> anything about that.
>
> libstdc++-v3/ChangeLog:
>
> 	PR libstdc++/85824
> 	PR libstdc++/94409
> 	PR libstdc++/98723
> 	PR libstdc++/118105
> 	* include/bits/locale_classes.tcc (collate::do_transform): Check
> 	errno after calling _M_transform. Use RAII type to manage the
> 	buffer and to restore errno.
> 	* include/bits/regex.h (regex_traits::transform_primary): Handle
> 	exceptions from std::collate::transform and do not try to use
> 	std::collate for user-defined facets.
> ---
>
> Tested x86_64-linux.

Pushed to trunk now. Might be backported later.
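
For anyone following along without the diff to hand, the errno check described in the commit message can be sketched roughly as below. This is not the code from the patch. The names errno_guard and checked_strxfrm are hypothetical, the choice of std::runtime_error is an assumption, and the sketch only transforms the input up to its first NUL character.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical RAII helper (not a libstdc++ type): clears errno so that a
// failure can be detected, then restores the caller's value on scope exit.
struct errno_guard
{
  int saved;
  errno_guard() noexcept : saved(errno) { errno = 0; }
  ~errno_guard() { errno = saved; }
  errno_guard(const errno_guard&) = delete;
  errno_guard& operator=(const errno_guard&) = delete;
};

// Hypothetical wrapper: transform `in` using the current C locale, and
// throw instead of trusting the returned length if strxfrm reported an
// error through errno.
std::string
checked_strxfrm(const std::string& in)
{
  errno_guard guard;

  // With n == 0 the destination may be null; the result is the length
  // that the transformed string would need (excluding the terminator).
  std::size_t need = std::strxfrm(nullptr, in.c_str(), 0);
  if (errno != 0)
    throw std::runtime_error("strxfrm failed"); // assumed exception type

  std::string out;
  if (need >= out.max_size())
    throw std::length_error("strxfrm returned an implausible length");

  out.resize(need + 1); // room for the terminating NUL
  std::strxfrm(&out[0], in.c_str(), need + 1);
  if (errno != 0)
    throw std::runtime_error("strxfrm failed");

  out.resize(need); // drop the terminator
  return out;
}
```

A caller such as a regex traits class could wrap the call in try/catch and treat a failure as "no usable sort key" for that character. Whether that matches what the patch does on failure is not something this sketch claims.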