On Wed, 18 Dec 2024 at 21:19, Jonathan Wakely <jwak...@redhat.com> wrote:
>
> std::regex builds a cache of equivalence classes by calling
> std::regex_traits<char>::transform_primary(c) for every char, which then
> calls std::collate<char>::transform which calls strxfrm. On several
> targets strxfrm fails for non-ASCII characters. Because strxfrm has no
> return value reserved to indicate an error, some implementations return
> INT_MAX or SIZE_MAX. This causes std::collate::transform to try to
> allocate a huge buffer, which is either very slow or throws
> std::bad_alloc. We should check errno after calling strxfrm to detect
> errors and then throw a more appropriate exception instead of trying to
> allocate a huge buffer.
>
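The errno check described above can be sketched roughly as follows. This is not the code from the patch: checked_strxfrm is a hypothetical helper, and the choice of std::runtime_error is an assumption made for illustration. It follows the POSIX advice to set errno to 0 before calling strxfrm and to inspect it afterwards, so a bogus huge length is never used to size a buffer.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of libstdc++): transform a NUL-terminated
// string with strxfrm, detecting failure through errno rather than
// trusting the returned length.
std::string checked_strxfrm(const char* s)
{
    const int saved = errno;          // preserve the caller's errno
    errno = 0;
    // With n == 0 the destination may be null; this only queries the length.
    const std::size_t need = std::strxfrm(nullptr, s, 0);
    int err = errno;

    std::string out;
    if (err == 0) {
        out.resize(need + 1);         // room for the terminating NUL
        errno = 0;
        const std::size_t got = std::strxfrm(&out[0], s, out.size());
        err = errno;
        if (err == 0 && got > need)
            err = ERANGE;             // result did not fit; treat as failure
        out.resize(err == 0 ? got : 0);
    }

    errno = saved;
    if (err != 0)
        throw std::runtime_error("strxfrm failed"); // assumed exception type
    return out;
}
```

The point of the sketch is only the ordering: errno is examined before the returned length is used for any allocation.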
> Unfortunately the std::collate<C>::_M_transform function has a
> non-throwing exception specifier, so we can't do the error handling
> there.
>
> As well as checking errno, this patch changes std::collate::do_transform
> to use __builtin_alloca for small inputs, and to use RAII to deallocate
> the buffers used for large inputs.
>
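A minimal sketch of the RAII idea, under stated assumptions: ErrnoRestorer and transform_with_raii are hypothetical names, a fixed-size stack array stands in for __builtin_alloca, and std::unique_ptr stands in for whatever RAII type the patch actually uses. It also ignores embedded NUL characters, which a real do_transform has to handle.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical guard: clears errno on entry and restores the caller's
// value on every exit path, including exceptions.
struct ErrnoRestorer {
    int saved;
    ErrnoRestorer() : saved(errno) { errno = 0; }
    ~ErrnoRestorer() { errno = saved; }
    ErrnoRestorer(const ErrnoRestorer&) = delete;
    ErrnoRestorer& operator=(const ErrnoRestorer&) = delete;
};

std::string transform_with_raii(const std::string& in)
{
    ErrnoRestorer guard;
    char small[256];                  // stand-in for the alloca'd buffer
    std::unique_ptr<char[]> large;    // owns the heap buffer, if one is needed
    char* buf = small;
    std::size_t cap = sizeof small;

    std::size_t len = std::strxfrm(buf, in.c_str(), cap);
    if (errno != 0)
        throw std::runtime_error("strxfrm failed");
    if (len >= cap) {                 // too small: retry with the exact size
        cap = len + 1;
        large.reset(new char[cap]);
        buf = large.get();
        len = std::strxfrm(buf, in.c_str(), cap);
        if (errno != 0 || len >= cap)
            throw std::runtime_error("strxfrm failed");
    }
    return std::string(buf, len);     // copied out before the buffers die
}
```

Because both the buffer and the saved errno are owned by objects with destructors, no exit path needs explicit cleanup.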
> This change isn't sufficient to fix the three std::regex bugs caused by
> the lack of error handling in std::collate::do_transform; we also need
> to make std::regex_traits::transform_primary handle exceptions. This
> change also attempts to make transform_primary closer to the effects
> described in the standard, by not even attempting to use std::collate
> if the locale's std::collate facet has been replaced (see PR 118105).
>
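To illustrate the transform_primary behaviour described above, here is a rough sketch. It is a free function rather than the real member, the names primary_key_or_empty and ReplacedCollate are hypothetical, and returning an empty string on failure is my reading of the "otherwise returns an empty string" wording in [re.traits]. It needs RTTI for typeid.

```cpp
#include <exception>
#include <locale>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for a user-defined facet replacing std::collate.
struct ReplacedCollate : std::collate<char> { };

// Hypothetical free function modelled on the description above.
std::string primary_key_or_empty(const std::locale& loc, const std::string& s)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);

    // If the facet's dynamic type is not one of the standard ones, the
    // format of its sort keys is unknown, so do not call it at all.
    if (typeid(coll) != typeid(std::collate<char>)
        && typeid(coll) != typeid(std::collate_byname<char>))
        return std::string();

    try {
        return coll.transform(s.data(), s.data() + s.size());
    } catch (const std::exception&) {
        return std::string();         // transform failed: no primary key
    }
}
```

For example, with `std::locale loc(std::locale::classic(), new ReplacedCollate);` the function returns an empty string without ever calling the replaced facet.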
> Arguably, we should not even try to call transform_primary for any char
> values over 127, since they're never valid in locales that use UTF-8 or
> 7-bit ASCII, and probably for other charsets too. Handling 128
> exceptions for every std::regex compilation is very inefficient, but at
> least it now works instead of failing with std::bad_alloc, and no longer
> allocates 128 x 2GB. Maybe for C++26 we could check the locale's
> std::text_encoding and use that to decide whether to cache equivalence
> classes for char values over 127.
>
> I'm unsure if std::regex_traits<C>::transform_primary is supposed to
> convert the string to lower case or not.  The general regex traits
> requirements ([re.req] p20) do say "when character case is not
> considered" but the specification for the std::regex_traits<char> and
> std::regex_traits<wchar_t> specializations ([re.traits] p7) doesn't say
> anything about that.
>
> libstdc++-v3/ChangeLog:
>
>         PR libstdc++/85824
>         PR libstdc++/94409
>         PR libstdc++/98723
>         PR libstdc++/118105
>         * include/bits/locale_classes.tcc (collate::do_transform): Check
>         errno after calling _M_transform. Use RAII type to manage the
>         buffer and to restore errno.
>         * include/bits/regex.h (regex_traits::transform_primary): Handle
>         exceptions from std::collate::transform and do not try to use
>         std::collate for user-defined facets.
> ---
>
> Tested x86_64-linux.

Pushed to trunk now. Might be backported later.
