[llvm-bugs] [Bug 131516] [libc++] `<regex>`: Character class `[\W\D]` fails to match alphabetic characters

LLVM Bugs via llvm-bugs Sun, 16 Mar 2025 06:43:23 -0700

Issue	131516
Summary	[libc++] `<regex>`: Character class `[\W\D]` fails to match alphabetic characters
Labels	libc++
Assignees
Reporter	muellerj2

The (ECMAScript) regular expression `[\W\D]` describes a character class that matches the union of (a) all non-alphanumeric characters and (b) all non-digits. Since every digit is alphanumeric, (a) is a subset of (b), so the character class should effectively be equivalent to `[\D]` and thus match all non-digits. However, libc++'s regex implementation only matches non-alphanumeric characters.


Test case:
```
#include <iostream>
#include <regex>

using namespace std;

int main()
{
    regex re(R"([\W\D])");
    cout << "matches alphabetic: " << regex_match("a", re) << '\n'
         << "matches digit: " << regex_match("0", re) << '\n'
         << "matches non-alphanumeric: " << regex_match(".", re);

    return 0;
}
```
https://godbolt.org/z/YdvY4Pb6a

This prints:
```
matches alphabetic: 0
matches digit: 0
matches non-alphanumeric: 1
```

But it should print (as MSVC STL and libstdc++ do here):
```
matches alphabetic: 1
matches digit: 0
matches non-alphanumeric: 1
```

The problem lies here:
https://github.com/llvm/llvm-project/blob/215c0d2b651dc757378209a3edaff1a130338dd8/libcxx/include/regex#L2139-L2141

The masks of the negated character classes are bitwise or'ed, which amounts to `not (w or d)`. But De Morgan's law says that `(not w) or (not d) = not (w and d)`, so the bit masks would really have to be bitwise and'ed.
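
To make the difference concrete, here is a minimal sketch (not libc++'s actual code) that evaluates both readings against `std::regex_traits<char>`, using only the standard members `lookup_classname` and `isctype`. The helper names `class_for`, `matches_ored_negation` and `matches_union_of_negations` are hypothetical and exist only for this illustration.

```cpp
#include <regex>

// Hypothetical helpers for illustration; only std::regex_traits and its
// members lookup_classname/isctype are real library API.
using Traits = std::regex_traits<char>;
using Mask = Traits::char_class_type;

// Look up the class behind a one-letter escape such as \w ('w') or \d ('d').
inline Mask class_for(const Traits& t, char name) {
    return t.lookup_classname(&name, &name + 1);
}

// What or'ing the negated classes amounts to: "not (w or d)".
inline bool matches_ored_negation(const Traits& t, char c) {
    Mask combined = class_for(t, 'w') | class_for(t, 'd');
    return !t.isctype(c, combined);
}

// The intended union of complements: "(not w) or (not d)".
// Each class is tested on its own, so no masks are and'ed.
inline bool matches_union_of_negations(const Traits& t, char c) {
    return !t.isctype(c, class_for(t, 'w')) ||
           !t.isctype(c, class_for(t, 'd'));
}
```

With a default-constructed `std::regex_traits<char>` in the "C" locale, `matches_ored_negation(t, 'a')` is false while `matches_union_of_negations(t, 'a')` is true; both reject `'0'` and both accept `'.'`. Testing each negated class separately is only one conceivable way to get the union, and it is not necessarily the fix libc++ would adopt.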

But bitwise and'ing is problematic as well: the standard only guarantees that bitwise or'ing works and doesn't state that bitwise and'ing corresponds to the intersection of the character classes (see [\[re.grammar/9\]](https://eel.is/c++draft/re.grammar#9)). Maybe and'ing will still work for libc++'s `std::regex_traits<char>` and `std::regex_traits<wchar_t>` traits classes (although I haven't checked that), but it might not do the right thing for some user-provided traits classes.
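
As an illustration of how and'ing could go wrong, here is a minimal sketch of a toy traits-like type. `ToyTraits` and `matches_with_anded_masks` are hypothetical names, and the encoding (one independent bit per class) is an assumption chosen for the example, not something taken from libc++ or any real traits class. Or'ing its masks behaves as expected, but and'ing the `\w` and `\d` masks yields an empty mask instead of "digits".

```cpp
#include <cctype>

// Hypothetical toy traits: every character class gets its own independent bit.
struct ToyTraits {
    using char_class_type = unsigned;
    static constexpr char_class_type word = 1u << 0;   // stands in for \w
    static constexpr char_class_type digit = 1u << 1;  // stands in for \d

    // True if c belongs to at least one class named in mask,
    // so or'ing masks gives the union of the classes.
    static bool isctype(char c, char_class_type mask) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if ((mask & word) && (std::isalnum(uc) || c == '_')) return true;
        if ((mask & digit) && std::isdigit(uc)) return true;
        return false;
    }
};

// "not (w and d)" computed by and'ing the two masks.
// Here word & digit == 0, so isctype is false for every character
// and the negation matches everything, digits included.
inline bool matches_with_anded_masks(char c) {
    return !ToyTraits::isctype(c, ToyTraits::word & ToyTraits::digit);
}
```

With this encoding, `matches_with_anded_masks('0')` returns true, although `[\W\D]` should not match a digit.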

_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
