"Joel Jacobson" <j...@compiler.org> writes:
> On Tue, Feb 23, 2021, at 18:15, Tom Lane wrote:
>> Perl and Javascript believe that \W and \D should match newlines
>> regardless of their 's' flag, so there's a case for changing
>> \W and \D to match newline regardless of our 'n' flag.  0002
>> attached is the quite trivial patch to do this.  I'm not quite
>> 100% convinced whether this is a good change to make, but if we're
>> going to do it now would be the time.
> [ extensive analysis ]

> My opinion is therefore we should change \W to include newlines.

Wow, thanks for doing all that work!  But OTOH, looking at a corpus
taken from Javascript practice seems like it'd inevitably lead to
that conclusion, since that is what \W does in Javascript.  Whether
the regex authors knew the exact rules or not (and I share your
suspicions that some of them didn't), if they'd done any testing
they'd have been led to write their code that way.

Still, I am not convinced that there's much to justify our current
definition either.  Looking at the existing code shows that the way
\W and \D work now was forced by Spencer's decision to make 'n' mode
affect complemented character classes in general, since they're just
macros for complemented character classes.  With this reimplementation,
that connection isn't there anymore, so we can change it if we like.
Since (AFAICS) the main use of 'n' mode is to make our regexes work
more like these other products, bringing \W and \D into line with
them seems like a reasonable thing to do.

I've also decided after reflection that the patch should indeed
create a named "word" character class.  That's allowed per POSIX,
and it simplifies some aspects of the documentation, since we can
rely on referencing the class instead of repeating ourselves.
The attached 0001 v2 does that; it's otherwise the same as before.

Speaking of documentation, I'm wondering more and more why we're
continuing to carry along re_syntax.n.  We don't expose that to
users in any way, and it has not been maintained nearly as
faithfully as the SGML docs.  (Looking at the git history, I think
I included it in 7bcc6d98f because it replaced re_format.7, which
had been there in that directory since Postgres95.  But that history
is immaterial now that we've got proper user-facing documentation.)
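
For anyone who wants to poke at the "other products" behavior described above, here is a rough sketch. It uses Python's re module, which is not mentioned upthread and is only an assumed stand-in: as far as I know it follows the same rule as Perl and Javascript here, with re.DOTALL playing the role of their 's' flag. The helper name matches_newline is made up for the illustration, and none of this says anything about what our own regex code does before or after 0002.

```python
import re


def matches_newline(pattern: str, flags: int = 0) -> bool:
    """Return True if `pattern` matches a string consisting of one newline."""
    return re.fullmatch(pattern, "\n", flags) is not None


# '.' is the construct the 's'-style flag (re.DOTALL in Python) controls.
# \W and \D are checked with and without the flag for comparison.
results = {
    "dot, no flag": matches_newline(r"."),
    "dot, DOTALL": matches_newline(r".", re.DOTALL),
    r"\W, no flag": matches_newline(r"\W"),
    r"\W, DOTALL": matches_newline(r"\W", re.DOTALL),
    r"\D, no flag": matches_newline(r"\D"),
    r"\D, DOTALL": matches_newline(r"\D", re.DOTALL),
}

for label, matched in results.items():
    print(f"{label:14} matches newline: {matched}")
```

Only the flagless '.' case should report no match; \W and \D match the newline either way, which is the behavior 0002 would carry over to our 'n' flag.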

			regards, tom lane

Attachment: 0001-rework-char-class-escapes-2.patch (text/x-diff)
Attachment: 0002-DW-always-match-newline.patch (text/x-diff)