Jim Meyering wrote:
> Bruno Haible wrote:
>> Paolo,
>>
>>> > [=e=] to match "e" as well as accented versions like é, è and ê).
>>> > That is the one feature that you get with glibc, and that you would
>>> > sacrifice when building --with-included-regex.
>>>
>>> I agree. It's up to distros to choose, of course.
>>
>> If you are on the point of sacrificing a glibc feature in many programs,
>> then IMO you should first talk with the glibc people to see what alternative
>> they can offer.
>
> People who build the tools currently have the choice of using
> --with-included-regex or
> --without-included-regex
>
> Note that putting equivalence classes (and backrefs) aside, the
> interpretation of ranges is done in dfa.c, which means the vast
> majority of range uses never even require use of regexp code.
>
> However, backreferences force these tools to skip the DFA-based
> optimization and resort to running the regexp code. In that case,
> there is a dichotomy. Adding a backreference to a range-including
> regexp would have the surprising consequence of changing how that range
> is interpreted when the tool is built to use glibc's regexp code.
>
> Thus, if we go this route, we are effectively saying
> that people who want self-consistent regex-handling
> in our tools must build with --with-included-regex or end
> up causing subtle problems.
>
> That's a big leap.
> I'm not saying I won't take upstream grep over the edge,
> but I'd like to hear what a few distro-maintainers think.
To clarify...
I like Arnold's proposal to make regex range handling sane and
locale-independent. It goes like this (at least for gawk, grep and sed):

  - change how dfa.c interprets ranges like [a-z]
  - change how gnulib's reg* code handles ranges
  - always use the included regex code (the one from gnulib), so that
    its interpretation is consistent with that of dfa.c

Grep's current upstream default is to build --without-included-regex,
which makes grep use glibc's regex code. To make this proposed change
go through, that configure-time option would have to be eliminated, so
that we always build with the gnulib-provided regex code.

Of course, if glibc ever changes, we can detect that and
automatically prefer it when possible.
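
To make the range dichotomy concrete, here is a rough Python sketch (not
code from dfa.c, gnulib or glibc) contrasting two ways a matcher could
decide whether a character falls inside [a-z]. The collation table is a
made-up ordering (aAbB...zZ) that I am assuming purely for illustration;
it is not any real locale's table, and the function names are hypothetical.

```python
import string

# ASSUMPTION: a hypothetical collation order that interleaves lower and
# upper case (aAbB...zZ). Illustration only; not taken from any real locale.
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: i for i, ch in enumerate(COLLATION)}


def in_range_codepoint(ch, lo, hi):
    """Locale-independent rule: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch, lo, hi):
    """Locale-dependent rule: compare positions in the collation order."""
    if ch not in RANK:
        return False
    return RANK[lo] <= RANK[ch] <= RANK[hi]


def members(pred, lo, hi):
    """Return the ASCII letters that the given rule places inside [lo-hi]."""
    return "".join(c for c in string.ascii_letters if pred(c, lo, hi))


if __name__ == "__main__":
    print("code point:", members(in_range_codepoint, "a", "z"))
    print("collation: ", members(in_range_collation, "a", "z"))
```

Under the code point rule, [a-z] contains only the 26 lower-case letters.
Under the assumed collation order it also picks up A through Y (but not
Z), which is the kind of surprise you get if one code path uses one rule
and another code path uses the other.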