bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

Paul Eggert Mon, 09 Jan 2023 15:13:20 -0800

On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

        /b:
        155781
        (*UCP)/b:
        46035
        /s:
        0
        (*UCP)/s:
        0
        /w:
        142468
        (*UCP)/w:
        9706


So the output still differs, and some of those differences may or may
not be wanted.

I took a look at the output, and by and large I'd want the differences; that is, I'd want the UCP version, which generates less output. This is because several Emacs source files are not UTF-8, and \b has nonsense matches when searching text files encoded via Shift-JIS or Big 5 or whatever. For this sort of thing, the fewer matches the better.
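
To illustrate the kind of nonsense match meant here, the following is a minimal sketch in Python's `re` module, not in grep or PCRE2, so it only approximates what `grep -P` does. It assumes the two-character katakana sample "アイ", whose Shift-JIS encoding happens to contain the bytes for ASCII "A" and "C" as trailing bytes. The names `sample`, `sjis_bytes`, `byte_hits` and `text_hits` are made up for the example.

```python
import re

# Two katakana characters; the text contains no Latin letters at all.
sample = "アイ"

# In Shift-JIS each of these is two bytes, and the second byte falls in
# the ASCII range: 0x83 0x41 ('A') and 0x83 0x43 ('C').
sjis_bytes = sample.encode("shift_jis")

# Byte-oriented matching: 0x83 is not a word byte, so \b sees a word
# boundary on both sides of the stray 'A' and reports a whole word "A".
byte_hits = re.findall(rb"\bA\b", sjis_bytes)

# Character-oriented matching on the correctly decoded text finds nothing,
# because there is no letter "A" in the text.
text_hits = re.findall(r"\bA\b", sjis_bytes.decode("shift_jis"))

print(sjis_bytes, byte_hits, text_hits)
```

The byte-oriented search reports a word "A" that no reader of the file would see, which is why fewer matches on such files is the better outcome.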

If all you're doing is matching either ASCII or Japanese text and you
want "locale-aware numbers" it might do the wrong thing.

I'm not seeing much of a problem here. When searching Japanese text, I would expect \d and [0-9０-９] (using both ASCII and full-width digits) to be equivalent so (assuming UCP) it's not a big deal as to which regex you use, since Japanese text won't contain Bengali (or whatever) digits. And when searching binary data, I'd expect a bunch of garbage no matter how \d is interpreted.
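
The equivalence claimed above can be sketched with Python's `re` module, used here only as a stand-in: its `\d` on `str` patterns matches Unicode decimal digits by default, which is roughly the behavior UCP gives PCRE2, and `re.ASCII` restricts it to [0-9]. The sample strings and the helper `digits_found` are assumptions made up for the example, and nothing here shows how PCRE2 itself behaves.

```python
import re

# Hypothetical sample strings for illustration.
japanese = "価格は１２３円、在庫は45個"   # full-width and ASCII digits
bengali = "০১২৩"                        # Bengali digits

unicode_d = re.compile(r"\d")            # Unicode-aware \d (Python default)
explicit = re.compile(r"[0-9０-９]")      # ASCII plus full-width ranges
ascii_d = re.compile(r"\d", re.ASCII)    # \d limited to [0-9]


def digits_found(pattern, text):
    """Return the digits that pattern matches in text, in order."""
    return pattern.findall(text)


# On Japanese text the two spellings agree.
print(digits_found(unicode_d, japanese) == digits_found(explicit, japanese))

# They differ only on scripts that the explicit class does not list.
print(digits_found(unicode_d, bengali), digits_found(explicit, bengali))

# ASCII-only \d misses the full-width digits.
print(digits_found(ascii_d, japanese))
```

On the Japanese sample the Unicode-aware `\d` and the explicit class return the same digits; they diverge only on text containing other scripts' digits, which is the case argued above to be irrelevant for Japanese text.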

Here I'm assuming [０-９] (using full-width digits) has the expected meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that POSIX made.
