Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Fri, 21 Apr 2023 10:38:05 UTC :

> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-li...@klop.ws> wrote:
> > From: Poul-Henning Kamp <p...@phk.freebsd.dk>
> > Date: Monday, 17 April 2023 23:06
> > To: curr...@freebsd.org
> > Subject: find(1): I18N gone wild ?
> > This surprised me:
> > 
> > # mkdir /tmp/P
> > # cd /tmp/P
> > # touch FOO
> > # touch bar
> > # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > ./bar
> > 
> > Really ?!
> ...
> > My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents 
> > remark.
> 
> Same here. However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> ./BAR
> ./foo
> 
> and only the lowercase file with [[:lower:]]:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> ./foo
> 
> But on FreeBSD, these don't work at all:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> <nothing>
> 
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> <nothing>
> 
> This is an interesting rabbit hole... :)

FreeBSD:

     -name pattern
             True if the last component of the pathname being examined matches
             pattern.  Special shell pattern matching characters (“[”, “]”,
             “*”, and “?”) may be used as part of pattern.  These characters
             may be matched explicitly by escaping them with a backslash
             (“\”).

I conclude that [[:alpha:]] and [[:lower:]] were not
considered "Special shell pattern"s. "man glob"
indicates that glob is a shell-specific builtin.

The macOS man page says much the same. Different shells,
different pattern notations and capabilities? Well,
"man bash" reports:

QUOTE
      Pattern Matching

        . . .
              Within [ and ], character classes can be specified using the
              syntax [:class:], where class is one of the following classes
              defined in the POSIX standard:
              alnum alpha ascii blank cntrl digit graph lower print punct
              space upper word xdigit
              A character class matches any character belonging to that
              class.  The word character class matches letters, digits, and
              the character _.

              Within [ and ], an equivalence class can be specified using the
              syntax [=c=], which matches all characters with the same
              collation weight (as defined by the current locale) as the
              character c.

              Within [ and ], the syntax [.symbol.] matches the collating
              symbol symbol.

END QUOTE

"man zsh" does not document patterns but:

sh-3.2$ echo $SHELL
/bin/zsh
sh-3.2$ find . -name '[[:lower:]]*' -print
./bar

% ls -Tldt /bin/*sh
-r-xr-xr-x  1 root  wheel  1326688 Feb  9 01:39:53 2023 /bin/bash
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/csh
-rwxr-xr-x  1 root  wheel   307232 Feb  9 01:39:53 2023 /bin/dash
-r-xr-xr-x  1 root  wheel  2598864 Feb  9 01:39:53 2023 /bin/ksh
-rwxr-xr-x  1 root  wheel   134000 Feb  9 01:39:53 2023 /bin/sh
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/tcsh
-rwxr-xr-x  1 root  wheel  1377616 Feb  9 01:39:53 2023 /bin/zsh

But in each of these shells, even bash, $SHELL still reports:

% echo $SHELL
/bin/zsh


Because "find" is not part of the kernel, its behavior
may vary across Linux distributions.
Picking one . . .

openSUSE tumbleweed:

       -name pattern
              Base of file name (the path with the leading directories
              removed) matches shell pattern pattern.  Because the leading
              directories are removed, the file names considered for a match
              with -name will never include a slash, so `-name a/b' will
              never match anything (you probably need to use -path instead).
              A warning is issued if you try to do this, unless the
              environment variable POSIXLY_CORRECT is set.  The
              metacharacters (`*', `?', and `[]') match a `.' at the start of
              the base name (this is a change in findutils-4.2.2; see section
              STANDARDS CONFORMANCE below).  To ignore a directory and the
              files under it, use -prune rather than checking every file in
              the tree; see an example in the description of that action.
              Braces are not recognised as being special, despite the fact
              that some shells including Bash imbue braces with a special
              meaning in shell patterns.  The filename matching is performed
              with the use of the fnmatch(3) library function.  Don't forget
              to enclose the pattern in quotes in order to protect it from
              expansion by the shell.

"man 3 fnmatch" says:

       The fnmatch() function checks whether the string argument matches
       the pattern argument, which is a shell wildcard pattern (see
       glob(7)).

"man 7 glob" (not shell specific) in turn has a section on
"Character classes and internationalization" that reports:

QUOTE
. . .
. . . Therefore, POSIX extended the bracket notation greatly, both for
       wildcard patterns and for regular expressions.  In the above we saw
       three types of items that can occur in a bracket expression: namely
       (i) the negation, (ii) explicit single characters, and (iii) ranges.
       POSIX specifies ranges in an internationally more useful way and
       adds three more types:

       (iii) Ranges X-Y comprise all characters that fall between X and Y
       (inclusive) in the current collating sequence as defined by the
       LC_COLLATE category in the current locale.

       (iv) Named character classes, like

       [:alnum:]  [:alpha:]  [:blank:]  [:cntrl:]
       [:digit:]  [:graph:]  [:lower:]  [:print:]
       [:punct:]  [:space:]  [:upper:]  [:xdigit:]

       so that one can say "[[:lower:]]" instead of "[a-z]", and have
       things work in Denmark, too, where there are three letters past 'z'
       in the alphabet.  These character classes are defined by the
       LC_CTYPE category in the current locale.

       (v) Collating symbols, like "[.ch.]" or "[.a-acute.]", where the
       string between "[." and ".]" is a collating element defined for the
       current locale.  Note that this may be a multicharacter element.

       (vi) Equivalence class expressions, like "[=a=]", where the string
       between "[=" and "=]" is any collating element from its equivalence
       class, as defined for the current locale.  For example, "[[=a=]]"
       might be equivalent to "[aáàäâ]", that is, to
       "[a[.a-acute.][.a-grave.][.a-umlaut.][.a-circumflex.]]".
END QUOTE
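
For what it is worth, item (iv) can be tried in a shell's own pattern
matcher, with find left out of the picture. What follows is only a
minimal sketch, assuming a POSIX sh whose "case" supports named
classes. The helper name classify is made up for illustration, and
LC_ALL=C is set just to keep the locale out of the result:

```shell
#!/bin/sh
# Sketch: classify a name by its first character using named classes.
# classify is a hypothetical helper, not part of any shell or of find(1).
LC_ALL=C
export LC_ALL

classify() {
    case $1 in
        [[:upper:]]*) echo "upper" ;;   # first character is uppercase
        [[:lower:]]*) echo "lower" ;;   # first character is lowercase
        *)            echo "other" ;;   # anything else (digit, punctuation, empty)
    esac
}

# Same two names as in the original report, plus one starting with a digit.
for f in FOO bar 9lives; do
    printf '%s: %s\n' "$f" "$(classify "$f")"
done
```

This only exercises the shell's matcher. Whether find agrees is a
separate question, since the findutils man page quoted above says
it calls fnmatch(3).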

# file /usr/bin/sh
/usr/bin/sh: symbolic link to bash


It seems that the shell you pick (as shown by echo $SHELL)
determines the pattern matching rules used. (This may be
controllable within the specific shell.)
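
One way to compare systems would be to rerun the original FOO/bar
experiment with the locale passed explicitly. The following is a
rough sketch, not a definitive test. The names probe_dir and
find_names are made up for illustration, and the en_US.UTF-8 run
assumes that locale is installed:

```shell
#!/bin/sh
# Sketch: repeat the FOO/bar experiment in a scratch directory.
# probe_dir and find_names are hypothetical names used only here.

probe_dir=$(mktemp -d) || exit 1
trap 'rm -rf "$probe_dir"' EXIT
touch "$probe_dir/FOO" "$probe_dir/bar"

# find_names LOCALE PATTERN
# Run find with LC_ALL set to LOCALE and list names matching PATTERN.
# The output is sorted so that runs can be compared line by line.
find_names() {
    (cd "$probe_dir" && env LC_ALL="$1" find . -name "$2" -print | sort)
}

echo "range, C locale:";     find_names C '[A-Z]*'
echo "class, C locale:";     find_names C '[[:upper:]]*'
# Assumes en_US.UTF-8 is installed; results may differ between systems.
echo "range, en_US.UTF-8:";  find_names en_US.UTF-8 '[A-Z]*'
echo "class, en_US.UTF-8:";  find_names en_US.UTF-8 '[[:upper:]]*'
```

Running the same script under different shells on one machine, and
then on different machines, would show whether the results follow
the shell or the find binary.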

===
Mark Millard
marklmi at yahoo.com

