On 3/30/25 2:34 AM, Robert Elz wrote:
     Date:        Thu, 27 Mar 2025 17:22:03 -0400
     From:        Chet Ramey <chet.ra...@case.edu>
     Message-ID:  <6da17a73-2aac-4fa5-9fa7-5bfff087d...@case.edu>

   | The shell should assume that setting a shell variable means the
   | user wants to modify the shell's locale settings.

Yes, of course, the question is exactly what the user wants &/or what
the shell should impose on the user.

   | > One answer would be "nothing"
   | A bad choice.

I agree, though that would make the shell operate the way just
about all other programs behave, for example, if one does

        ENVIRON["LC_NUMERIC"] = "whatever"

in awk, while that should appear in the environment of something
awk runs for a system() operation, it doesn't affect the way awk
works, or its numeric input or output, any way at all.

   | The shell does quite a lot of things that are different than any other
   | application, including allowing the user to change locale environment
   | variables.

Yes, but it is not entirely alone in that.  awk/make/env/... allow the
environment to be altered, but none of them notice that it is a locale
environment variable that is being altered, and adapt their own behaviour
to match, or none that I'm aware of, so having shells do the same would
not necessarily be all that remarkable, even if it is perhaps sub-optimal.

That's obviously an implementation choice, and can happen to various
degrees. For instance, GNU awk understands the TEXTDOMAIN variable and
provides builtin functions for translating messages, but doesn't look at
LC_* or LANG except at startup. There's not really much in make that would
be modified based on changing locale variables, and nothing at all in env.
However, languages like, say, python that allow users to set variables do
offer methods to change the interpreter behavior based on programmers
setting variables or calling builtin locale functions.


   | That doesn't have to be all the shell does. The precedence hierarchy is
   | well-understood; there's nothing stopping the shell from implementing it:

No there's not, but I'm not sure that's the ideal result.   For example,
if a script does LC_CTYPE=C (or something similar) should it break if the
user happened to have set LC_ALL=something-different in the environment?

What does `break' mean?

If any other program does setlocale(LC_CTYPE, "C"); that has effect
for any future operations, regardless of what might have been in the
environment, why should a shell script be different?

Because the shell exposes the locale settings in a different way, offering
the user more flexibility?


   | noting that LC_ALL is set as a shell variable and making the right call
   | to setlocale() to make sure it overrides LC_CTYPE. I'd argue that having
   | just set LC_ALL, this is what the user expects here.

That's something like what I tried in my first attempt to implement
this, but it turns out not to really work very well.

I think it works just fine, and bash users expect this behavior.
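
A minimal sketch of the precedence under discussion, written as a shell function. The name effective_locale is hypothetical, and this only models which variable would win for a category; it is not a description of how bash implements it internally.

```shell
# Hypothetical helper: print the value that would apply to one locale
# category, following the usual precedence: LC_ALL, then the category's
# own variable, then LANG, then the default "C" locale.
effective_locale() {
    category=$1                     # e.g. LC_CTYPE
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
        return
    fi
    eval "value=\${$category:-}"    # indirect lookup of $LC_CTYPE etc.
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}
```

With LC_ALL=POSIX and LC_CTYPE=C both set as shell variables, effective_locale LC_CTYPE prints POSIX; after unset LC_ALL it prints C.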

Certainly if
the user actually sets LC_ALL that is the effect it should have, but
if LC_ALL was set sometime much earlier (perhaps weeks earlier in an
interactive shell) what do you believe is the intent if that user later
sets LC_CTYPE (or any of them) - or if some script that is run does that?

I believe that users have agency, and that bash should trust that they know
what they're doing.

I know setting LC_ALL is something of a sledge hammer operation, which
isn't often appropriate, but it is also a fairly common example that
people copy (it avoids needing to work out which specific category
is affecting some operation or other).

OK. There's plenty of copypasta that has tripped up users before. Can the
shell protect them from that?


   | You're assuming a certain behavior and going on from there. The shell
   | doesn't have to do it that way.

Again, no it doesn't, but it seemed to me when I tried it, that this
way gives the most desirable outcome.

   | I'd argue that the shell should modify the locale categories that affect
   | its behavior.

How do we know which they are?   That is, locale settings can affect
libc operations in some cases, and if we're writing portable code
(which bash at least attempts to do, the NetBSD shell a little less
so) can you really be sure that some libc function that is being
used won't be affected by a locale setting that you've never heard of?

You really can't. All you can do is document the locale categories you
use, and which ones you modify based on shell variables. If the user sets
an environment variable, say, LC_NAME (one of the GNU libc extensions),
that's not going to affect the shell's behavior because the shell won't
call setlocale(LC_NAME, whatever). If LC_NAME is in the environment when
the shell starts, and setlocale(LC_ALL, "") pays attention to it, I'd
argue that that's the user's intent.


That's no issue at all if the shell just does setlocale(LC_ALL, "");
as part of its init sequence (or nothing at all related to locales,
and gets C everywhere) but if you start modifying some of them, but not
others, how well will that really work?

Well enough that I don't get bug reports about it. Users and shell
programmers can get done what they need.


   | That's a tricky business, no doubt, but it bounds the
   | effects (or you could just pay attention to all the categories that POSIX
   | defines).

That would certainly limit things, and is what my current implementation
does - if I were confident enough, I could drop the ones (from being
explicitly operated upon by the shell, not from being set/exported like
any other var) that I have no reason to believe the shell itself ever
cares about (like LC_MONETARY for example) but that is a bit of a gamble.

I don't modify locale settings based on $LC_MONETARY.


   | Plus there's nothing in POSIX that I can see that allows locale
   | definitions to add additional categories.

Lawrence already supplied the references for that (and my thanks for
saving me from needing to do that) and also showed that implementations
do actually create more of them (most of which probably are irrelevant
to the workings of the shell, but I can't be certain, and I most certainly
don't know that of other, unknown to me, extra locale categories).

Yes, glibc, at least, adds some new LC_ variables, but it appears to be an
outlier here.

   | Since LC_ALL is being unset, you can go through all the locale categories
   | you know about and set them appropriately. If it's one of the other LC_
   | variables being unset, you can just change that one.

Yes, that's what I am doing, is that what bash does?

Yes, more or less.
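
A rough sketch of that walk over the known categories when LC_ALL is unset. The category list is an assumption (the ones POSIX defines), and unset_lc_all is a hypothetical name; a real shell would call setlocale() for each category rather than print the result.

```shell
# Assumed list: the locale categories POSIX defines.
categories="LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME"

# Hypothetical helper: unset LC_ALL, then print the value each category
# falls back to (its own variable, then LANG, then "C").
# A real shell would call setlocale() for each one instead of printing.
unset_lc_all() {
    unset LC_ALL
    for category in $categories; do
        eval "value=\${$category:-\${LANG:-C}}"
        printf '%s=%s\n' "$category" "$value"
    done
}
```

With only LANG=POSIX and LC_CTYPE=C set, the output shows LC_CTYPE=C and every other category as POSIX.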


On the other hand, if I recall correctly (and I might not, the actual
code for this was from late last year) if LANG gets unset, then nothing
gets updated, even categories that only got a value from the LANG setting.
I think there was a reason for that, but just now, I don't remember it.

I'm not sure why you'd ignore changes to LANG, especially if LC_ALL is not
set.


   | > Further does it make any difference if these vars are being set in
   | > the shell, but not exported?
   | I'd argue that the user wants to change the shell's behavior.

I'd agree.

   | > Also, please consider what built-in utilities are supposed to do
   | > with all of this.  Do those just use the shell's locale settings?
   |
   | Yes, they are builtins and documented as such.

They are, but they're still supposed to behave the same way as the
non-builtin version, at least to the extent that an external version
can operate given that the external version is unable to push changes
back into the shell environment.   But ignoring those, the two should
operate the same way.

I'm not that dogmatic. If printf, for instance, is a builtin, it should
operate using the shell's execution environment, not just pay attention
to exported variables. This isn't just an issue with LC_*, the bash printf
pays attention to the shell's TZ variable also.


so if I do

        unset LC_ALL            # do not want it, and certainly not exported
        unset LC_CTYPE

        LC_CTYPE=en_AU.UTF-8

        printf ....

that printf should ignore the LC_CTYPE setting, which hasn't been
exported to it.

Nope, bash doesn't do it that way, and I'd argue that users expect the
bash behavior.



As above, I don't really think restricting sh code to be
more limited than what code written in other languages can do is what
users really want, so I don't think a rule like "if LC_ALL is set, even
if not exported, it overrides LC_everything_else (even if those are
exported)" for the shell itself is appropriate, and if LC_ALL is not
exported, certainly not for builtin commands.

OK, we're going to disagree here.



   | > If the builtins should act just as if they were external commands,
   | > and given a major purpose of being builtin is to avoid forking
   | > (and some operations of builtins cannot work if they do fork)
   | > then how is the shell intended to save and restore its locale
   | > environment so the builtin can set its own?
   |
   | I say you don't bother. Users expect the variables they set in a shell
   | session to affect that shell session.

You're saying that builtins should simply use shell variables,
ignoring whether they're exported or not, and so act differently
than an external version of the command would act ?

Yes.


   | I'd say that's up to the implementation. (And are you really saying that
   | these hypothetical extra digits affect `break's treatment of its
   | argument as a "positive decimal integer?")

break's treatment of the value, no, of course not.   Whether break
recognises the values as digits, then yes, though not what I am
"saying", what I am asking.   If we're implementing the locales properly
(which the NetBSD shell certainly does not, even with my current
uncommitted changes) then it should recognise them.

        If I write

                break ๒

in a bash script with LANG=th set in the environment (and no other
locale settings) is bash going to interpret that correctly?

It depends on what libc does. Bash just uses strtoimax/strtol as
appropriate.

The Thai digits are 0E50 .. 0E59 (which have all the properties
of regular Arabic numerals that we're all used to), from the
NetBSD shell (bash can't execute this one, yet anyway):

Does libc understand those as digits? Does strtoimax() use isdigit() and
understand how to convert those to a numeric value?

As an additional example of where locales can affect results,
in Thai, the digits collate after the alphabetic chars, not before
as they do in latin character sets (including ASCII) and there
is no upper/lower case distinction for alphas, no concept of
character case at all.

So pathname expansion should use strcoll().
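
A small illustration of ordering in pathname expansion. It uses only the C locale, where digits sort before upper case and upper case before lower case; the function name, scratch directory, and file names are made up for the example, and the order in any other locale depends on that system's locale data.

```shell
# Hypothetical demo: list a scratch directory's entries in the order the
# shell's pathname expansion produces under the C locale.
glob_order_in_c() (
    # Subshell body, so the locale change and cd do not leak out.
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    : > a; : > B; : > 1
    LC_ALL=C
    set -- *
    printf '%s\n' "$*"
    cd / && rm -rf "$dir"
)
```

Calling glob_order_in_c prints "1 B a". A locale whose collation rules place digits after letters would be expected to order the same names differently.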


   | See above. It's easy to explain to a user that setting LC_ALL overrides
   | everything else.

Yes, it is easy to explain that, but is it as easy to justify, given that
in other languages, it is the most recent set that works, as that's how
setlocale(3) behaves, which is the underlying interface into the
locale system?

There are examples either way; it all depends on what you mean by "other
languages." python lets you set LC_ALL and override all the locale
categories.

How do you justify that shell code should behave
differently?

Why do I have to? Why is it not sufficient that the shell does what it does
for the benefit of its users? I don't think that the highest priority is
that the shell should do things like other programs; if you think so, we'll
disagree on that as well.

  One might argue that the only time LC_ALL is intended
to override the other settings, is when C code (or equiv) does
        setlocale(LC_ALL, "");
which is when the locale system pays attention to what is in the
environment.  That's generally a startup time operation only.

OK.


ps: one more obscure question about how bash handles locales.

POSIX says that if the LC_MESSAGES category is changed, after
message catalogs are opened, the change to the env var doesn't
necessarily affect those already open message catalogs.  Assuming
bash has had a need to run strerror() (or similar) and so has
the <errno.h> translations catalog open, and the user then
sets LC_MESSAGES because they got English versions of the error msg,
and they want French, does bash handle that somehow?   If so, how?

It relies on the C library. For instance, using glibc on RHEL9, given
this script:

$ cat x12
LC_MESSAGES=de_DE.UTF-8
. nosuchfile
LC_MESSAGES=fr_FR.UTF-8
. nosuchfile

you get

$ ./bash ./x12
./x12: line 2: nosuchfile: Datei oder Verzeichnis nicht gefunden
./x12: line 4: nosuchfile: Aucun fichier ou dossier de ce type

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
