    Date:        Thu, 27 Mar 2025 17:22:03 -0400
    From:        Chet Ramey <chet.ra...@case.edu>
    Message-ID:  <6da17a73-2aac-4fa5-9fa7-5bfff087d...@case.edu>

  | The shell should assume that setting a shell variable means the
  | user wants to modify the shell's locale settings.

Yes, of course, the question is exactly what the user wants &/or what
the shell should impose on the user.

  | > One answer would be "nothing"
  | A bad choice.

I agree, though that would make the shell operate the way just
about all other programs behave, for example, if one does

        ENVIRON["LC_NUMERIC"] = "whatever"

in awk, while that should appear in the environment of something
awk runs for a system() operation, it doesn't affect the way awk
works, or its numeric input or output, in any way at all.

  | The shell does quite a lot of things that are different than any other
  | application, including allowing the user to change locale environment
  | variables.

Yes, but it is not entirely alone in that.  awk/make/env/... allow the
environment to be altered, but none of them notice that it is a locale
environment variable that is being altered, and adapt their own behaviour
to match, or none that I'm aware of, so having shells do the same would
not necessarily be all that remarkable, even if it is perhaps sub-optimal.

  | That doesn't have to be all the shell does. The precedence hierarchy is
  | well-understood; there's nothing stopping the shell from implementing it:

No there's not, but I'm not sure that's the ideal result.   For example,
if a script does LC_CTYPE=C (or something similar) should it break if the
user happened to have set LC_ALL=something-different in the environment?

If any other program does setlocale(LC_CTYPE, "C"); that has effect
for any future operations, regardless of what might have been in the
environment, why should a shell script be different?

  | noting that LC_ALL is set as a shell variable and making the right call
  | to setlocale() to make sure it overrides LC_CTYPE. I'd argue that having
  | just set LC_ALL, this is what the user expects here.

That's something like what I tried in my first attempt to implement
this, but it turns out not to really work very well.  Certainly if
the user actually sets LC_ALL that is the effect it should have, but
if LC_ALL was set sometime much earlier (perhaps weeks earlier in an
interactive shell) what do you believe is the intent if that user later
sets LC_CTYPE (or any of them) - or if some script that is run does that?

I know setting LC_ALL is something of a sledge hammer operation, which
isn't often appropriate, but it is also a fairly common example that
people copy (it avoids needing to work out which specific category
is affecting some operation or other).

  | You're assuming a certain behavior and going on from there. The shell
  | doesn't have to do it that way.

Again, no it doesn't, but it seemed to me when I tried it, that this
way gives the most desirable outcome.

  | I'd argue that the shell should modify the locale categories that affect
  | its behavior.

How do we know which they are?   That is, locale settings can affect
libc operations in some cases, and if we're writing portable code
(which bash at least attempts to do, the NetBSD shell a little less
so) can you really be sure that some libc function that is being
used won't be affected by a locale setting that you've never heard of?

That's no issue at all if the shell just does setlocale(LC_ALL, "");
as part of its init sequence (or nothing at all related to locales,
and gets C everywhere) but if you start modifying some of them, but not
others, how well will that really work?

  | That's a tricky business, no doubt, but it bounds the
  | effects (or you could just pay attention to all the categories that POSIX
  | defines).

That would certainly limit things, and is what my current implementation
does - if I were confident enough, I could drop the ones (from being
explicitly operated upon by the shell, not from being set/exported like
any other var) that I have no reason to believe the shell itself ever
cares about (like LC_MONETARY for example) but that is a bit of a gamble.

  | Plus there's nothing in POSIX that I can see that allows locale
  | definitions to add additional categories.

Lawrence already supplied the references for that (and my thanks for
saving me from needing to do that) and also showed that implementations
do actually create more of them (most of which probably are irrelevant
to the workings of the shell, but I can't be certain, and I most certainly
don't know that of other, unknown to me, extra locale categories).

  | Note that one of the LC_ variables is being unset and act appropriately?

Yes, that's more or less what my current (uncommitted) code does.
But it is all a bit of a mess, and not easy to get right.

  | Since LC_ALL is being unset, you can go through all the locale categories
  | you know about and set them appropriately. If it's one of the other LC_
  | variables being unset, you can just change that one.

Yes, that's what I am doing, is that what bash does?

On the other hand, if I recall correctly (and I might not, the actual
code for this was from late last year) if LANG gets unset, then nothing
gets updated, even categories that only got a value from the LANG setting.
I think there was a reason for that, but just now, I don't remember it.
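
The category-by-category handling discussed above (LC_ALL first, then
the specific variable, then LANG) can be sketched as a simple lookup.
This is only an illustration under stated assumptions: effective_locale
is a hypothetical name, it reads shell variables whether or not they are
exported, it treats an empty value as unset, and it neither calls
setlocale() nor checks that the named locale exists.  It is not a
description of what bash or the NetBSD shell actually does.

```sh
# Hypothetical helper: print the locale name a category would get
# under the precedence LC_ALL > LC_<category> > LANG > "C".
# $1 is a category variable name such as LC_CTYPE.
effective_locale() {
    # Fetch the value of the variable whose name is in $1 (empty if unset).
    eval "_cat_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$_cat_value" ]; then
        printf '%s\n' "$_cat_value"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' C
    fi
}

unset LC_ALL LC_CTYPE LANG
LANG=POSIX
effective_locale LC_CTYPE      # value comes from LANG
LC_ALL=C
effective_locale LC_CTYPE      # LC_ALL now wins
unset LC_ALL
effective_locale LC_CTYPE      # the category falls back to LANG again
```

A shell that wanted "most recent assignment wins" instead would need
to remember the order of assignments, which this lookup does not do.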

  | > Further does it make any difference if these vars are being set in
  | > the shell, but not exported?
  | I'd argue that the user wants to change the shell's behavior.

I'd agree.

  | > Also, please consider what built-in utilities are supposed to do
  | > with all of this.  Do those just use the shell's locale settings?
  |
  | Yes, they are builtins and documented as such.

They are, but they're still supposed to behave the same way as the
non-builtin version, at least to the extent that an external version
can operate given that the external version is unable to push changes
back into the shell environment.   But ignoring those, the two should
operate the same way.

So if I do

        unset LC_ALL            # do not want it, and certainly not exported
        unset LC_CTYPE

        LC_CTYPE=en_AU.UTF-8

        printf ....

that printf should ignore the LC_CTYPE setting, which hasn't been
exported to it.   Just as

        ( exec printf .... )

would do.   That printf should be doing something like

        setlocale(LC_ALL, "")

and so should the built-in one, which should result in whatever
is in the environment (which we know here does not include LC_ALL
or LC_CTYPE) as the locale settings (maybe LANG is set in the
environment, and printf's LC_CTYPE will come from that).
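
What an external printf would actually find can be checked with a
small sketch.  seen_by_child is a hypothetical helper; it assumes that
sh is on PATH and that allexport (set -a) is off.  It starts a child
process and reports whether a given variable reached its environment.
The value C is used only so that the example does not depend on which
locales are installed.

```sh
# Hypothetical helper: start a child sh and print the value it finds
# in its environment for the variable named by $1, or "unset".
seen_by_child() {
    sh -c 'eval "v=\${$1-unset}"; printf "%s\n" "$v"' sh "$1"
}

unset LC_ALL LC_CTYPE
LC_CTYPE=C                 # shell variable only, not exported
seen_by_child LC_CTYPE     # the child finds nothing for it
export LC_CTYPE
seen_by_child LC_CTYPE     # now the child sees the value
```

The question for a built-in is which of those two views it should
take when the variable has been set but not exported.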

  | Yes, if the user sets it up this way, that is what will happen, but it's
  | unlikely.

It is unlikely to the extent that few scripts bother to do anything with
the environment at all (and for that reason, many are broken: anything
which uses a range in a [] in a pattern should usually care about what
the environment is, as those don't always work as expected otherwise).
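
For instance, a script that wants [a-z] to mean only the ASCII lower
case letters can pin the locale around the match.  This is a sketch,
and it assumes either that assigning LC_ALL inside the shell changes
the shell's own pattern matching, or that the shell matches bytewise
regardless; that assumption is exactly the behaviour under discussion
here.  is_lower_ascii is a hypothetical name.

```sh
# Succeed only if $1 is a single ASCII lower case letter.
# The subshell keeps the LC_ALL assignment from leaking into the caller.
is_lower_ascii() {
    (
        LC_ALL=C
        case $1 in
            [a-z]) exit 0 ;;
            *)     exit 1 ;;
        esac
    )
}
```

The subshell costs a fork; a script that cares about that could save
and restore LC_ALL itself instead.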

  | The value of LC_ALL should be used, since its precedence is higher.

Yes, I'd agree, but that means either implementing it that way for the
whole shell, or changing the locale (there is, in general, just one,
other than for code using the kind of sporadic _l versions of some
libc functions) of the shell before running printf, and then restoring
it after.

As above, I don't really think users want sh code to be more limited
than code written in other languages, so I don't think a rule like
"if LC_ALL is set, even if not exported, it overrides LC_everything_else
(even if those are exported)" is appropriate for the shell itself, and
if LC_ALL is not exported, certainly not for builtin commands.

  | > Including if I explicitly do:
  | >   LC_NUMERIC=xxx printf ...
  | > (with LC_ALL still set in the environment).
  |
  | Nope. Even the external version of the command would have LC_ALL
  | override the temporary assignment to LC_NUMERIC.

That's obvious, there's no "even" there, the question is how the
internal (built-in) printf behaves, actual external commands are
trivial to understand by comparison.   A better example for this
though is where LC_ALL is set in the shell but not exported.

  | > If the builtins should act just as if they were external commands,
  | > and given a major purpose of being builtin is to avoid forking
  | > (and some operations of builtins cannot work if they do fork)
  | > then how is the shell intended to save and restore its locale
  | > environment so the builtin can set its own? 
  |
  | I say you don't bother. Users expect the variables they set in a shell
  | session to affect that shell session.

You're saying that builtins should simply use shell variables,
ignoring whether they're exported or not, and so act differently
than an external version of the command would act ?

  | I'd say that's up to the implementation. (And are you really saying that
  | these hypothetical extra digits affect `break's treatment of its
  | argument as a "positive decimal integer?")

break's treatment of the value, no, of course not.   Whether break
recognises the values as digits, then yes, though that is not what I am
"saying", it is what I am asking.   If we're implementing the locales
properly (which the NetBSD shell certainly does not, even with my current
uncommitted changes) then it should recognise them.

        If I write

                break ๒

in a bash script with LANG=th set in the environment (and no other
locale settings) is bash going to interpret that correctly?

That's one of the "hypothetical extra digits" that really do exist
(it is $'\u0E52' if the character isn't represented properly in
someone's mailer).

The Thai digits are 0E50 .. 0E59 (which have all the properties
of regular Arabic numerals that we're all used to), from the
NetBSD shell (bash can't execute this one, yet anyway):

        echo {$'\u0e50'..$'\u0e59'}
        ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙

which mean the same as 0..9.   (For anyone about to race out
to check this form of brace expansion in the NetBSD shell, or
brace expansion there at all, don't.  The code for that is also
not yet committed, and I'm not sure it ever will be).
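
To make the "mean the same as 0..9" point concrete, here is a sketch
that maps a single Thai digit to its value by literal comparison, so it
does not depend on the locale at all.  thai_digit_value is a
hypothetical helper, not something any shell provides, and it says
nothing about how break or shell arithmetic actually treat such
characters.

```sh
# Hypothetical helper: print the numeric value of one Thai digit
# (U+0E50 .. U+0E59); fail for anything else, including ASCII digits.
thai_digit_value() {
    case $1 in
        ๐) echo 0 ;;
        ๑) echo 1 ;;
        ๒) echo 2 ;;
        ๓) echo 3 ;;
        ๔) echo 4 ;;
        ๕) echo 5 ;;
        ๖) echo 6 ;;
        ๗) echo 7 ;;
        ๘) echo 8 ;;
        ๙) echo 9 ;;
        *) return 1 ;;
    esac
}
```

A locale-aware shell would presumably get this from the locale's
character classification rather than from a fixed table like this one.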

As an additional example of where locales can affect results,
in Thai, the digits collate after the alphabetic chars, not before
as they do in Latin character sets (including ASCII) and there
is no upper/lower case distinction for alphas, no concept of
character case at all.

  | See above. It's easy to explain to a user that setting LC_ALL overrides
  | everything else.

Yes, it is easy to explain that, but is it as easy to justify, given that
in other languages, it is the most recent set that works, as that's how
setlocale(3) behaves, which is the underlying interface into the
locale system?   How do you justify that shell code should behave
differently?   One might argue that the only time LC_ALL is intended
to override the other settings, is when C code (or equiv) does
        setlocale(LC_ALL, "");
which is when the locale system pays attention to what is in the
environment.  That's generally a startup time operation only.

And also please note, in this, I am trying to work out what
should be done, so that if the locale stuff I currently have
implemented does get committed (and those changes are more likely
than the braceexpand stuff, as they are in response to a bug report
that was submitted) then I can commit (and document) what is
generally agreed to be the correct behaviour.   I am asking
questions (while also giving my current opinion on what some
of the answers might be).

kre

ps: one more obscure question about how bash handles locales.

POSIX says that if the LC_MESSAGES category is changed, after
message catalogs are opened, the change to the env var doesn't
necessarily affect those already open message catalogs.  Assuming
bash has had a need to run strerror() (or similar) and so has
the <errno.h> translations catalog open, and the user then
sets LC_MESSAGES because they got English versions of the error msg,
and they want French, does bash handle that somehow?   If so, how?

