On 3/30/25 2:34 AM, Robert Elz wrote:
Date: Thu, 27 Mar 2025 17:22:03 -0400
From: Chet Ramey <chet.ra...@case.edu>
Message-ID: <6da17a73-2aac-4fa5-9fa7-5bfff087d...@case.edu>
| The shell should assume that setting a shell variable means the
| user wants to modify the shell's locale settings.
Yes, of course, the question is exactly what the user wants &/or what the shell should impose on the user.
| > One answer would be "nothing"
| A bad choice.
I agree, though that would make the shell operate the way just about all other programs behave, for example, if one does ENVIRON["LC_NUMERIC"] = "whatever" in awk, while that should appear in the environment of something awk runs for a system() operation, it doesn't affect the way awk works, or its numeric input or output, any way at all.
| The shell does quite a lot of things that are different than any other
| application, including allowing the user to change locale environment
| variables.
Yes, but it is not entirely alone in that. awk/make/env/... allow the environment to be altered, but none of them notice that it is a locale environment variable that is being altered, and adapt their own behaviour to match, or none that I'm aware of, so having shells do the same would not necessarily be all that remarkable, even if it is perhaps sub-optimal.
That's obviously an implementation choice, and can happen to various degrees. For instance, GNU awk understands the TEXTDOMAIN variable and provides builtin functions for translating messages, but doesn't look at LC_* or LANG except at startup. There's not really much in make that would be modified based on changing locale variables, and nothing at all in env. However, languages like, say, python that allow users to set variables do offer methods to change the interpreter behavior based on programmers setting variables or calling builtin locale functions.
| That doesn't have to be all the shell does. The precedence hierarchy is
| well-understood; there's nothing stopping the shell from implementing it:
No there's not, but I'm not sure that's the ideal result. For example, if a script does LC_CTYPE=C (or something similar) should it break if the user happened to have set LC_ALL=something-different in the environment?
What does `break' mean?
If any other program does setlocale(LC_CTYPE, "C"); that has effect for any future operations, regardless of what might have been in the environment, why should a shell script be different?
Because the shell exposes the locale settings in a different way, offering the user more flexibility?
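The setlocale() behavior described above can be checked from any language that wraps the C call. A minimal sketch using Python's locale module, chosen only because it exposes setlocale() directly; the LC_ALL value is an arbitrary example and does not need to be installed, since an explicit locale name never consults the environment:

```python
import locale
import os

# Arbitrary example value; setlocale() with an explicit name never reads it.
os.environ["LC_ALL"] = "en_AU.UTF-8"

# Same as setlocale(LC_CTYPE, "C") in C: an explicit locale name is given.
locale.setlocale(locale.LC_CTYPE, "C")

# Calling setlocale() without a locale argument queries the current setting.
current_ctype = locale.setlocale(locale.LC_CTYPE)
print(current_ctype)
```

Only a call that passes the empty string, such as setlocale(LC_ALL, ""), looks at LC_ALL, the LC_* variables and LANG.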
| noting that LC_ALL is set as a shell variable and making the right call
| to setlocale() to make sure it overrides LC_CTYPE. I'd argue that having
| just set LC_ALL, this is what the user expects here.
That's something like what I tried in my first attempt to implement this, but it turns out not to really work very well.
I think it works just fine, and bash users expect this behavior.
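The precedence hierarchy under discussion (LC_ALL, then the individual LC_* variable, then LANG) can be modeled in a few lines. This is a sketch of the lookup order only, not bash's code: `effective_locale` and the dictionary of variables are hypothetical names, and the final "C" is an assumption standing in for the implementation-defined default.

```python
def effective_locale(variables, category):
    """Return the locale name that governs `category`.

    `variables` maps variable names to values (exported or not);
    `category` is a name such as "LC_CTYPE".
    """
    for name in ("LC_ALL", category, "LANG"):
        value = variables.get(name)
        if value:  # unset and empty values are both skipped
            return value
    return "C"  # assumption: stand-in for the implementation-defined default


# LC_ALL set earlier wins over a later LC_CTYPE assignment.
print(effective_locale({"LC_ALL": "en_AU.UTF-8", "LC_CTYPE": "C"}, "LC_CTYPE"))
```

Under this model the order in which the variables were assigned is irrelevant, which is exactly the property being debated: a setlocale() call acts on the most recent request, while the variable lookup always lets LC_ALL win.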
Certainly if the user actually sets LC_ALL that is the effect it should have, but if LC_ALL was set sometime much earlier (perhaps weeks earlier in an interactive shell) what do you believe is the intent if that user later sets LC_CTYPE (or any of them) - or if some script that is run does that?
I believe that users have agency, and that bash should trust that they know what they're doing.
I know setting LC_ALL is something of a sledge hammer operation, which isn't often appropriate, but it is also a fairly common example that people copy (it avoids needing to work out which specific category is affecting some operation or other).
OK. There's plenty of copypasta that has tripped up users before. Can the shell protect them from that?
| You're assuming a certain behavior and going on from there. The shell
| doesn't have to do it that way.
Again, no it doesn't, but it seemed to me when I tried it, that this way gives the most desirable outcome.
| I'd argue that the shell should modify the locale categories that affect
| its behavior.
How do we know which they are? That is, locale settings can affect libc operations in some cases, and if we're writing portable code (which bash at least attempts to do, the NetBSD shell a little less so) can you really be sure that some libc function that is being used won't be affected by a locale setting that you've never heard of?
You really can't. All you can do is document the locale categories you use, and which ones you modify based on shell variables. If the user sets an environment variable, say, LC_NAME (one of the GNU libc extensions), that's not going to affect the shell's behavior because the shell won't call setlocale(LC_NAME, whatever). If LC_NAME is in the environment when the shell starts, and setlocale(LC_ALL, "") pays attention to it, I'd argue that that's the user's intent.
That's no issue at all if the shell just does setlocale(LC_ALL, ""); as part of its init sequence (or nothing at all related to locales, and gets C everywhere) but if you start modifying some of them, but not others, how well will that really work?
Well enough that I don't get bug reports about it. Users and shell programmers can get done what they need.
| That's a tricky business, no doubt, but it bounds the
| effects (or you could just pay attention to all the categories that POSIX
| defines).
That would certainly limit things, and is what my current implementation does - if I were confident enough, I could drop the ones (from being explicitly operated upon by the shell, not from being set/exported like any other var) that I have no reason to believe the shell itself ever cares about (like LC_MONETARY for example) but that is a bit of a gamble.
I don't modify locale settings based on $LC_MONETARY.
| Plus there's nothing in POSIX that I can see that allows locale
| definitions to add additional categories.
Lawrence already supplied the references for that (and my thanks for saving me from needing to do that) and also showed that implementations do actually create more of them (most of which probably are irrelevant to the workings of the shell, but I can't be certain, and I most certainly don't know that of other, unknown to me, extra locale categories).
Yes, glibc, at least, adds some new LC_ variables, but it appears to be an outlier here.
| Since LC_ALL is being unset, you can go through all the locale categories
| you know about and set them appropriately. If it's one of the other LC_
| variables being unset, you can just change that one.
Yes, that's what I am doing, is that what bash does?
Yes, more or less.
On the other hand, if I recall correctly (and I might not, the actual code for this was from late last year) if LANG gets unset, then nothing gets updated, even categories that only got a value from the LANG setting. I think there was a reason for that, but just now, I don't remember it.
I'm not sure why you'd ignore changes to LANG, especially if LC_ALL is not set.
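The unset handling can be sketched the same way. This is a model under stated assumptions, not bash's or the NetBSD shell's code: `unset_and_refresh` and `lookup` are hypothetical names, the category list is the POSIX set (bash, as noted, does not act on all of them), and "C" stands in for the implementation-defined default. Unsetting LC_ALL or LANG recomputes every known category; unsetting one LC_* variable recomputes only that category.

```python
POSIX_CATEGORIES = (
    "LC_COLLATE", "LC_CTYPE", "LC_MESSAGES",
    "LC_MONETARY", "LC_NUMERIC", "LC_TIME",
)


def lookup(variables, category):
    """LC_ALL, then the category variable, then LANG, then a default."""
    for name in ("LC_ALL", category, "LANG"):
        if variables.get(name):
            return variables[name]
    return "C"  # assumption: stand-in for the implementation-defined default


def unset_and_refresh(variables, name):
    """Model `unset name`: return {category: locale} for each category
    whose setting has to be re-applied afterwards."""
    remaining = {k: v for k, v in variables.items() if k != name}
    if name in ("LC_ALL", "LANG"):
        affected = POSIX_CATEGORIES      # anything may have depended on it
    elif name in POSIX_CATEGORIES:
        affected = (name,)               # only this category changes
    else:
        affected = ()                    # not a locale variable
    return {category: lookup(remaining, category) for category in affected}


settings = {"LC_ALL": "C", "LC_CTYPE": "en_AU.UTF-8", "LANG": "th_TH.UTF-8"}
print(unset_and_refresh(settings, "LC_ALL"))
```

Treating LANG like LC_ALL here reflects the view that a category which only got its value from LANG should follow it when LANG goes away.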
| > Further does it make any difference if these vars are being set in
| > the shell, but not exported?
| I'd argue that the user wants to change the shell's behavior.
I'd agree.
| > Also, please consider what built-in utilities are supposed to do
| > with all of this. Do those just use the shell's locale settings?
|
| Yes, they are builtins and documented as such.
They are, but they're still supposed to behave the same way as the non-builtin version, at least to the extent that an external version can operate given that the external version is unable to push changes back into the shell environment. But ignoring those, the two should operate the same way.
I'm not that dogmatic. If printf, for instance, is a builtin, it should operate using the shell's execution environment, not just pay attention to exported variables. This isn't just an issue with LC_*, the bash printf pays attention to the shell's TZ variable also.
so if I do
    unset LC_ALL    # do not want it, and certainly not exported
    unset LC_CTYPE
    LC_CTYPE=en_AU.UTF-8
    printf ....
that printf should ignore the LC_CTYPE setting, which hasn't been exported to it.
Nope, bash doesn't do it that way, and I'd argue that users expect the bash behavior.
As above, I don't really think limiting the way sh code can work to be more limited than what code written in other languages can do is what users really want, so I don't think a rule like "if LC_ALL is set, even if not exported, it overrides LC_everything_else (even if those are exported)" for the shell itself is appropriate, and if LC_ALL is not exported, certainly not for builtin commands.
OK, we're going to disagree here.
| > If the builtins should act just as if they were external commands,
| > and given a major purpose of being builtin is to avoid forking
| > (and some operations of builtins cannot work if they do fork)
| > then how is the shell intended to save and restore its locale
| > environment so the builtin can set its own?
|
| I say you don't bother. Users expect the variables they set in a shell
| session to affect that shell session.
You're saying that builtins should simply use shell variables, ignoring whether they're exported or not, and so act differently than an external version of the command would act?
Yes.
| I'd say that's up to the implementation. (And are you really saying that
| these hypothetical extra digits affect `break's treatment of its
| argument as a "positive decimal integer?")
break's treatment of the value, no, of course not. Whether break recognises the values as digits, then yes, though not what I am "saying", what I am asking. If we're implementing the locales properly (which the NetBSD shell certainly does not, even with my current uncommitted changes) then it should recognise them. If I write
    break ๒
in a bash script with LANG=th set in the environment (and no other locale settings) is bash going to interpret that correctly?
It depends on what libc does. Bash just uses strtoimax/strtol as appropriate.
The Thai digits are 0E50 .. 0E59 (which have all the properties of regular Arabic numerals that we're all used to), from the NetBSD shell (bash can't execute this one, yet anyway):
Does libc understand those as digits? Does strtoimax() use isdigit() and understand how to convert those to a numeric value?
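The two questions are separate steps: recognizing a character as a digit, and converting it to a value. A small sketch of both, using Python's Unicode database rather than libc, so it says nothing about what any strtoimax() actually does; `parse_decimal` is a hypothetical helper, and its two modes only contrast an ASCII-only digit test with one driven by the Unicode decimal-digit property.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def parse_decimal(text, ascii_only):
    """Return the integer value of `text`, or None if any character
    is not accepted as a decimal digit in the chosen mode."""
    if not text:
        return None
    value = 0
    for ch in text:
        if ascii_only:
            digit = ASCII_DIGITS.find(ch)        # -1 unless ch is 0-9
        else:
            digit = unicodedata.decimal(ch, -1)  # Unicode digit value, -1 if none
        if digit < 0:
            return None
        value = value * 10 + digit
    return value


thai_two = "\u0e52"  # THAI DIGIT TWO, from the 0E50 .. 0E59 range
print(parse_decimal(thai_two, ascii_only=True))
print(parse_decimal(thai_two, ascii_only=False))
```

A shell that hands the argument to the C library gets whichever of these behaviors the library implements, which is why the answer depends on libc rather than on the shell.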
As an additional example of where locales can affect results, in Thai, the digits collate after the alphabetic chars, not before as they do in latin character sets (including ASCII) and there is no upper/lower case distinction for alphas, no concept of character case at all.
So pathname expansion should use strcoll().
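As a rough illustration of sorting names through strcoll(), here is a sketch using Python's wrapper for the C function. `sort_names` is a hypothetical helper, and only the "C" locale is used because it is the one locale guaranteed to exist; with a Thai locale installed and selected for LC_COLLATE, the same call would be expected to order digits and letters as described above.

```python
import functools
import locale


def sort_names(names):
    """Sort names according to the current LC_COLLATE setting via strcoll()."""
    return sorted(names, key=functools.cmp_to_key(locale.strcoll))


# Pin the collation category so the result does not depend on the environment.
locale.setlocale(locale.LC_COLLATE, "C")
ordered = sort_names(["b", "A", "1", "a"])
print(ordered)
```

In the C locale the digit sorts first, then the uppercase letter, then the lowercase ones; the ordering comes entirely from the collation category, not from the sorting code.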
| See above. It's easy to explain to a user that setting LC_ALL overrides
| everything else.
Yes, it is easy to explain that, but is it as easy to justify, given that in other languages, it is the most recent set that works, as that's how setlocale(3) behaves, which is the underlying interface into the locale system?
There are examples either way; it all depends on what you mean by "other languages." python lets you set LC_ALL and override all the locale categories.
How do you justify that shell code should behave differently?
Why do I have to? Why is it not sufficient that the shell does what it does for the benefit of its users? I don't think that the highest priority is that the shell should do things like other programs; if you think so, we'll disagree on that as well.
One might argue that the only time LC_ALL is intended to override the other settings, is when C code (or equiv) does setlocale(LC_ALL, ""); which is when the locale system pays attention to what is in the environment. That's generally a startup time operation only.
OK.
ps: one more obscure question about how bash handles locales. POSIX says that if the LC_MESSAGES category is changed, after message catalogs are opened, the change to the env var doesn't necessarily affect those already open message catalogs. Assuming bash has had a need to run strerror() (or similar) and so has the <errno.h> translations catalog open, and the user then sets LC_MESSAGES because they got English versions of the error msg, and they want French, does bash handle that somehow? If so, how?
It relies on the C library. For instance, using glibc on RHEL9, given this script:
    $ cat x12
    LC_MESSAGES=de_DE.UTF-8
    . nosuchfile
    LC_MESSAGES=fr_FR.UTF-8
    . nosuchfile
you get
    $ ./bash ./x12
    ./x12: line 2: nosuchfile: Datei oder Verzeichnis nicht gefunden
    ./x12: line 4: nosuchfile: Aucun fichier ou dossier de ce type
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/