Date:        Thu, 27 Mar 2025 17:22:03 -0400
From:        Chet Ramey <chet.ra...@case.edu>
Message-ID:  <6da17a73-2aac-4fa5-9fa7-5bfff087d...@case.edu>

| The shell should assume that setting a shell variable means the
| user wants to modify the shell's locale settings.

Yes, of course, the question is exactly what the user wants &/or
what the shell should impose on the user.

| > One answer would be "nothing"
| A bad choice.

I agree, though that would make the shell operate the way just about
all other programs behave. For example, if one does

    ENVIRON["LC_NUMERIC"] = "whatever"

in awk, while that should appear in the environment of something awk
runs for a system() operation, it doesn't affect the way awk works, or
its numeric input or output, in any way at all.

| The shell does quite a lot of things that are different than any other
| application, including allowing the user to change locale environment
| variables.

Yes, but it is not entirely alone in that. awk/make/env/... allow the
environment to be altered, but none of them notice that it is a locale
environment variable that is being altered, and adapt their own
behaviour to match, or none that I'm aware of, so having shells do the
same would not necessarily be all that remarkable, even if it is
perhaps sub-optimal.

| That doesn't have to be all the shell does. The precedence hierarchy is
| well-understood; there's nothing stopping the shell from implementing it:

No there's not, but I'm not sure that's the ideal result. For example,
if a script does

    LC_CTYPE=C

(or something similar) should it break if the user happened to have set
LC_ALL=something-different in the environment? If any other program
does

    setlocale(LC_CTYPE, "C");

that has effect for any future operations, regardless of what might
have been in the environment. Why should a shell script be different?

| noting that LC_ALL is set as a shell variable and making the right call
| to setlocale() to make sure it overrides LC_CTYPE. I'd argue that having
| just set LC_ALL, this is what the user expects here.

That's something like what I tried in my first attempt to implement
this, but it turns out not to really work very well. Certainly if the
user actually sets LC_ALL that is the effect it should have, but if
LC_ALL was set sometime much earlier (perhaps weeks earlier in an
interactive shell) what do you believe is the intent if that user later
sets LC_CTYPE (or any of them) - or if some script that is run does
that?

I know setting LC_ALL is something of a sledge hammer operation, which
isn't often appropriate, but it is also a fairly common example that
people copy (it avoids needing to work out which specific category is
affecting some operation or other).

| You're assuming a certain behavior and going on from there. The shell
| doesn't have to do it that way.

Again, no it doesn't, but it seemed to me when I tried it, that this
way gives the most desirable outcome.

| I'd argue that the shell should modify the locale categories that affect
| its behavior.

How do we know which they are? That is, locale settings can affect
libc operations in some cases, and if we're writing portable code
(which bash at least attempts to do, the NetBSD shell a little less so)
can you really be sure that some libc function that is being used won't
be affected by a locale setting that you've never heard of?

That's no issue at all if the shell just does

    setlocale(LC_ALL, "");

as part of its init sequence (or nothing at all related to locales, and
gets C everywhere) but if you start modifying some of them, but not
others, how well will that really work?

| That's a tricky business, no doubt, but it bounds the
| effects (or you could just pay attention to all the categories that POSIX
| defines).

That would certainly limit things, and is what my current
implementation does - if I were confident enough, I could drop the ones
(from being explicitly operated upon by the shell, not from being
set/exported like any other var) that I have no reason to believe the
shell itself ever cares about (like LC_MONETARY for example) but that
is a bit of a gamble.

| Plus there's nothing in POSIX that I can see that allows locale
| definitions to add additional categories.

Lawrence already supplied the references for that (and my thanks for
saving me from needing to do that) and also showed that implementations
do actually create more of them (most of which probably are irrelevant
to the workings of the shell, but I can't be certain, and I most
certainly don't know that of other, unknown to me, extra locale
categories).

| Note that one of the LC_ variables is being unset and act appropriately?

Yes, that's more or less what my current (uncommitted) code does. But
it is all a bit of a mess, and not easy to get right.

| Since LC_ALL is being unset, you can go through all the locale categories
| you know about and set them appropriately. If it's one of the other LC_
| variables being unset, you can just change that one.

Yes, that's what I am doing, is that what bash does?

On the other hand, if I recall correctly (and I might not, the actual
code for this was from late last year) if LANG gets unset, then nothing
gets updated, even categories that only got a value from the LANG
setting. I think there was a reason for that, but just now, I don't
remember it.

| > Further does it make any difference if these vars are being set in
| > the shell, but not exported?
| I'd argue that the user wants to change the shell's behavior.

I'd agree.

| > Also, please consider what built-in utilities are supposed to do
| > with all of this. Do those just use the shell's locale settings?
|
| Yes, they are builtins and documented as such.

They are, but they're still supposed to behave the same way as the
non-builtin version, at least to the extent that an external version
can operate given that the external version is unable to push changes
back into the shell environment. But ignoring those, the two should
operate the same way. So if I do

    unset LC_ALL    # do not want it, and certainly not exported
    unset LC_CTYPE
    LC_CTYPE=en_AU.UTF-8
    printf ....

that printf should ignore the LC_CTYPE setting, which hasn't been
exported to it. Just as

    ( exec printf .... )

would do. That printf should be doing something like
setlocale(LC_ALL, "") and so should the built-in one, which should
result in whatever is in the environment (which we know here does not
include LC_ALL or LC_CTYPE) as the locale settings (maybe LANG is set
in the environment, and printf's LC_CTYPE will come from that).

| Yes, if the user sets it up this way, that is what will happen, but it's
| unlikely.

It is unlikely to the extent that few scripts bother to do anything
with the environment at all (and for that reason, many are broken,
anything which uses a range in a [] in a pattern should usually care
about what the environment is, as those don't always work as expected
otherwise).

| The value of LC_ALL should be used, since its precedence is higher.

Yes, I'd agree, but that means either implementing it that way for the
whole shell, or changing the locale (there is, in general, just one,
other than for code using the kind of sporadic _l versions of some libc
functions) of the shell before running printf, and then restoring it
after.

As above, I don't really think that restricting what sh code can do,
beyond what code written in other languages can do, is what users
really want. So I don't think a rule like "if LC_ALL is set, even if
not exported, it overrides LC_everything_else (even if those are
exported)" is appropriate for the shell itself, and if LC_ALL is not
exported, certainly not for builtin commands.
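
To make the exported/unexported distinction in the printf example concrete, here is a minimal sketch that reports what an external command would actually inherit. The helper name child_sees is hypothetical, invented for this illustration, and the sketch uses the C locale rather than en_AU.UTF-8 only so that it runs on systems without that locale installed.

```shell
# child_sees NAME
# Hypothetical helper: print NAME=value as an external command would
# inherit it, or a marker if NAME is not in the environment at all.
child_sees() {
    env | grep "^$1=" || echo "<not in environment>"
}

unset LC_ALL            # do not want it, and certainly not exported
unset LC_CTYPE          # also drops any export attribute it had
LC_CTYPE=C              # a shell variable only
child_sees LC_CTYPE     # prints: <not in environment>

export LC_CTYPE         # now part of the environment
child_sees LC_CTYPE     # prints: LC_CTYPE=C
```

An external printf doing setlocale(LC_ALL, "") can only ever act on what child_sees reports, which is the behaviour the built-in version is being compared against here.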

| > Including if I explicitly do:
| > LC_NUMERIC=xxx printf ...
| > (with LC_ALL still set in the environment).
|
| Nope. Even the external version of the command would have LC_ALL
| override the temporary assignment to LC_NUMERIC.

That's obvious, there's no "even" there, the question is how the
internal (built-in) printf behaves, actual external commands are
trivial to understand by comparison. A better example for this though
is where LC_ALL is set in the shell but not exported.

| > If the builtins should act just as if they were external commands,
| > and given a major purpose of being builtin is to avoid forking
| > (and some operations of builtins cannot work if they do fork)
| > then how is the shell intended to save and restore its locale
| > environment so the builtin can set its own?
|
| I say you don't bother. Users expect the variables they set in a shell
| session to affect that shell session.

You're saying that builtins should simply use shell variables, ignoring
whether they're exported or not, and so act differently than an
external version of the command would act?

| I'd say that's up to the implementation. (And are you really saying that
| these hypothetical extra digits affect `break's treatment of its
| argument as a "positive decimal integer?")

break's treatment of the value, no, of course not. Whether break
recognises the values as digits, then yes, though that is not what I am
"saying", it is what I am asking. If we're implementing the locales
properly (which the NetBSD shell certainly does not, even with my
current uncommitted changes) then it should recognise them.

If I write

    break ๒

in a bash script with LANG=th set in the environment (and no other
locale settings) is bash going to interpret that correctly? That's one
of the "hypothetical extra digits" that really do exist (it is
$'\u0E52' if the character isn't represented properly in someone's
mailer). The Thai digits are 0E50 .. 0E59 (which have all the
properties of regular Arabic numerals that we're all used to), from the
NetBSD shell (bash can't execute this one, yet anyway):

    echo {$'\u0e50'..$'\u0e59'}
    ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙

which mean the same as 0..9. (For anyone about to race out to check
this form of brace expansion in the NetBSD shell, or brace expansion
there at all, don't. The code for that is also not yet committed, and
I'm not sure it ever will be).

As an additional example of where locales can affect results, in Thai,
the digits collate after the alphabetic chars, not before as they do in
Latin character sets (including ASCII) and there is no upper/lower case
distinction for alphas, no concept of character case at all.

| See above. It's easy to explain to a user that setting LC_ALL overrides
| everything else.

Yes, it is easy to explain that, but is it as easy to justify, given
that in other languages, it is the most recent setting that takes
effect, as that's how setlocale(3) behaves, which is the underlying
interface into the locale system? How do you justify that shell code
should behave differently?

One might argue that the only time LC_ALL is intended to override the
other settings, is when C code (or equiv) does

    setlocale(LC_ALL, "");

which is when the locale system pays attention to what is in the
environment. That's generally a startup time operation only.

And also please note, in this, I am trying to work out what should be
done, so that if the locale stuff I currently have implemented does get
committed (and those changes are more likely than the braceexpand
stuff, as they are in response to a bug report that was submitted) then
I can commit (and document) what is generally agreed to be the correct
behaviour. I am asking questions (while also giving my current opinion
on what some of the answers might be).

kre

ps: one more obscure question about how bash handles locales.
POSIX says that if the LC_MESSAGES category is changed, after message
catalogs are opened, the change to the env var doesn't necessarily
affect those already open message catalogs. Assuming bash has had a
need to run strerror() (or similar) and so has the <errno.h>
translations catalog open, and the user then sets LC_MESSAGES because
they got English versions of the error msg, and they want French, does
bash handle that somehow? If so, how?
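
Returning to the main question in the body of this message, the two candidate rules can be put side by side in a small sketch. Both function names are hypothetical, neither is taken from bash or the NetBSD shell, and the fallback to "C" when nothing is set is an assumption (POSIX leaves that default to the implementation). LANG is deliberately left out of the second rule, since how it should interact with later assignments is one of the open questions.

```shell
# env_precedence LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# The rule setlocale(category, "") applies to the environment:
# LC_ALL first, then the category's own variable, then LANG.
env_precedence() {
    if   [ -n "$1" ]; then printf '%s\n' "$1"
    elif [ -n "$2" ]; then printf '%s\n' "$2"
    elif [ -n "$3" ]; then printf '%s\n' "$3"
    else printf '%s\n' C        # assumption: default locale is C
    fi
}

# last_assignment_wins CATEGORY [NAME=VALUE ...]
# The rule a sequence of setlocale(3) calls gives: whichever of LC_ALL
# or CATEGORY was assigned most recently decides the result.
last_assignment_wins() {
    _cat=$1; shift
    _cur=C                      # assumption: the shell starts in C
    for _a in "$@"; do
        case $_a in
            LC_ALL=*|"$_cat"=*) _cur=${_a#*=} ;;
        esac
    done
    printf '%s\n' "$_cur"
}

# LC_ALL set long ago, then a script sets LC_CTYPE=C:
env_precedence th_TH.UTF-8 C ""                               # th_TH.UTF-8
last_assignment_wins LC_CTYPE LC_ALL=th_TH.UTF-8 LC_CTYPE=C   # C
```

The two rules disagree exactly in the case raised above, where LC_ALL was set much earlier and some later code sets a single category.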