Re: Suggestion of wording for portion of man page

Greg Wooledge Fri, 17 Jan 2025 04:56:40 -0800

On Thu, Jan 16, 2025 at 21:16:47 -0800, Wiley Young wrote:
> Well, however an amendment might occur, "Each character in the expanded
> value of parameter is tested against pattern" to my ear reads as referring
> to alphabetic characters, as per sentence one, however they may appear in
> binary.


Characters don't "appear in binary".  Characters are encoded as one or
more bytes based on your locale.

hobbit:~$ var='garçon'
hobbit:~$ echo "${var^^}"
GARÇON

Here, the lowercase ç (c with cedilla) is capitalized to Ç (C with cedilla).
This is because ç is considered an alphabetic character in my locale,
and therefore is eligible for capitalization if it matches the pattern.

> But sentence six is an emphatic concern if you ask me, for how "character"
> reads as "alphabetic character". Also how sentences five and six can be
> read as contradictory.

I really don't understand how you're reading this text.  Let's start by
actually SHOWING the text in the email, so everyone can follow along.

  1: This expansion modifies the case of alphabetic characters in parameter.
  2: The pattern is expanded to produce a pattern just as in pathname
     expansion.
  3: Each character in the expanded value of parameter is tested against
     pattern, and, if it matches the pattern, its case is converted.
  4: The pattern should not attempt to match more than one character.
  5: The ^ operator converts lowercase letters matching pattern to uppercase;
     the , operator converts matching uppercase letters to lowercase.
  6: The ^^ and ,, expansions convert each matched character in the
     expanded value; the ^ and , expansions match and convert only
     the first character in the expanded value.

Sentence 1 gives a general sense of what the expansion does, and it tells
us that the only characters that will be modified are alphabetic ones.

Sentence 2 tells us that the pattern works like a glob.

Sentence 3 tells us that bash iterates over "each character" (i.e. ALL
the characters) in the parameter, and tests each one individually
against the pattern.

Sentence 4 tells us that the pattern should only match a single character.

Sentence 5 tells us the difference between the ^ and , operators.

Sentence 6 tells us the difference between ^ and ^^ and the difference
between , and ,, operators.
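
Here's a quick sketch of sentences 3 through 6 in action.  This is my own
illustration, not text from the man page; the variable names and values
are arbitrary, and it assumes bash 4.0 or newer.

```shell
#!/usr/bin/env bash
# Arbitrary sample values; all ASCII, so the locale doesn't matter here.
var='banana'
VAR='BANANA'

first_up=${var^}       # ^  : only the first character is tested  -> Banana
all_up=${var^^}        # ^^ : every character is tested           -> BANANA
only_a=${var^^a}       # pattern "a": only matching characters    -> bAnAnA
set_ab=${var^^[ab]}    # glob-style pattern, one character each   -> BAnAnA
first_low=${VAR,}      # ,  : lowercase the first character only  -> bANANA
all_low=${VAR,,}       # ,, : lowercase every character           -> banana

printf '%s\n' "$first_up" "$all_up" "$only_a" "$set_ab" "$first_low" "$all_low"
```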

I think the man page is reasonably clear, albeit a bit awkward.

> So ${foo^x} is a PE dedicated to case modification that tests the first
> character of the string that $foo expands into, whatever type of character
> that might be.

${foo^x} will capitalize (because ^ not ,) the first character
(because ^ not ^^) of the parameter foo but only if that character
(1) is alphabetic, and (2) matches the pattern "x".
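
A minimal sketch of those two conditions, using made-up values for foo
and bar (assuming bash 4.0 or newer):

```shell
#!/usr/bin/env bash
# Hypothetical sample values, chosen only to show match vs. no match.
foo='xylophone'
bar='apple'

r1=${foo^x}     # first char "x" is alphabetic and matches "x"  -> Xylophone
r2=${bar^x}     # first char "a" is alphabetic, but no match    -> apple
r3=${bar^[xa]}  # first char "a" matches the pattern [xa]       -> Apple

printf '%s\n' "$r1" "$r2" "$r3"
```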

> But why would it not alter the first alphabetic character if
> the first character in the string is not alphabetic?

Because that's not how it works.  It only tests the first character of
the expanded parameter.  It says so in sentence 6.
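
For instance, with a value whose first character is not a letter (the
values here are made up for illustration, and bash 4.0 or newer is
assumed):

```shell
#!/usr/bin/env bash
# Hypothetical values whose first character is not alphabetic.
foo='1abc'
dash='-x-x'

first=${foo^}      # only "1" is tested; it has no uppercase form -> 1abc
all=${foo^^}       # every character is tested                    -> 1ABC
first_x=${dash^x}  # only "-" is tested; it doesn't match "x"     -> -x-x
all_x=${dash^^x}   # every character is tested against "x"        -> -X-X

printf '%s\n' "$first" "$all" "$first_x" "$all_x"
```

The ^ form never goes looking for the first alphabetic character.  It
looks at position zero and nowhere else.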

> That the first
> character in a string would be an alphabetic one is an assumption, one that
> seems to be in use.

No.  Only alphabetic characters are eligible for modification.  It says
so in sentence 1.

> Neither assumption - that the first character would be
> an alphabetic one, or that the algorithm would look for the first
> alphabetic character - is clearly stated.

Sentence 1 says that it must be alphabetic.  Sentence 6 says that only
the first character is looked at (for , and ^).

> "First character in the expanded value," means something like "the group of
> bytes that prints at place zero as counted from the beginning of the entire
> byte string..." which is the result of the expansion of 'parameter.'

Since 1990 or so, "character" has stopped meaning "byte".  (That used
to be how things were defined in C in 1988 and earlier.)

Now, a "character" is an abstraction.  It refers to a single value from
any kind of human writing system.  In English, characters include letters,
numerals, punctuation marks, and so on.

In order to store characters in a computer, each character must be
represented by a number.  The mapping from characters to numbers (or
vice versa) is called a character encoding.  Different locales use
different character encoding rules.

If your locale is "C" or "POSIX", then that character encoding rule is
"each byte is one character" (1988 rules).

If your locale is anything else, then it includes a character encoding
component, e.g. "utf8".  This encoding defines how a string may be
divided into characters.  Each utf8 character is 1 to 4 bytes long under
the current standard (the original design allowed up to 6).

So, what you said up above is kind of correct, but it's apparent that
you are really struggling to understand the modern definition of
"character".  I think you're beginning to get it, though.

In most modern scripting languages, when a string is read from an
external source (e.g. a file or an environment variable), it gets
converted into an internal representation that is easy to work with
(some fixed number of bytes per character).  Then, when the string is
written back to any outside source, it gets converted back, from the
internal representation to whatever character encoding rule is in play.

Bash, however, doesn't do that.  It stores everything in raw form, and
each tool (including internal ones like parameter expansion) is
responsible for applying the character encoding rules to figure out
where characters begin and end.  (This is *one* of the several reasons
bash is so slow when processing large amounts of data.)

The end result is the same, however.  If I have a bash variable that
holds the string "étude", then that string is 5 characters long.
It doesn't matter whether the character encoding is utf8 (where the é
is encoded using 2 bytes), or latin1 (where the é is encoded in 1 byte).
Bash is required to handle it correctly in either locale.
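
One way to see the difference between characters and bytes is to ask for
the string's length under two locales.  This is only a sketch: the string
is written as explicit UTF-8 bytes so the script itself stays ASCII, and
the locale name C.UTF-8 is an assumption (substitute whatever UTF-8
locale your system actually has).

```shell
#!/usr/bin/env bash
# "étude" spelled out as UTF-8 bytes: é is \xc3\xa9, then "tude".
s=$'\xc3\xa9tude'

# In the C locale each byte counts as one character, so this is 6.
bytes=$(LC_ALL=C; printf '%s' "${#s}")

# In a UTF-8 locale the same bytes are 5 characters.  C.UTF-8 is assumed
# to be installed; if it isn't, bash warns and keeps the previous locale.
chars=$(LC_ALL=C.UTF-8; printf '%s' "${#s}")

printf 'bytes=%s chars=%s\n' "$bytes" "$chars"
```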

As far as case modification is concerned, that é is one character.
It is alphabetic (according to the rules of my locale).  Therefore,
if the case modification operator is ^ or , then the é character gets
matched against the pattern to determine whether it should be modified,
and no other characters are checked.  If the operator is ^^ or ,, then
all 5 characters are checked.
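
A rough sketch of how the locale changes what counts as "the first
character" for case modification.  The string is written as explicit
UTF-8 bytes, and the locale name C.UTF-8 is an assumption; use whichever
UTF-8 locale is installed on your system.

```shell
#!/usr/bin/env bash
# "étude" as UTF-8 bytes: \xc3\xa9 is é.
s=$'\xc3\xa9tude'

# UTF-8 locale (assumed to exist): é is one alphabetic character, so ^
# should give Étude and ^^ should give ÉTUDE.
utf_first=$(LC_ALL=C.UTF-8; printf '%s' "${s^}")
utf_all=$(LC_ALL=C.UTF-8; printf '%s' "${s^^}")

# C locale: the two bytes of é are two non-alphabetic characters, so they
# are left alone and only the ASCII letters change.
c_all=$(LC_ALL=C; printf '%s' "${s^^}")

printf '%s\n' "$utf_first" "$utf_all" "$c_all"
```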

It doesn't matter whether I produced the string étude by reading it
from a file, or by typing Compose + e + ' when I wrote the script,
or by using the \u00e9 sequence in the script and asking bash to expand
it for me.  The end result is the same 5-character string, where one
of those characters can't be represented in the C locale, and may be
encoded with 1 byte in Latin-1 or 2 bytes in UTF-8.
