Eduardo A. Bustamante López wrote:
This definition
(http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_230)
states:
3.230 Name
In the shell command language, a word consisting solely of underscores, digits, and alphabetics from the portable character set. The first character of a name is not a digit.
---- (1) -- It appears you /accidentally/ left out part of the text under section 3.230. The full text:
3.230 Name
In the shell command language, a word consisting solely of underscores, digits, and alphabetics from the portable character set. The first character of a name is not a digit.
Note: The Portable Character Set is defined in detail in Portable Character Set [1].
[1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_01
3.231 ...[next section]
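As a rough illustration of the 3.230 wording, here is a minimal sketch of a check for such a name. The function name is_posix_name is made up for this example, and it assumes that "alphabetics from the portable character set" means only the letters A-Z and a-z, which is exactly the point under discussion. The characters are listed explicitly rather than via ranges or [:alpha:], so the answer should not depend on the current locale.

```shell
# Hypothetical helper, not part of bash or POSIX: succeeds when "$1" is a Name
# per XBD 3.230, ASSUMING "alphabetics" means only the letters A-Z and a-z.
is_posix_name() {
    case $1 in
        '' | [0123456789]*)
            return 1 ;;   # empty, or starts with a digit
        *[!_0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]*)
            return 1 ;;   # contains a character outside the listed set
    esac
    return 0
}

# 'a\330b' is "aØb" with Ø as the single ISO-8859-1 byte 0xD8 (octal 330).
for candidate in foo _bar9 9lives "$(printf 'a\330b')"; do
    if is_posix_name "$candidate"; then
        echo "name"
    else
        echo "not a name"
    fi
done
```

Under that assumption "aØb" is rejected; a reading of 3.230 that admits locale-defined alphabetics would need a different character list.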
---- Thank-you. This slightly clarifies matters as it only requires the POSIX source. At the location pointed to by the hyper-link for "Portable Character Set" under section 6.1 sentences 2-4, it states:
Each supported locale shall include the portable character set, which is the set of symbolic names for characters in Portable Character Set. This is used to describe characters within the text of IEEE Std 1003.1-2001. The first eight entries in Portable Character Set are defined in the ISO/IEC 6429:1992 standard and the rest of the characters are defined in the ISO/IEC 10646-1:2000 standard.
---- FWIW, in full disclosure: the last dotted paragraph before the final sentence of section 6.1 requires that each alphabetic character fit within 1 byte, i.e. only characters in what is commonly called the "Extended ASCII character set" (e.g. ISO-8859-1) seem to be required. Note that the character 'Ø' is 1 byte in ISO-8859-1. So although the quoted section above uses, basically, the Unicode table for its "symbolic names", it doesn't prescribe a specific encoding. I.e., while the reference is to ISO-10646 (Unicode), no specific encoding is required.
ISO-8859-1 encodes Unicode values 0-255 (the first 256 code points) with 1 byte each, which satisfies the 1-byte POSIX constraint. It cannot encode Unicode values >= 256, which makes POSIX's reference to ISO-10646 somewhat specious, as only the first 256 values can be encoded in 1 byte (that I am aware of). Nevertheless, the symbolic name "LATIN CAPITAL LETTER O WITH STROKE (o slash)", or 'U+00D8', is classified as an alphabetic, which is a subset of the "alphanumeric" requirement of POSIX.
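To make the 1-byte point concrete, the following sketch counts the bytes U+00D8 occupies in ISO-8859-1 and in UTF-8. The helper name byte_count and the variable names are invented for this example; the byte values themselves (0xD8, and 0xC3 0x98) are the standard encodings, written as octal escapes so the script does not depend on the terminal's encoding.

```shell
# Hypothetical helper: number of bytes in "$1" (wc -c counts bytes, not characters).
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# U+00D8 written out as raw bytes in each encoding.
latin1_oslash=$(printf '\330')       # ISO-8859-1: 0xD8
utf8_oslash=$(printf '\303\230')     # UTF-8: 0xC3 0x98

echo "ISO-8859-1: $(byte_count "$latin1_oslash") byte(s)"
echo "UTF-8:      $(byte_count "$utf8_oslash") byte(s)"
```

So 'Ø' meets a 1-byte limit only under a single-byte encoding such as ISO-8859-1; in UTF-8 the same character takes two bytes.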
Note under section 9.3.5 "RE Bracket Expression", subsection 6:
The following character class expressions shall be supported in all locales:
[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]
In addition, character class expressions of the form:
[:name:]
are recognized in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.
Note that "aØb" is classified as fully "alphabetic" by bash's character-class matching facility -- whether in UTF-8 or ISO-8859-1:
echo $LC_CTYPE
en_US.ISO-8859-1 LC_CTYPE=en_US.UTF-8 ...
declare -xg a=$(echo -n $'\x61\xd8\x62')
declare -xg b=${a}c
[[ $a =~ ^[[:alpha:]]+$ ]] && echo alpha
alpha
[[ $a =~ ^[[:alnum:]]+$ ]] && echo alnum
alnum
[[ $b =~ ^[[:alpha:]]+$ ]] && echo alpha
alpha
[[ $b =~ ^[[:alnum:]]+$ ]] && echo alnum
alnum
---- Notice that bash classifies the string "aØb" as both alphabetic and alphanumeric. I.e. bash, itself, claims that "aØb" is a valid identifier. Also note that it accepts "aØb" as a var and as an environment var when used indirectly:
declare -xg $a='a"slash-O"b'
declare -xg $b='ab"slash-O"c'
env|/usr/bin/grep -P '^[ab]...?'|hexdump -C
00000000  61 d8 62 63 3d 61 62 22  73 6c 61 73 68 2d 4f 22  |a.bc=ab"slash-O"|
00000010  63 0a 61 d8 62 3d 61 22  73 6c 61 73 68 2d 4f 22  |c.a.b=a"slash-O"|
00000020  62 0a 61 3d 61 d8 62 0a  62 3d 61 d8 62 63 0a     |b.a=a.b.b=a.bc.|
0000002f
=== ...
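Since the character-class results above depend on LC_CTYPE, one way to see how much of the classification comes from the locale is to repeat the match with the locale set explicitly per command. This is only a sketch: the helper name all_alpha_in is invented, grep -E stands in for bash's =~ (both take POSIX bracket classes, but they are not guaranteed to agree in every case), and en_US.ISO-8859-1 is assumed to be installed. If it is not, grep typically falls back to the C locale.

```shell
# Hypothetical helper: succeeds when "$2" is entirely [[:alpha:]] under locale "$1".
all_alpha_in() {
    printf '%s\n' "$2" | LC_ALL=$1 grep -Eq '^[[:alpha:]]+$'
}

# "aØb" with Ø as the single ISO-8859-1 byte 0xD8 (octal 330).
word=$(printf 'a\330b')

# en_US.ISO-8859-1 is an assumption; substitute a locale listed by `locale -a`.
for loc in C en_US.ISO-8859-1; do
    if all_alpha_in "$loc" "$word"; then
        echo "$loc: alphabetic"
    else
        echo "$loc: not alphabetic"
    fi
done
```

In the C locale the byte 0xD8 is not alphabetic, so whatever "alpha" answer comes back elsewhere is a property of the locale's LC_CTYPE definition, not of the portable character set.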
So no, it does not mandate arbitrary unicode alphabetics. Only the ones listed there.
---- Thank-you. This better makes the case, as it only refers to the POSIX reference pages. But it seems that it boils down to the allowed definition of environment variables (http://pubs.opengroup.org/onlinepubs/9699919799/):
2.5.3 Shell Variables
Variables shall be initialized from the environment (as defined by XBD Environment Variables and the exec function in the System Interfaces volume of POSIX.1-2008) and can be given new values with variable assignment commands.
The XBD interface is a description of API facilities for programs to use -- not an end-user-interface. In particular, it says: (under section 8.1) (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08)
Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2008 consist solely of uppercase letters, digits, and the <underscore> ( '_' ) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation;
applications shall tolerate the presence of such names.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (emphasis added)
I.e. bash, as an application, should tolerate (allow) the presence of user-defined names that don't fit the Portable Character Set definition, though bash itself shouldn't create such names if it is to stay compatible with the XBD API.
There are also multiple discussions that point out that UTF-8 is a valid encoding, since the portable chars are all defined as 1 byte and the bytes above that range are multibyte but not "state" dependent. (State dependent was described as a situation where you need to know what state a character decoder is in, when starting to decode a new character, in order to decode it; it doesn't refer to the fact that individual character entities take 1-4 bytes in std. Unicode.) It was also pointed out that UTF-8 is 8-bit clean, in that all binary values could be encoded in UTF-8 and then decoded to get the original, same text.
(p.s. the above is about 6 hours of internet research, so please excuse internal sequencing oddities...getting a bit brain-dead on researching this...;-)...).
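The "tolerate the presence of such names" requirement from XBD section 8.1 can be illustrated without involving a shell variable at all. The sketch below assumes the env and printenv utilities are available (printenv is common but not specified by POSIX), and the variable names odd_name and value are invented for the example. It shows only that the process environment can carry such a name; it says nothing about whether bash will accept it as a shell variable.

```shell
# A name containing the ISO-8859-1 byte for 'Ø' (0xD8, octal 330).
odd_name=$(printf 'a\330b')

# env places NAME=value in the environment of the command it runs;
# printenv then looks up that same name from inside the child process.
value=$(env "$odd_name=hello" printenv "$odd_name")
echo "child saw: $value"
```

Both utilities treat the name as an opaque byte string up to the first '=', which is the kind of tolerance the quoted sentence asks of applications.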
