Date: Fri, 18 Oct 2024 16:28:06 -0400
From: Chet Ramey <chet.ra...@case.edu>
Message-ID: <a0363635-e1bf-4c0d-951c-a74b73f00...@case.edu>

  | At the time (previous edition of the standard), POSIX defined whitespace
  | as "In the POSIX locale, white space consists of one or more <blank> (
  | <space> and <tab> characters), <newline>, <carriage-return>, <form-feed>,
  | and <vertical-tab> characters."

Yes, that was from XBD 3.142 (in Issue 7), labelled "White Space".

  | The word splitting section wasn't quite
  | as rigorous as the current version's, but it referenced this definition.

Actually, it didn't, which was one of that section's many problems.
What it said was:

    The term ``IFS white space'' is used to mean any sequence
    (zero or more instances) of white-space characters that are
    in the IFS value

There's no reference to anything in XBD, and the term it uses is
"white-space", not the "white space" that the definitions define.
And yes, that hyphen really makes a difference in things like this.

  | However, the conformance suite tests for this.

It has tested for what its developers thought the standard said,
rather than what it actually says, before, and probably will again.

  | The comment in locale_setblanks explains this: some systems, like macOS,
  | return true from isspace() for characters between 0x80 and 0xff even
  | though they introduce multibyte characters (every locale besides "C"
  | in macOS uses UTF-8 encoding).

I assume that the macOS people assume that if you're fetching
multi-byte characters you should be fetching the whole character
before testing what kind of object it is.

That's certainly what the new standard requires of processing IFS:
even though what it is splitting is just treated as bytes, deciding
what is IFS white space needs to use properly decoded characters from
IFS, not just treat it as a byte string. Then, when testing the field
being split (or the line read, in the case of the read builtin), if the
sequence of bytes at the current position matches a character in IFS,
then that's a match; if not, one byte gets removed, and try again
(that's processing the input as a byte sequence).

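To make that matching rule concrete, here is a minimal Python sketch. It
is not how bash or any other shell implements it; the function name
split_on_ifs is hypothetical, and as a simplifying assumption every IFS
character is treated like IFS white space (runs collapse and no empty
fields are produced). The only point is to show IFS being decoded into
whole characters while the input stays a byte sequence.

```python
def split_on_ifs(data: bytes, ifs: str, encoding: str = "utf-8") -> list[bytes]:
    """Hypothetical helper: split a byte string on the characters of IFS."""
    # IFS is handled as decoded characters; keep each character's full
    # byte encoding so a multibyte delimiter is matched as one unit.
    delims = [ch.encode(encoding) for ch in ifs]

    fields: list[bytes] = []
    current = bytearray()
    i = 0
    while i < len(data):
        # Does the byte sequence at the current position match an IFS character?
        match = next((d for d in delims if data.startswith(d, i)), None)
        if match is not None:
            # Simplification: delimiters only end a non-empty field.
            if current:
                fields.append(bytes(current))
                current = bytearray()
            i += len(match)
        else:
            # No match: consume exactly one byte of input and try again.
            current.append(data[i])
            i += 1
    if current:
        fields.append(bytes(current))
    return fields
```

With IFS set to a space plus "é", the input b"a\xc3\xa9b c" splits into
three fields, while a lone 0xc3 byte that is not followed by 0xa9 is just
data and stays inside its field.
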
And yes, that's still something of a mess, but (at least) when using
UTF-8 encoding it all ends up working in any case where it possibly
can. In other multi-byte locales, anything is possible.

kre