Re: IFS whitespace definition

Robert Elz Mon, 21 Oct 2024 07:15:49 -0700

    Date:        Fri, 18 Oct 2024 16:28:06 -0400
    From:        Chet Ramey <[email protected]>
    Message-ID:  <[email protected]>


  | At the time (previous edition of the standard), POSIX defined whitespace
  | as "In the POSIX locale, white space consists of one or more <blank> (
  | <space> and <tab> characters), <newline>, <carriage-return>, <form-feed>,
  | and <vertical-tab> characters."

Yes, that was from XBD 3.142 (in Issue 7), labelled "White Space".

  | The word splitting section wasn't quite
  | as rigorous as the current version's, but it referenced this definition.

Actually, it didn't, which was one of that section's many problems.

What it said was

        The term ``IFS white space'' is used to mean any sequence (zero or
        more instances) of white-space characters that are in the IFS value

There's no reference to anything in XBD, and the term it uses is
"white-space", not the "white space" that XBD actually defines.   And
yes, that hyphen really makes a difference in things like this.
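
To make the quoted sentence concrete, here is a minimal Python sketch of one
possible reading of it: take "white-space characters" to mean the six
characters in the Issue 7 XBD list quoted above, and keep those that occur
in IFS. That linkage is an assumption (the standard's text does not make
it, which is the point), and the names POSIX_WHITE_SPACE and
ifs_white_space are hypothetical.

```python
# Assumption: "white-space characters" means the POSIX-locale list quoted
# from XBD (Issue 7): space, tab, newline, carriage return, form feed,
# vertical tab.  The word splitting text itself does not say so.
POSIX_WHITE_SPACE = " \t\n\r\f\v"


def ifs_white_space(ifs: str) -> str:
    """Return the characters of IFS that count as IFS white space
    under the assumed reading, in the order they appear in IFS."""
    return "".join(ch for ch in ifs if ch in POSIX_WHITE_SPACE)


if __name__ == "__main__":
    # With IFS set to colon, space, vertical tab, the colon is an ordinary
    # delimiter and the other two are IFS white space.
    print(repr(ifs_white_space(": \v")))
```

Under a different reading of "white-space", the constant would change, and
so would the result for an IFS containing, say, a vertical tab.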

  | However, the conformance suite tests for this.

The conformance suite has, before now, tested for what its developers
thought the standard said rather than what it actually says, and it
probably will again.

  | The comment in locale_setblanks explains this: some systems, like macOS,
  | return true from isspace() for characters between 0x80 and 0xff even
  | though they introduce multibyte characters (every locale besides "C"
  | in macOS uses UTF-8 encoding).

I assume that the macOS people assume that if you're fetching multi-byte
characters you should be fetching the whole character before testing what
kind of object it is.
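
A small Python sketch of why the order matters (decode first, classify
second). It is an illustration only, not a reproduction of what macOS or
bash do: treating each byte as a Latin-1 character stands in for a
per-byte isspace() call, and both function names are hypothetical.

```python
def bytes_flagged_as_space(data: bytes) -> list[int]:
    """Classify each byte on its own, as though it were a complete
    character (Latin-1 is the stand-in assumption here)."""
    return [b for b in data if chr(b).isspace()]


def chars_flagged_as_space(data: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode whole characters first, then classify each character."""
    return [ch for ch in data.decode(encoding) if ch.isspace()]


if __name__ == "__main__":
    word = "à".encode("utf-8")  # two bytes: 0xc3 0xa0
    # Byte 0xa0 taken alone looks like NO-BREAK SPACE, so it is flagged.
    print([hex(b) for b in bytes_flagged_as_space(word)])
    # The decoded character is a letter, so nothing is flagged.
    print(chars_flagged_as_space(word))
```

The per-byte test flags the second byte of a letter as a space, while the
test on the decoded character flags nothing.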

That's certainly what the new standard requires of processing IFS - even
though what it is splitting is just treated as bytes, deciding what is
IFS white space needs to use properly decoded characters from IFS, not
just treat it as a byte string.   Then when testing the field being
split (or the line read, in the case of the read builtin), if the sequence
of bytes at the current position matches a character in IFS, then that's
a match; if not, one byte gets removed, and we try again (that's processing
the input as a byte sequence).   And yes, that's still something of a mess,
but (at least) when using UTF-8 encoding it all ends up working in any
case where it possibly can.   In other multi-byte locales, anything is
possible.
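
The matching loop described above can be sketched roughly as follows, in
Python. This is a simplified illustration under stated assumptions, not
the standard's full algorithm and not any shell's implementation: every
IFS character is treated as a plain delimiter (no collapsing of IFS white
space), UTF-8 is assumed as the encoding, and the names decode_ifs and
split_fields are hypothetical.

```python
def decode_ifs(ifs: bytes, encoding: str = "utf-8") -> list[bytes]:
    """Decode IFS into characters, then keep each character's bytes."""
    return [ch.encode(encoding) for ch in ifs.decode(encoding)]


def split_fields(data: bytes, ifs: bytes, encoding: str = "utf-8") -> list[bytes]:
    """Split data at IFS characters, scanning data as a byte sequence.

    Simplification: every IFS character ends a field; IFS white space
    is not collapsed here.
    """
    delims = decode_ifs(ifs, encoding)
    fields: list[bytes] = []
    current = bytearray()
    pos = 0
    while pos < len(data):
        # Do the bytes at the current position match a character in IFS?
        match = next((d for d in delims if data.startswith(d, pos)), None)
        if match is not None:
            fields.append(bytes(current))
            current.clear()
            pos += len(match)
        else:
            # No match: consume one byte and try again.
            current.append(data[pos])
            pos += 1
    fields.append(bytes(current))
    return fields


if __name__ == "__main__":
    # The data need not be valid in the encoding; only IFS is decoded.
    print(split_fields(b"a\xffb:c", b":"))
```

Only IFS is ever decoded; the data is never required to be valid text,
which is why a stray 0xff byte in the input just ends up inside a field.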

kre
