On Thu, Sep 19, 2024 at 09:07:06AM +0200, Jakub Jelinek wrote:
> space is ' ' '\t' '\n' '\r' '\f' '\v' in the C locale,
> blank is ' ' '\t'
> cntrl is a lot of chars but not ' '
> if we extend by the safe-ctype
> vspace '\r' '\n'
> nvspace ' ' '\t' '\f' '\v' '\0'
> Obviously, we shouldn't look at '\r' and '\n', those aren't trailing
> characters, those are line separators.
> 
> Would we need to consider all UTF-8 (or EBCDIC-UTF) control characters as
> cntrl?
> 0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>
> 000B..000C    ; Control # Cc   [2] <control-000B>..<control-000C>
> 000E..001F    ; Control # Cc  [18] <control-000E>..<control-001F>
> 007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>
> 00AD          ; Control # Cf       SOFT HYPHEN
> 061C          ; Control # Cf       ARABIC LETTER MARK
> 180E          ; Control # Cf       MONGOLIAN VOWEL SEPARATOR
> 200B          ; Control # Cf       ZERO WIDTH SPACE
> 200E..200F    ; Control # Cf   [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
> 2028          ; Control # Zl       LINE SEPARATOR
> 2029          ; Control # Zp       PARAGRAPH SEPARATOR
> 202A..202E    ; Control # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
> 2060..2064    ; Control # Cf   [5] WORD JOINER..INVISIBLE PLUS
> 2065          ; Control # Cn       <reserved-2065>
> 2066..206F    ; Control # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
> FEFF          ; Control # Cf       ZERO WIDTH NO-BREAK SPACE
> FFF0..FFF8    ; Control # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
> FFF9..FFFB    ; Control # Cf   [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
> 13430..1343F  ; Control # Cf  [16] EGYPTIAN HIEROGLYPH VERTICAL JOINER..EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
> 1BCA0..1BCA3  ; Control # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
> 1D173..1D17A  ; Control # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
> E0000         ; Control # Cn       <reserved-E0000>
> E0001         ; Control # Cf       LANGUAGE TAG
> E0002..E001F  ; Control # Cn  [30] <reserved-E0002>..<reserved-E001F>
> E0080..E00FF  ; Control # Cn [128] <reserved-E0080>..<reserved-E00FF>
> E01F0..E0FFF  ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
> 
> Wonder why anybody would be interested to find just trailing spaces and not
> trailing tabs or vice versa, so if we have categories, blank would be one,
> then perhaps nvspace as something not including '\0', so just ' ' '\t' '\f'
> '\v' and if really needed, control characters with added ' ', but how to
> call that and would it really need to parse UTF-8/EBCDIC and look at
> pregenerated tables?

And there are also:
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
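
For illustration only, here is a minimal C sketch of the byte-level
categories discussed above (blank, and nvspace without '\0'), plus a
range-table lookup for the White_Space list. The names trailing_ws_kind,
TWS_BLANK, TWS_SPACE, trailing_ws_len and is_unicode_white_space are
hypothetical, not existing GCC or libiberty identifiers, and the table
lookup assumes the caller has already decoded UTF-8 into code points.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical category names, for illustration only.  */
enum trailing_ws_kind
{
  TWS_BLANK,  /* ' ' and '\t' only.  */
  TWS_SPACE   /* ' ', '\t', '\f' and '\v'; no line separators, no '\0'.  */
};

/* Return true if byte C belongs to category KIND.  */
static bool
in_category (unsigned char c, enum trailing_ws_kind kind)
{
  if (c == ' ' || c == '\t')
    return true;
  return kind == TWS_SPACE && (c == '\f' || c == '\v');
}

/* Return the number of trailing whitespace bytes of category KIND in
   LINE[0..LEN), not counting one final "\n", "\r" or "\r\n", since
   those are line separators rather than trailing characters.  */
static size_t
trailing_ws_len (const char *line, size_t len, enum trailing_ws_kind kind)
{
  if (len > 0 && line[len - 1] == '\n')
    len--;
  if (len > 0 && line[len - 1] == '\r')
    len--;
  size_t end = len;
  while (len > 0 && in_category ((unsigned char) line[len - 1], kind))
    len--;
  return end - len;
}

/* Return true if code point CP is in the White_Space list above.
   Assumes CP was already decoded from UTF-8 by the caller.  */
static bool
is_unicode_white_space (unsigned long cp)
{
  static const struct { unsigned long lo, hi; } ranges[] = {
    { 0x0009, 0x000D }, { 0x0020, 0x0020 }, { 0x0085, 0x0085 },
    { 0x00A0, 0x00A0 }, { 0x1680, 0x1680 }, { 0x2000, 0x200A },
    { 0x2028, 0x2029 }, { 0x202F, 0x202F }, { 0x205F, 0x205F },
    { 0x3000, 0x3000 }
  };
  for (size_t i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
    if (cp >= ranges[i].lo && cp <= ranges[i].hi)
      return true;
  return false;
}
```

The byte-level check needs no decoding or tables at all; only the
White_Space variant would require parsing UTF-8 first, which is the
extra cost being questioned above.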

        Jakub
