On Thu, Sep 19, 2024 at 09:07:06AM +0200, Jakub Jelinek wrote:
> space is ' ' '\t' '\n' '\r' '\f' '\v' in the C locale,
> blank is ' ' '\t'
> cntrl is a lot of chars but not ' '
> if we extend by the safe-ctype
> vspace '\r' '\n'
> nvspace ' ' '\t' '\f' '\v' '\0'
> Obviously, we shouldn't look at '\r' and '\n', those aren't trailing
> characters, those are line separators.
>
> Would we need to consider all UTF-8 (or EBCDIC-UTF) control characters is
> cntrl?
> 0000..0009    ; Control # Cc   [10] <control-0000>..<control-0009>
> 000B..000C    ; Control # Cc    [2] <control-000B>..<control-000C>
> 000E..001F    ; Control # Cc   [18] <control-000E>..<control-001F>
> 007F..009F    ; Control # Cc   [33] <control-007F>..<control-009F>
> 00AD          ; Control # Cf        SOFT HYPHEN
> 061C          ; Control # Cf        ARABIC LETTER MARK
> 180E          ; Control # Cf        MONGOLIAN VOWEL SEPARATOR
> 200B          ; Control # Cf        ZERO WIDTH SPACE
> 200E..200F    ; Control # Cf    [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
> 2028          ; Control # Zl        LINE SEPARATOR
> 2029          ; Control # Zp        PARAGRAPH SEPARATOR
> 202A..202E    ; Control # Cf    [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
> 2060..2064    ; Control # Cf    [5] WORD JOINER..INVISIBLE PLUS
> 2065          ; Control # Cn        <reserved-2065>
> 2066..206F    ; Control # Cf   [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
> FEFF          ; Control # Cf        ZERO WIDTH NO-BREAK SPACE
> FFF0..FFF8    ; Control # Cn    [9] <reserved-FFF0>..<reserved-FFF8>
> FFF9..FFFB    ; Control # Cf    [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
> 13430..1343F  ; Control # Cf   [16] EGYPTIAN HIEROGLYPH VERTICAL JOINER..EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
> 1BCA0..1BCA3  ; Control # Cf    [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
> 1D173..1D17A  ; Control # Cf    [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
> E0000         ; Control # Cn        <reserved-E0000>
> E0001         ; Control # Cf        LANGUAGE TAG
> E0002..E001F  ; Control # Cn   [30] <reserved-E0002>..<reserved-E001F>
> E0080..E00FF  ; Control # Cn  [128] <reserved-E0080>..<reserved-E00FF>
> E01F0..E0FFF  ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
>
> Wonder why anybody would be interested to find just trailing spaces and not
> trailing tabs or vice versa, so if we have categories, blank would be one,
> then perhaps nvspace as something not including '\0', so just ' ' '\t' '\f'
> '\v' and if really needed, control characters with added ' ', but how to
> call that and would it really need to parse UTF-8/EBCDIC and look at
> pregenerated tables?
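
For illustration only, the two ASCII-only categories from the quoted text (blank, and nvspace without '\0') could be checked roughly as below. This is a minimal sketch, not the actual patch: the names trailing_kind, in_category and count_trailing are hypothetical, and it assumes an ASCII-compatible execution character set with no UTF-8 or EBCDIC handling.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical category names, chosen for this sketch only.  */
enum trailing_kind { TRAILING_BLANK, TRAILING_NVSPACE };

/* Return true if C belongs to category KIND.
   blank is ' ' '\t'; nvspace here is ' ' '\t' '\f' '\v' (no '\0').  */
static bool
in_category (unsigned char c, enum trailing_kind kind)
{
  switch (kind)
    {
    case TRAILING_BLANK:
      return c == ' ' || c == '\t';
    case TRAILING_NVSPACE:
      return c == ' ' || c == '\t' || c == '\f' || c == '\v';
    }
  return false;
}

/* Count the trailing bytes of LINE (LEN bytes long) that fall in KIND.
   '\r' and '\n' at the end are treated as line separators and skipped
   first, so they are never counted as trailing characters.  */
static size_t
count_trailing (const char *line, size_t len, enum trailing_kind kind)
{
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
    len--;

  size_t n = 0;
  while (n < len && in_category ((unsigned char) line[len - 1 - n], kind))
    n++;
  return n;
}
```

With these definitions, a line ending in a form feed before "\r\n" counts as having one trailing character under the nvspace category and none under blank.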

And there are also:
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

	Jakub