[issue7643] What is an ASCII linebreak?

Florent Xicluna Fri, 08 Jan 2010 03:42:49 -0800

Florent Xicluna <[email protected]> added the comment:

It's confusing.


There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
  BK, CR, LF, NL

And the resulting list is different:
                                        CAT BIDI BRK
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B   LF
000B    VT  LINE TABULATION             Cc  S   BK (since Unicode 5.0) 
000C    FF  FORM FEED                   Cc  WS  BK
000D    CR  CARRIAGE RETURN             Cc  B   CR
0085    NEL NEXT LINE                   Cc  B   NL (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS  BK
2029    PS  PARAGRAPH SEPARATOR         Zp  B   BK
------------------------------------------------------------------------
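
The CAT and BIDI columns can be cross-checked against Python's own Unicode database. This is only a sketch: the helper name cat_bidi is made up for illustration, and the stdlib unicodedata module does not expose the line-breaking class (BRK), so that column still has to come from LineBreak.txt.

```python
import unicodedata

# Code points from the table above.
CODE_POINTS = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def cat_bidi(code_point: int) -> tuple:
    """Return (general category, bidirectional class) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for cp in CODE_POINTS:
        cat, bidi = cat_bidi(cp)
        print(f"{cp:04X}    {cat:<3} {bidi}")
```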

Differences:
 - VT and FF are mandatory breaks (even if “implementations are not
   required to support the VT character”)
 - FS, GS, US are combining marks (CM): “Prohibit a line break between
   the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the 
Unicode standard.
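
To see where a given interpreter stands, one can probe str.splitlines() directly. The sketch below is not the tracker's test case: the names UAX14_MANDATORY, UAX14_CM, splits_on and compare are invented here, the class assignments are copied from this message rather than read from LineBreak.txt, and the result depends on the Python version in use.

```python
# Classes as listed in this message (assumption: not re-checked against LineBreak.txt).
UAX14_MANDATORY = {
    0x000A: "LF",
    0x000B: "BK",
    0x000C: "BK",
    0x000D: "CR",
    0x0085: "NL",
    0x2028: "BK",
    0x2029: "BK",
}

# Separators the message describes as class CM (FS, GS, US).
UAX14_CM = {0x001C: "CM", 0x001D: "CM", 0x001F: "CM"}


def splits_on(code_point: int) -> bool:
    """Return True if str.splitlines() treats this code point as a line boundary."""
    return len(("a" + chr(code_point) + "b").splitlines()) == 2


def compare() -> dict:
    """Map each code point to (class from this message, does splitlines() split?)."""
    table = {**UAX14_MANDATORY, **UAX14_CM}
    return {cp: (cls, splits_on(cp)) for cp, cls in sorted(table.items())}


if __name__ == "__main__":
    for cp, (cls, splits) in compare().items():
        print(f"U+{cp:04X}  {cls:<3} splitlines() splits: {splits}")
```

A row where the class is CM but splitlines() splits, or where the class is BK but it does not, is a disagreement of the kind described above.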

References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
   http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
   http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue7643>
_______________________________________