Florent Xicluna <la...@yahoo.fr> added the comment:

It's confusing.
There's a specific annex, UAX #14, which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
BK, CR, LF, NL

And the resulting list is different:

                                   CAT BIDI BRK
------------------------------------------------------------------------
000A LF   LINE FEED                Cc  B    LF
000B VT   LINE TABULATION          Cc  S    BK  (since Unicode 5.0)
000C FF   FORM FEED                Cc  WS   BK
000D CR   CARRIAGE RETURN          Cc  B    CR
0085 NEL  NEXT LINE                Cc  B    NL  (C1 Control Code)
2028 LS   LINE SEPARATOR           Zl  WS   BK
2029 PS   PARAGRAPH SEPARATOR      Zp  B    BK
------------------------------------------------------------------------

Differences:
 - VT and FF are mandatory breaks (even if “implementations are not
   required to support the VT character”)
 - FS, GS, US are combining marks (CM): “Prohibit a line break between
   the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.

References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
   http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
   http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________
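
For illustration, the mandatory-break list from the comment above could be sketched as a small splitter. This is only a sketch, not the splitlines() implementation: the helper name split_mandatory_breaks is hypothetical, the character set is copied from the table in the comment, and treating CR LF as a single break is an assumption made here.

```python
import re

# Characters listed in the comment's table as mandatory breaks under UAX #14
# (classes BK, CR, LF, NL). CR followed by LF is matched first so that the
# pair counts as one break (an assumption of this sketch).
_MANDATORY_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_mandatory_breaks(text):
    """Hypothetical helper: split text only at UAX #14 mandatory breaks."""
    parts = _MANDATORY_BREAK.split(text)
    # Like str.splitlines(), do not report an empty line after a final break.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    sample = "a\x1cb\x0bc\r\nd\n"
    # FS (U+001C) is not in the mandatory-break set, so "a" and "b" stay together.
    print(split_mandatory_breaks(sample))
    # Output of the built-in on the same input, for comparison.
    print(sample.splitlines())
```

Running both calls on the same input shows where the built-in and the list from the Annex disagree on a given Python version.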