Florent Xicluna <la...@yahoo.fr> added the comment:

Some technical background.

== Unicode ==

According to Unicode Standard Annex #9, a character with bidirectional class B is a "Paragraph Separator". And “Because a Paragraph Separator breaks lines, there will be at most one per line, at the end of that line.”

As a consequence, there are three reasons to identify a character as a linebreak:
 - General Category Zl "Line Separator"
 - General Category Zp "Paragraph Separator"
 - Bidirectional Class B "Paragraph Separator"

There are 8 linebreaks in the current Unicode Database (5.2):

------------------------------------------------------------------------
000A  LF   LINE FEED                    Cc  B
000D  CR   CARRIAGE RETURN              Cc  B
001C  FS   INFORMATION SEPARATOR FOUR   Cc  B   (UCD 3.1 FILE SEPARATOR)
001D  GS   INFORMATION SEPARATOR THREE  Cc  B   (UCD 3.1 GROUP SEPARATOR)
001E  RS   INFORMATION SEPARATOR TWO    Cc  B   (UCD 3.1 RECORD SEPARATOR)
0085  NEL  NEXT LINE                    Cc  B   (C1 Control Code)
2028  LS   LINE SEPARATOR               Zl  WS  (Unicode)
2029  PS   PARAGRAPH SEPARATOR          Zp  B   (Unicode)
------------------------------------------------------------------------

== ASCII ==

The standard ASCII control codes (C0) are in the range 00-1F, which limits the list to LF, CR, FS, GS, RS. The last three are not considered linebreaks:

“The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to structure data, usually on a tape, in order to simulate punched cards. End of medium (EM) warns that the tape (or whatever) is ending. While many systems use CR/LF and TAB for structuring data, it is possible to encounter the separator control characters in data that needs to be structured. The separator control characters are not overloaded; there is no general use of them except to separate data into structured groupings. Their numeric values are contiguous with the space character, which can be considered a member of the group, as a word separator.”
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

In conclusion, it may be better to keep things unchanged.
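
For anyone who wants to check the list above, here is a minimal sketch that applies the three criteria with the unicodedata module. It assumes Python 3 (where chr() covers the whole code point range), and the helper name is_linebreak is made up for this illustration, not something from the standard library. The result depends on the Unicode database bundled with the interpreter, so treat it as a quick check rather than a reference.

```python
import sys
import unicodedata


def is_linebreak(ch):
    """Hypothetical helper: apply the three criteria listed above."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point known to this interpreter's Unicode database.
linebreaks = [chr(cp) for cp in range(sys.maxunicode + 1)
              if is_linebreak(chr(cp))]

for ch in linebreaks:
    # Does str.splitlines() treat this character as a line boundary?
    splits = ("a" + ch + "b").splitlines() == ["a", "b"]
    print("%04X  %-2s  %-2s  str.splitlines() splits: %s" % (
        ord(ch), unicodedata.category(ch),
        unicodedata.bidirectional(ch), splits))

# bytes.splitlines() only knows the ASCII line endings, so FS is left alone.
print(b"a\x1cb".splitlines())
```

The str check only shows that these characters are split on; it does not claim they are the only ones, since str.splitlines() may recognise others depending on the Python version.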
We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

References:
 - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
 - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
 - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
 - C0 and C1 Control Codes: http://en.wikipedia.org/wiki/C0_and_C1_control_codes

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________