Marc-Andre Lemburg <m...@egenix.com> added the comment:

Florent Xicluna wrote:
>
> Florent Xicluna <la...@yahoo.fr> added the comment:
>
> Some technical background.
>
> == Unicode ==
>
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
>
> As a consequence, there's 3 reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"
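
As a rough illustration of the three criteria quoted above, the sketch below scans the code point range with the standard `unicodedata` module and collects every character that has General Category Zl or Zp, or Bidirectional Class B. The helper names `is_linebreak` and `find_linebreaks` are made up for this example. This is not the C implementation behind Py_UNICODE_ISLINEBREAK, and the result depends on the Unicode database version bundled with the interpreter.

```python
import sys
import unicodedata


def is_linebreak(ch):
    """Return True if ch matches one of the three criteria from the thread.

    Hypothetical helper for illustration only.
    """
    return (
        unicodedata.category(ch) in ("Zl", "Zp")
        or unicodedata.bidirectional(ch) == "B"
    )


def find_linebreaks():
    """Collect all code points that satisfy is_linebreak()."""
    return [cp for cp in range(sys.maxunicode + 1) if is_linebreak(chr(cp))]


linebreaks = find_linebreaks()

# Print code point, General Category and Bidirectional Class for each match.
for cp in linebreaks:
    ch = chr(cp)
    print("%04X  %s  %s" % (cp, unicodedata.category(ch), unicodedata.bidirectional(ch)))
```

A newer Unicode database could in principle add entries, so the output should be read as a snapshot rather than a fixed list.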

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

> There's 8 linebreaks in the current Unicode Database (5.2):
> ------------------------------------------------------------------------
> 000A    LF   LINE FEED                    Cc  B
> 000D    CR   CARRIAGE RETURN              Cc  B
> 001C    FS   INFORMATION SEPARATOR FOUR   Cc  B   (UCD 3.1 FILE SEPARATOR)
> 001D    GS   INFORMATION SEPARATOR THREE  Cc  B   (UCD 3.1 GROUP SEPARATOR)
> 001E    RS   INFORMATION SEPARATOR TWO    Cc  B   (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL  NEXT LINE                    Cc  B   (C1 Control Code)
> 2028    LS   LINE SEPARATOR               Zl  WS  (Unicode)
> 2029    PS   PARAGRAPH SEPARATOR          Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.

> == ASCII ==
>
> The Standard ASCII control codes (C0) are in the range 00-1F.
> It limits the list to LF, CR, FS, GS, RS.
> Regarding the last three, they are not considered as linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were
> made to structure data, usually on a tape, in order to simulate punched
> cards. End of medium (EM) warns that the tape (or whatever) is ending.
> While many systems use CR/LF and TAB for structuring data, it is possible
> to encounter the separator control characters in data that needs to be
> structured. The separator control characters are not overloaded; there is
> no general use of them except to separate data into structured groupings.
> Their numeric values are contiguous with the space character, which can
> be considered a member of the group, as a word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
>
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and
> bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.
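
To make the str/bytes distinction concrete, here is a small sketch, assuming Python 3, that runs both splitlines() methods over the characters from the table. The sample strings are invented for the example. On the bytes side only LF and CR act as line boundaries, while the str side also splits on FS, GS, RS, NEL, LS and PS.

```python
# Sample text containing every character from the table above
# (hypothetical test data, not taken from the tracker issue).
text = "a\nb\rc\x1cd\x1de\x1ef\x85g\u2028h\u2029i"
str_parts = text.splitlines()

# The same ASCII-range separators in a bytes object.
data = b"a\nb\rc\x1cd\x1de\x1ef"
bytes_parts = data.splitlines()

print(str_parts)    # str splits on all eight characters
print(bytes_parts)  # bytes splits only on LF and CR; FS, GS, RS stay in place
```

Current Python 3 releases also treat VT (0B) and FF (0C) as line boundaries in str.splitlines(); those two are outside the list discussed in this thread.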
For Unicode, we should mention the above definition and give
the table as an example list (the Unicode database may add more
such characters in the future).

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>    http://en.wikipedia.org/wiki/C0_and_C1_control_codes

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________