Marc-Andre Lemburg <m...@egenix.com> added the comment:

Florent Xicluna wrote:
>
> Florent Xicluna <la...@yahoo.fr> added the comment:
>
> It's confusing.
>
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
>   BK, CR, LF, NL

Note that a line breaking algorithm is something different from a
line split algorithm. The latter is used to separate lines
at pre-defined positions in the text; the former is used to
format a piece of text to fit e.g. into a certain width of
available character positions.

.splitlines() implements a line splitting algorithm, not a
line breaking one.

> And the resulting list is different:
>                                         CAT BIDI BRK
> ------------------------------------------------------------------------
> 000A    LF    LINE FEED                 Cc   B   LF
> 000B    VT    LINE TABULATION           Cc   S   BK  (since Unicode 5.0)
> 000C    FF    FORM FEED                 Cc   WS  BK
> 000D    CR    CARRIAGE RETURN           Cc   B   CR
> 0085    NEL   NEXT LINE                 Cc   B   NL  (C1 Control Code)
> 2028    LS    LINE SEPARATOR            Zl   WS  BK
> 2029    PS    PARAGRAPH SEPARATOR       Zp   B   BK
> ------------------------------------------------------------------------
>
> Differences:
>  - VT and FF are mandatory breaks (even if “implementations are not
>    required to support the VT character”)
>  - FS, GS, US are combining marks (CM): “Prohibit a line break between
>    the character and the preceding character”
>
> According to this Annex, the current splitlines() implementation violates the
> Unicode standard.

It appears so and I guess that's an oversight on my part
when writing the code: in Unicode 2.1 (the version I started
with), FF was marked as "B", later on Unicode 3.0 was published
and the new LineBreak.txt file was added to the standard.
FF was changed to "WS" and instead marked as "BK" in that
new LineBreak.txt file.

Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.

I guess we'll have to add FF and VT to the generator
makeunicodedata.py to remedy this.
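
For anyone who wants to check this on their own interpreter, here is a minimal sketch. It only uses str.splitlines() and the stdlib unicodedata module; the names CANDIDATES, splits_on() and probe() are made up for this illustration and are not part of makeunicodedata.py. The stdlib does not expose the LineBreak.txt class, so the script shows only the general category and the bidirectional class from UnicodeData.txt next to what .splitlines() actually does. Results depend on the Python version and the Unicode database it ships with.

```python
import unicodedata

# Code points from the table above, plus FS/GS/US from the discussion.
# The short names are labels for this sketch only.
CANDIDATES = {
    0x000A: "LF",
    0x000B: "VT",
    0x000C: "FF",
    0x000D: "CR",
    0x001C: "FS",
    0x001D: "GS",
    0x001F: "US",
    0x0085: "NEL",
    0x2028: "LS",
    0x2029: "PS",
}


def splits_on(ch):
    """Return True if str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2


def probe():
    """Collect (code point, label, category, bidi class, splits?) rows."""
    rows = []
    for cp, label in sorted(CANDIDATES.items()):
        ch = chr(cp)
        rows.append((
            cp,
            label,
            unicodedata.category(ch),       # "CAT" column of the table
            unicodedata.bidirectional(ch),  # "BIDI" column of the table
            splits_on(ch),                  # behaviour of this interpreter
        ))
    return rows


if __name__ == "__main__":
    for cp, label, cat, bidi, split in probe():
        print(f"{cp:04X}  {label:<4} {cat:<3} {bidi:<3} splits={split}")
```

Any row where the BRK column in the table says BK, CR, LF or NL but the script reports splits=False would be a candidate for the fix in the generator.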

> References:
>  - Unicode Standard Annex #14 - Line Breaking Algorithm
>    http://www.unicode.org/reports/tr14/
>  - UCD LineBreak.txt
>    http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

Thanks,
--
Marc-Andre Lemburg
eGenix.com

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________