from:"Tom Christiansen"

[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen added the comment: >Martin v. Löwis added the comment: >> I would encourage you to look at the Perl CPAN module Unicode::LineBreak, >> which fully implements tr11. >Thanks for the pointer! >> If you'd like, I can show you a program t

[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen added the comment: >Martin v. Löwis added the comment: >> Martin, I think you meant to write "if w == 'A':". >> Some very common characters have ambiguous widths though (e.g. the Greek >alphabet), so you can't just raise

[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen added the comment: I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11. It includes Unicode::GCString, a class that has a columns() method to determine the print columns. This is very fancy in the case of Asian widths, but of

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-20 Thread Tom Christiansen

Tom Christiansen added the comment: Yes, it looks good. Thank you very much. -tom

[issue12568] Add functions to get the width in columns of a character

2011-10-14 Thread Tom Christiansen

Tom Christiansen added the comment: > Martin v. Löwis added the comment: > I think the WideCharToMultibyte approach is just incorrect. > I'm -1 on using wcswidth, though. Like you, I too seriously question using wcswidth() for this at all: The wcswidth() function either

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-09 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 09 Oct 2011 13:21:00 -: > Here is a new patch that stores the names of aliases and named > sequences in the Private Use Area. Looks good! Thanks! --tom -- title: \N{...} neglects formal aliases and

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-03 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Mon, 03 Oct 2011 04:15:51 -: >> But it still has to happen at compile time, of course, so I don't know >> what you could do in Python. Is there any way to change how the compiler >> behaves even vaguely

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen

Tom Christiansen added the comment: >> Really? White space makes things harder to read? I thought Pythonistas >> believed the opposite of that. > I was surprised at that too ;-). One person's opinion in a specific > context. Don't generalize. The example I init

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 02 Oct 2011 06:46:26 -: > Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely > because that's a Unicode 1 name, and nowadays these codepoints are simply > marked as ''.

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-01 Thread Tom Christiansen

Tom Christiansen added the comment: >> Perl does not provide the old 1.0 names at all. We don't have a Unicode >> 1.0 legacy to support, which makes this cleaner. However, we do provide >> for the names of the C0 and C1 Control Codes, because apart from Unicode >&g

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Tom Christiansen

Tom Christiansen added the comment: Martin v. Löwis wrote on Sat, 01 Oct 2011 10:59:48 -: >> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc. > Where did you get that definition from? UTS#18 defines > "", which is Alphabetic + U+200C + U+200D > (i.e.

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-09-30 Thread Tom Christiansen

Tom Christiansen added the comment: >Ezio Melotti added the comment: > Leaving named sequences for unicodedata.lookup() only (and not for > \N{}) makes sense. There are certainly advantages to that strategy: you don't have to deal with [\N{sequence}] issues. If t

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Tom Christiansen

Tom Christiansen added the comment: > Martin v. Löwis added the comment: > "Split S into words. Change the first letter in a word to upper-case, Except that I think you actually mean that the first "letter" is changed into titlecase not uppercase. One might also sa

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen added the comment: It appears that I'm right about surrogates, but wrong about noncharacters. I'm seeking a clarification there. --tom

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen added the comment: No good news on the Java front. They do all kinds of things wrong. For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8 input stream, which is illegal. There's more they do wrong, including in their documentation, but I won'

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Mon, 19 Sep 2011 11:11:48 -: > We could also look at what other languages do and/or ask to the > Unicode consortium. I will look at what Java does a bit later on this morning, which is the only other commonly used la

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-18 Thread Tom Christiansen

Tom Christiansen added the comment: "Terry J. Reedy" wrote on Thu, 08 Sep 2011 18:56:11 -: >On 9/8/2011 4:32 AM, Ezio Melotti wrote: >> So to summarize a bit, there are different possible level of strictness: >>1) all the possible encodable values,

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-07 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sat, 03 Sep 2011 00:28:03 -: > Ezio Melotti added the comment: > Or they are still called UTF-8 but used in combination with different error > handlers, like surrogateescape and surrogatepass. The "plain

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Tom Christiansen

Tom Christiansen added the comment: Antoine Pitrou wrote on Mon, 29 Aug 2011 13:21:06 -: > It's not only "typographically speaking", it's really a spelling error, > even in hand-written text :-) Sure, and so too is omitting an accent mark or diaeresi

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Tom Christiansen

Tom Christiansen added the comment: Antoine Pitrou wrote on Sat, 27 Aug 2011 20:04:56 -: >> Neither am I. Even in "old-style" English with ae and oe, one wrote >> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or >> *Aesir. Simi

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Sat, 27 Aug 2011 16:15:33 -: > Although personally I don't have much of an intuition for what > titlecase means (and why it's important), perhaps because I'm not > familiar with any language where t

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Fri, 26 Aug 2011 21:11:24 -: > Would this also affect .islower() and friends? SHORT VERSION: (7 lines) I don't believe so, but the relationship between lower() and islower() is not as clear to me as I wo

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-27 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Sat, 27 Aug 2011 03:26:21 -: > To me, making (default) iteration deviate from indexing is anathema. So long as there's a way to iterate through a string some other way than by code unit, that's fine. Howe

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: Here’s my casing test suite; I thought I sent it in but the mux file here isn’t the full thing. It does several things, including letting you run it with regex vs re. It also checks for the islower, etc functions. It has both simple and full (and turkic

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Fri, 26 Aug 2011 21:11:24 -: > Guido van Rossum added the comment: > I presume this applies to builtin str methods like .lower(), right? I > think it is a good thing to do for Python 3.3. Yes, the full casemap

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Fri, 26 Aug 2011 21:55:03 -: > I know I sound like NIH, but I'm always reluctant to add a big 3rd > party lib like ICU to the permanent dependencies of all future Python > distros. If people want to use IC

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: I should probably mention the importance in the design of a UCA module of being able to specify which UCA version number you want it to behave like in case you plan to override some of the DUCET entries. That way if you run under a later UCA with different

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: Raymond Hettinger added the comment: > I would like to be involved in the design of the API for a UCA module > and its routines for loading Unicode Collation Element Tables (not > making the mistake of using global state like the locale module d

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: Guido van Rossum wrote on Fri, 26 Aug 2011 21:16:57 -: > Yeah, this should be fixed in 3.3 and probably backported to 3.2 > and 2.7. (There is already no guarantee that len(s) == > len(s.title()), right?) Well, *I* don't know of any

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen added the comment: > Sounds like a fair feature request for Python 3.3, as long as the > intention is that users must import some module from the standard > library and use functions defined in that module. The operations and > methods defined for str in

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen

Tom Christiansen added the comment: Matthew Barnett wrote on Fri, 19 Aug 2011 23:36:45 -: > For the "Line_Break" property, one of the possible values is > "Inseparable", with 2 permitted aliases, the shorter "IN" (which > is reasonable) and "

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen

Tom Christiansen added the comment: "Terry J. Reedy" wrote on Fri, 19 Aug 2011 22:50:58 -: > My current opinion is that adding the aliases might be done in current > releases. It certainly would serve any user who does not know to > misspell 'FTHORA'

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen added the comment: Marc-Andre Lemburg wrote on Tue, 16 Aug 2011 12:11:22 -: > The reasoning behind e.g. "ISSURROGATE" is that those names originate > from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE > macros which in ret

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Tue, 16 Aug 2011 09:23:50 -: > All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER > and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them > more readable though. &

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen added the comment: Antoine Pitrou wrote on Tue, 16 Aug 2011 09:18:46 -: >> I think the 4 macros: >> #define _Py_UNICODE_ISSURROGATE >> #define _Py_UNICODE_ISHIGHSURROGATE >> #define _Py_UNICODE_ISLOWSURROGATE >> #define _Py_UNICOD

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen added the comment: I now see there are lots of good things in the BOM FAQ that have come up lately regarding surrogates and other illegal characters, and about what can go in data streams. I quote a few of these from http://unicode.org/faq/utf_bom.html below: Q: How do

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen added the comment: >Ezio Melotti added the comment: >I think the 4 macros: > #define _Py_UNICODE_ISSURROGATE > #define _Py_UNICODE_ISHIGHSURROGATE > #define _Py_UNICODE_ISLOWSURROGATE > #define _Py_UNICODE_JOIN_SURROGATES >are quite straightforward an

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen : Removed file: http://bugs.python.org/file22902/nametests.py

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: Here’s the right test file for the right ticket. -- Added file: http://bugs.python.org/file22903/nametests.py

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: >Terry J. Reedy added the comment: >Adding Symbola filled in the symbols and emoticons lines. >The gothic chars are still missing even with Alfios. That's too bad, as the Gothic paternoster is kinda cute. :) Hm, I wonder where I got them

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: Oh whoops, that was the long ticket. Shall I reupload to the right number?

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: Sorry I didn't include a test case. Hope this makes up for it. If not, please tell me how to write better test cases. :( Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical alignment, but it really helps to make what is diff

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: >Terry J. Reedy added the comment: > You are right, FF switched on me without notice. Bad FF. Thank you! What > I now see makes much more sense. >[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen added the comment: >Terry J. Reedy added the comment: > My Firefox is already set at utf-8. More likely a font limitation. I > will look again after installing one of the fonts Tom suggested. Symbola is best for exotic glyphs, especially astral ones. Alfios just l

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

New submission from Tom Christiansen : Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python. (If this is construed to be an extant bug

[issue12746] normalization is affected by unicode width

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen : -- nosy: +tchrist

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Mon, 15 Aug 2011 04:56:55 -: > Another thing I noticed is that (at least on wide builds) surrogate pairs are > not joined "on the fly": > >>> p > '\ud800\udc00' > >>> len(p
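The point about surrogate pairs not being joined "on the fly" can be illustrated with a small sketch. It assumes Python 3.3 or later, where every str is indexed by code point; join_surrogates is a hypothetical helper written for this illustration, not something from the tracker discussion or the standard library.

```python
def join_surrogates(high: str, low: str) -> str:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration only.
    """
    hi, lo = ord(high), ord(low)
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Standard UTF-16 arithmetic: 10 bits from each surrogate, offset 0x10000.
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))


# Two separate surrogate code points, as in the quoted session.
p = "\ud800\udc00"
joined = join_surrogates(p[0], p[1])

print(len(p))            # the literal stays as two code points
print(hex(ord(joined)))  # the explicitly joined code point
```

The string literal keeps the two surrogates apart; only the explicit arithmetic produces the single supplementary code point.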

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: I wrote: >> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. > So I'm finding. Perhaps that's why I keep getting confused. I do have a > pretty firm > notion of what UCS-2 and UTF-16 are, and

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: "Terry J. Reedy" wrote on Mon, 15 Aug 2011 00:26:53 -: > PS: The OSCON link in msg142036 currently gives me 404 not found Sorry, I wrote http://training.perl.com/OSCON/index.html but meant http://training.perl.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 17:46:55 -: >> I'm a bit confused on this. You no longer fix bugs in Python 2? > We do, but it's unlikely that we will introduce major changes in behavior. > Even if we had to get rid o

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 17:15:52 -: >> You're right: my wide build is not Python3, just Python2. > And is it failing? Here the tests pass on the wide builds, on both Python 2 > and 3. Perhaps I am doing something

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 07:15:09 -: > For example I don't think removing the 0x10 upper limit is going to > happen -- even if it might be useful for other things. I agree entirely. That's why I appended a tr

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: >Ezio Melotti added the comment: >On wide 3.2 it passes too, so the failure is limited to narrow builds (are >you sure that it fails on wide builds for you?). You're right: my wide build is not Python3, just Python2. In fact, it's

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 07:15:09 -: >> Unicode says you can't put surrogates or noncharacters in a >> UTF-anything stream. It's a bug to do so and pretend it's a >> UTF-whatever. > The UTF-8 codec

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

New submission from Tom Christiansen : On neither narrow nor wide builds does this UTF8-encoded bit run without raising an exception: if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE): print("match 1 passed") else: print("match 2 failed")
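For comparison, here is a minimal sketch of the same kind of range test, assuming Python 3.3 or later (where there is no narrow/wide distinction). The code points are written as escapes so the example does not depend on the source file's encoding; it says nothing about how the 2011 builds discussed in this report behaved.

```python
import re

# U+1D49C MATHEMATICAL SCRIPT CAPITAL A .. U+1D4B5 MATHEMATICAL SCRIPT CAPITAL Z
pattern = re.compile("[\U0001D49C-\U0001D4B5]")

# U+1D49E MATHEMATICAL SCRIPT CAPITAL C lies inside that range.
m = pattern.search("\U0001D49E")

if m:
    print("match passed")
else:
    print("match failed")
```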

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen added the comment: Ezio Melotti added the comment: >> It is simply a design error to pretend that the number of characters >> is the number of code units instead of code points. A terrible and >> ugly one, but it does not mean you are UCS-2. > If you

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen added the comment: >> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. >> Perhaps someone could tell me why the Python documentation says it uses >> UCS-2 on a narrow build. > There's a disagreement on that point betwee

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen added the comment: Antoine Pitrou wrote on Sat, 13 Aug 2011 21:09:52 -: > And/or a lookup table giving the byte offset of, say, every 16th > character. It gives you a O(1) lookup with a relatively reasonable > constant cost (you have to scan for les

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen added the comment: Matthew Barnett wrote on Sat, 13 Aug 2011 20:57:40 -: > There are occasions when you want to do string slicing, often of the form: > pos = my_str.index(x) > endpos = my_str.index(y) > substring = my_str[pos : endpos] Me, I wo

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen added the comment: David Murray wrote: > Tom, note that nobody is arguing that what you are requesting is a bad > thing :) There looked to be some minor resistance, based on absolute backwards compatibility even if wrong, regarding changing anything *at all* in re

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-12 Thread Tom Christiansen

Tom Christiansen added the comment: "Terry J. Reedy" wrote on Fri, 12 Aug 2011 23:05:27 -: > Ouch! > Do the rejected characters qualify as identifier characters as defined > in Reference 2.3 Identifiers and keywords? > http://docs.python.org/py3k/reference

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Tom Christiansen

Tom Christiansen added the comment: > Terry J. Reedy added the comment: > However desirable it would be, I do not believe there is any claim in the > manual that the re module follows the evolving Unicode consortium r.e. stan= My from the hip thought is that if re cannot be

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Tom Christiansen

Tom Christiansen added the comment: "Terry J. Reedy" wrote on Fri, 12 Aug 2011 22:21:59 -: > Does the regex module handle these particular issues better? No, it currently does not. One would have to ask Matthew directly, but I believe it was because he was trying to st

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-12 Thread Tom Christiansen

Tom Christiansen added the comment: > Terry J. Reedy added the comment: > I am not sure that everyone will agree that this is a bug, rather than a > feature request, or that if a bug, that it should be changed in existing > releases and possibly break running code. The d

[issue11230] "Full unicode import system" not in 3.2

2011-08-12 Thread Tom Christiansen

Tom Christiansen added the comment: Whoops, I meant that it appears that Python runs its identifiers through NFC. How that gets along with a filesystem that has quasi-NFD filenames I'm not sure, but it seems like it might be a variant of the case-insensitivity issue in file

[issue2857] add codec for java modified utf-8

2011-08-11 Thread Tom Christiansen

Tom Christiansen added the comment: Please do not call this "utf-8-java". It is called "cesu-8" per UTR#26 at: http://unicode.org/reports/tr26/ CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by pe
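To make the difference concrete, here is a rough sketch of a CESU-8 encoder next to Python's real UTF-8 codec. cesu8_encode is a hypothetical helper written for this illustration (Python ships no such codec under that name as far as this sketch assumes), and it covers only the supplementary-character case described in the report at the URL above.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: supplementary characters become two
    three-byte sequences, one per UTF-16 surrogate. Hypothetical helper."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)     # high surrogate
            lo = 0xDC00 + (cp & 0x3FF)   # low surrogate
            # "surrogatepass" lets the UTF-8 codec emit surrogate code points.
            out += chr(hi).encode("utf-8", "surrogatepass")
            out += chr(lo).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I
print(deseret.encode("utf-8").hex())  # four bytes in real UTF-8
print(cesu8_encode(deseret).hex())    # six bytes in CESU-8
```

The six-byte form is not valid UTF-8, which is why a strict UTF-8 decoder rejects it.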

[issue11230] "Full unicode import system" not in 3.2

2011-08-11 Thread Tom Christiansen

Tom Christiansen added the comment: How does this work for modules that have filesystem names different from the one used for import? The issue I'm thinking about is that the Mac HFS+ filesystem keeps its Unicode filenames in (close to) NFD form. That means that a module named "c
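The normalization mismatch can be sketched with the standard unicodedata module. The name "café" below is an arbitrary example chosen for this illustration (the module name in the message above is cut off), and same_name is a hypothetical helper, not how the import system actually compares names.

```python
import unicodedata

# The same visible name in composed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

print(len(nfc), len(nfd))  # the decomposed form has one more code point
print(nfc == nfd)          # a naive comparison treats them as different


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(nfc, nfd))
```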

[issue12568] Add functions to get the width in columns of a character

2011-08-11 Thread Tom Christiansen

Tom Christiansen added the comment: I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking. There might be something you can steal from http://search.cpan.org/perldoc?Unicode::GCString
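A very rough approximation of a column count can be built on unicodedata.east_asian_width. This is only a sketch under simplifying assumptions: it is not the algorithm of Unicode::GCString, it does not segment grapheme clusters, it treats ambiguous-width characters as narrow, and it ignores control characters and zero-width characters whose combining class is 0. The function name columns is chosen here for illustration.

```python
import unicodedata


def columns(text: str) -> int:
    """Approximate the number of terminal columns text occupies.

    Sketch only: wide/fullwidth characters count 2, characters with a
    nonzero combining class count 0, everything else counts 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining mark: drawn over the preceding character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


print(columns("abc"))
print(columns("\u65e5\u672c\u8a9e"))  # three CJK ideographs
print(columns("e\u0301"))             # e + COMBINING ACUTE ACCENT
```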

[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen

Tom Christiansen added the comment: I've been doing a lot of testing of Matthew's regex library against UTS#18 issues, but only somewhat incidentally testing re. To use regex, one has to accept that certain things will work differently than they work in re, because he is followi

[issue12737] string.title() is overzealous by upcasing combining marks inappropriately

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Python's string.title() function claims it titlecases the first letter in each word and lowercases the rest. However, this is not true. It is not using either of the two word detection algorithms that Unicode provides. One allows you to use a l

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Python's casemapping functions only use what Unicode calls simple casemaps. These are only appropriate for functions that operate on single characters alone, not for those that operate on strings. The reason for this is that you get much better re
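The difference between simple and full casemaps shows up whenever one character maps to several. The sketch below assumes Python 3.3 or later, where the str methods apply the full mappings; it does not reproduce the behaviour of the interpreters current when this report was filed.

```python
# Full case mapping can change the length of a string.
sharp_s = "\u00df"   # LATIN SMALL LETTER SHARP S
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

upper_sharp_s = sharp_s.upper()
upper_ligature = ligature.upper()

print(len(sharp_s), len(upper_sharp_s))
print(len(ligature), len(upper_ligature))
```

A simple (one-to-one) mapping has no uppercase for either character, so it would leave both unchanged.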

[issue12735] request full Unicode collation support in std python library

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Python has no standard support for the Unicode Collation Algorithm as explained in UTS #10. This is a request that a UCA library be added to the standard Python distribution. Collation underlies virtually everything we do with text, not just sorting but any

[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Python supports no Unicode properties in its re library, making it unsuitable for work with Unicode. This is therefore a formal request for the Python re library to support Unicode properties. The eleven properties required by Unicode Technical Report

[issue12733] Request for grapheme support in Python re lib

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Without proper grapheme support in the regular expression library, it is impossible to correctly process Unicode. At the very least, one needs the \X escape supported, which is an extended grapheme cluster per UTS#18. This escape is supported by many

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen

Changes by Tom Christiansen : -- components: +Regular Expressions -Library (Lib) type: -> behavior

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : You cannot reliably use Unicode in Python identifiers because of the narrow/wide build issue. The enclosed file is fine on wide builds but gets compiler errors on narrow ones during compilation. Go, Ruby, Java, and Perl all handle this situation without

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : You cannot use Python's lib re for handling Unicode regular expressions because it violates the standard set out for the same in UTS#18 on Unicode Regular Expressions in RL1.2a on compatibility properties. What \w is allowed to match is cl

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : You cannot use Python's casemapping functions on Unicode data because they fail on narrow builds. This makes it impossible to write portable code in Python that can cope with full Unicode. I've tried several times to submit this bug, bu

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization lik

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen : The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong answers. I
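The distinction between casefolds and lowercase mappings can be sketched with str.casefold(), assuming Python 3.3 or later. This compares plain strings only; it does not show what the re module does internally, and caseless_equal is a hypothetical helper name.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their Unicode casefolds (hypothetical helper)."""
    return a.casefold() == b.casefold()


# Comparing lowercase mappings misses the match...
print("Stra\u00dfe".lower() == "STRASSE".lower())
# ...while comparing casefolds finds it.
print(caseless_equal("Stra\u00dfe", "STRASSE"))

# Greek final sigma folds to the ordinary small sigma.
print(caseless_equal("\u03c2", "\u03a3"))
```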

80 matches
