[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-11-16 Thread Antoine Pitrou
Antoine Pitrou added the comment: Closing this bug, as PEP 393 is now implemented and makes so-called "narrow builds" obsolete. Python now has an adaptive internal representation that can fit all Unicode characters. -- resolution: -> out of date stage: -> committed/rejected
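
As a rough illustration of what the resolution means in practice, here is a minimal sketch assuming Python 3.4 or later (PEP 393 landed in 3.3; re.fullmatch was added in 3.4). The sample string and the character range are illustrative choices, not taken from the thread.

```python
import re

# U+10000 lies outside the Basic Multilingual Plane (BMP).
s = "a\U00010000b"

# With the PEP 393 representation, each code point is one string element.
length = len(s)
middle = s[1]

# A character class containing astral code points matches them as units.
match = re.fullmatch("a[\U00010000-\U0001007F]b", s)
```

On a pre-3.3 narrow build the same string would have reported a length of 4, which is the behavior this issue complained about.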

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-22 Thread Alexander Belopolsky
Changes by Alexander Belopolsky: -- nosy: +belopolsky

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-21 Thread Zbyszek Szmek
Changes by Zbyszek Szmek: -- nosy: +zbysz

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen added the comment: It appears that I'm right about surrogates, but wrong about noncharacters. I'm seeking a clarification there. --tom

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen added the comment: No good news on the Java front. They do all kinds of things wrong. For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8 input stream, which is illegal. There's more they do wrong, including in their documentation, but I won't bore you with

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Tom Christiansen wrote: > > I'm pretty sure that anything that claims to be UTF-{8,16,32} needs > to reject both surrogates *and* noncharacters. Here's something from the > published Unicode Standard's p.24 about noncharacter code points: > > • Nonch

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Mon, 19 Sep 2011 11:11:48 -0000: > We could also look at what other languages do and/or ask to the > Unicode consortium. I will look at what Java does a bit later on this morning, which is the only other commonly used language besi

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Ezio Melotti
Ezio Melotti added the comment: We could also look at what other languages do and/or ask the Unicode Consortium [0]. [0]: http://www.unicode.org/consortium/distlist.html

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-18 Thread Terry J. Reedy
Terry J. Reedy added the comment: My long-ago memory is that 'should not' is slightly looser in w3c parlance than 'must not'. However, it is a moot point if we decide to follow the 'should' in 3.3 for the default 'strict' mode, which both Ezio and I think we 'should' ;-). Our 'errors' parame

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-18 Thread Tom Christiansen
Tom Christiansen added the comment: "Terry J. Reedy" wrote on Thu, 08 Sep 2011 18:56:11 -0000: >On 9/8/2011 4:32 AM, Ezio Melotti wrote: >> So to summarize a bit, there are different possible level of strictness: >>1) all the possible encodable values, including the ones >10FFFF; >>

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-15 Thread Evgeny Kapun
Changes by Evgeny Kapun: -- nosy: +abacabadabacaba

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-08 Thread Terry J. Reedy
Terry J. Reedy added the comment: On 9/8/2011 4:32 AM, Ezio Melotti wrote: > So to summarize a bit, there are different possible level of strictness: >1) all the possible encodable values, including the ones >10FFFF; >2) values in range 0..10FFFF; >3) values in range 0..10FFFF except

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-08 Thread Ezio Melotti
Ezio Melotti added the comment: So to summarize a bit, there are different possible levels of strictness: 1) all the possible encodable values, including the ones >10FFFF; 2) values in range 0..10FFFF; 3) values in range 0..10FFFF except surrogates (aka scalar values); 4) values in range
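
Levels 2 and 3 of the list above can be sketched as simple range checks. This is only an illustration under assumptions: the function names are hypothetical, and the truncated fourth level is deliberately left out.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code space

def is_code_point(value):
    """Level 2: any value in the range 0..10FFFF, surrogates included."""
    return 0 <= value <= MAX_CODE_POINT

def is_scalar_value(value):
    """Level 3: a code point that is not a surrogate (D800..DFFF)."""
    return is_code_point(value) and not 0xD800 <= value <= 0xDFFF
```

For comparison, Python's built-in chr() accepts exactly the level 2 range: chr(0xD800) succeeds, while chr(0x110000) raises ValueError.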

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-07 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Sat, 03 Sep 2011 00:28:03 -0000: > Ezio Melotti added the comment: > Or they are still called UTF-8 but used in combination with different error > handlers, like surrogateescape and surrogatepass. The "plain" UTF-* codecs > shoul

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-02 Thread Ezio Melotti
Ezio Melotti added the comment: Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass. The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during enco
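
The split between a strict "plain" codec and opt-in error handlers can be seen in current Python 3. The sketch below assumes Python 3.3 or later and uses only the standard surrogatepass and surrogateescape handlers; the variable names are illustrative.

```python
lone = "\ud800"  # a lone high surrogate

# The plain (strict) UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# surrogatepass lets the surrogate through for private, non-interchange use.
passed = lone.encode("utf-8", "surrogatepass")
restored = passed.decode("utf-8", "surrogatepass")

# surrogateescape maps undecodable bytes to lone surrogates and back again.
escaped = b"\xff".decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")
```

The data produced with surrogatepass is not valid UTF-8 for open interchange, which is exactly why it is an explicit opt-in rather than the default.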

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-02 Thread Guido van Rossum
Guido van Rossum added the comment: > So the codec should allow for both public and private use. IIUC we have (or are planning) codecs that support the private use. They are not called "utf-8" though.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-02 Thread Terry J. Reedy
Terry J. Reedy added the comment: Ezio, that is a lot of nice work to track down those pieces of the standard. I think the operative phrase in many of those quotes is 'open interchange'. Codecs are also used for private storage. If I use the unassigned or private-use code points in a private

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-02 Thread Ezio Melotti
Ezio Melotti added the comment: > To start with, no code point which when bitwise ANDed with 0xFFFE > returns 0xFFFE can ever appear in a valid UTF-* stream, but Python > allows this without any error. > That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all > planes, where NN is 00
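
The bitwise test quoted above can be written out directly. This is a sketch, not code from the thread: the function name is hypothetical, and the additional range U+FDD0..U+FDEF is included because the Unicode Standard also designates those 32 code points as noncharacters.

```python
def is_noncharacter(cp):
    """Return True if cp is one of Unicode's 66 noncharacter code points."""
    # The last two code points of every plane: 0xNN_FFFE and 0xNN_FFFF.
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    # The contiguous block of 32 noncharacters inside the BMP.
    return 0xFDD0 <= cp <= 0xFDEF

# Count them over the whole code space, planes 0 through 16.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Whether a codec should reject these values is the question the later messages in this thread go on to debate; the sketch only identifies them.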

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-28 Thread Terry J. Reedy
Terry J. Reedy added the comment: > But I think we may want to create a new module which provides various APIs specifically for apps that need care when dealing with Unicode. I have started thinking that way too -- perhaps "unitools"? It could contain the code point iterator for the benefit of

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-28 Thread Guido van Rossum
Guido van Rossum added the comment: > PEP-393 will take care of iterating by code points. Only for CPython. IronPython/Jython will still need a separate solution. > Where would you have other iterators go? The string module? > Something else I have not thought of? Or something new? Undecided.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-27 Thread Terry J. Reedy
Terry J. Reedy added the comment: Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one. However, grapheme clustering does require look-ahead, which is a

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-27 Thread Tom Christiansen
Tom Christiansen added the comment: Guido van Rossum wrote on Sat, 27 Aug 2011 03:26:21 -0000: > To me, making (default) iteration deviate from indexing is anathema. So long as there's a way to iterate through a string some other way than by code unit, that's fine. However, the Java way

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-26 Thread Terry J. Reedy
Terry J. Reedy added the comment: PEP-393 will take care of iterating by code points. Where would you have other iterators go? The string module? Something else I have not thought of? Or something new?

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-26 Thread Guido van Rossum
Guido van Rossum added the comment: To me, making (default) iteration deviate from indexing is anathema. However, there is nothing wrong with providing a library function that takes a string and returns an iterator that iterates over code points, joining surrogate pairs as needed. You could eve
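
A library function of the kind described here might look like the following sketch. The name iter_code_points is hypothetical; the function treats the input as a sequence of UTF-16-style code units and yields integer code points, passing unpaired surrogates through unchanged.

```python
def iter_code_points(s):
    """Yield code points from s, joining surrogate pairs as needed."""
    i, n = 0, len(s)
    while i < n:
        high = ord(s[i])
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= high <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        # BMP characters, astral characters on wide storage, lone surrogates.
        yield high
        i += 1
```

Indexing and len() stay untouched, so default iteration still agrees with indexing; only callers that ask for this iterator see joined pairs.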

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-26 Thread Terry J. Reedy
Terry J. Reedy added the comment: My proposal is better than log(N) in 2 respects. 1) There need only be a time penalty when there are non-BMP chars and indexing currently gives the 'wrong' answer and therefore when a time-penalty should be acceptable. Lookup for normal all-BMP strings could

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-26 Thread Guido van Rossum
Guido van Rossum added the comment: Wow. A very educational discussion. We will be referencing this issue for many years to come. As long as the buck stops with me, I feel strongly that *today* changing indexing from O(1) to O(log N) is a bad idea, partly for technical reasons, partly beca

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-24 Thread Glenn Linderman
Glenn Linderman added the comment: In msg142098 Ezio said: > Keep in mind that we should be able to access and use lone surrogates too, > therefore: > s = '\ud800' # should be valid > len(s) # should this raise an error? (or return 0.5 ;)? I say: For streams and data types in which lone sur

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-23 Thread Terry J. Reedy
Changes by Terry J. Reedy: Added file: http://bugs.python.org/file23025/utf16.py

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-22 Thread STINNER Victor
Changes by STINNER Victor: -- nosy: +haypo

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-22 Thread Terry J. Reedy
Changes by Terry J. Reedy: Removed file: http://bugs.python.org/file22900/utf16.py

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-22 Thread Terry J. Reedy
Terry J. Reedy added the comment: I improved UTF16.__getitem__ to handle negative indexes and slices. The latter uses the same adjustment as for indexes. An __iter__ method is not needed as str.__iter__ used __getitem__. I will take further discussion of this prototype to the python-ideas list.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Matthew Barnett
Matthew Barnett added the comment: For what it's worth, I've had an idea about string storage, roughly based on how *nix stores data on disk. If a string is small, point to a block of codepoints. If a string is medium-sized, point to a block of pointers to codepoint blocks. If a string is large

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: > Keep in mind that we should be able to access and use lone surrogates too, > therefore: > s = '\ud800' # should be valid > len(s) # should this raise an error? (or return 0.5 ;)? > s[0] # error here too? > list(s) # here too? > > p = s + '\udc00' > l

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Mon, 15 Aug 2011 04:56:55 -0000: > Another thing I noticed is that (at least on wide builds) surrogate pairs are > not joined "on the fly": > >>> p > '\ud800\udc00' > >>> len(p) > 2 > >>> p.encode('utf-16').decode('utf-16') > '𐀀' > >>

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Ezio Melotti
Ezio Melotti added the comment: Keep in mind that we should be able to access and use lone surrogates too, therefore: s = '\ud800' # should be valid len(s) # should this raise an error? (or return 0.5 ;)? s[0] # error here too? list(s) # here too? p = s + '\udc00' len(p) # 1? s[0] # '\U0

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Terry J. Reedy
Terry J. Reedy added the comment: >It is always better to deliver more than you say than to deliver less. Except when promising too little is a copout. >Everyone always talks about how important they're sure O(1) access must be, I thought that too until your challenge. But now that you mention it,

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: I wrote: >> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. > So I'm finding. Perhaps that's why I keep getting confused. I do have a > pretty firm > notion of what UCS-2 and UTF-16 are, and so I get sometimes > self-contradictory resu

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: "Terry J. Reedy" wrote on Mon, 15 Aug 2011 00:26:53 -0000: > PS: The OSCON link in msg142036 currently gives me 404 not found Sorry, I wrote http://training.perl.com/OSCON/index.html but meant http://training.perl.com/OSCON2011/index.html

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Matthew Barnett
Matthew Barnett added the comment: Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Terry J. Reedy
Terry J. Reedy added the comment: Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, th
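
The difference between counting code units and counting code points can be reproduced on a current Python by encoding to UTF-16 explicitly. This is a sketch assuming Python 3.3 or later; the helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(s):
    """Number of 16-bit code units s occupies in UTF-16 (no BOM)."""
    return len(s.encode("utf-16-le")) // 2

s = "a\U00010000"  # one BMP character plus one non-BMP character

code_points = len(s)              # what len() reports under PEP 393
code_units = utf16_code_units(s)  # what a narrow build's len() reported
```

The extra unit comes from the surrogate pair that UTF-16 needs for the non-BMP character.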

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Terry J. Reedy
Terry J. Reedy added the comment: This is off-topic, but there was discussion on whether or not to have a 2.7. The decision was to focus on back-porting things that would make the eventual transition to 3.x easier.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Terry J. Reedy
Terry J. Reedy added the comment: Tom, I appreciate your taking the time to help us improve our Unicode story. I agree that the compromises made a decade ago need to be revisited and revised. I think it will help if you better understand our development process. Our current *intent* is that '

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Ezio Melotti
Ezio Melotti added the comment: 2.7 is the last 2.x. There won't be any 2.8 (also I never heard that 2.6 was supposed to be the last). We already have 2.7.2, and we will continue with 2.7.3, 2.7.4, etc for a few more years. Eventually 2.7 will only get security fixes and the development wil

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 17:46:55 -0000: >> I'm a bit confused on this. You no longer fix bugs in Python 2? > We do, but it's unlikely that we will introduce major changes in behavior. > Even if we had to get rid of narrow builds and/or

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Ezio Melotti
Ezio Melotti added the comment: > I'm a bit confused on this. You no longer fix bugs in Python 2? We do, but it's unlikely that we will introduce major changes in behavior. Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e.

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Antoine Pitrou
Antoine Pitrou added the comment: > > The UTF-8 codec described by RFC 2279 didn't say so, so, since our > > codec was following RFC 2279, it was producing valid UTF-8. With RFC > > 3629 a number of things changed in a non-backward compatible way. > > Therefore we couldn't just change the behav

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 07:15:09 -0000: > For example I don't think removing the 0x10FFFF upper limit is going to > happen -- even if it might be useful for other things. I agree entirely. That's why I appended a triple exclamation poin

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti wrote on Sun, 14 Aug 2011 07:15:09 -0000: >> Unicode says you can't put surrogates or noncharacters in a >> UTF-anything stream. It's a bug to do so and pretend it's a >> UTF-whatever. > The UTF-8 codec described by RFC 2279 didn't say so,

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Jeremy Kloth
Changes by Jeremy Kloth: -- nosy: +jkloth

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Ezio Melotti
Ezio Melotti added the comment: > If speed is more important than correctness, I can make any algorithm > infinitely fast. Given the choice between correct and quick, I will > take correct every single time. It's a trade-off. Using non-BMP chars is fairly unusual (many real-world applicatio

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen added the comment: Ezio Melotti added the comment: >> It is simply a design error to pretend that the number of characters >> is the number of code units instead of code points. A terrible and >> ugly one, but it does not mean you are UCS-2. > If you are referring to the val

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Ezio Melotti
Ezio Melotti added the comment: > It is simply a design error to pretend that the number of characters > is the number of code units instead of code points. A terrible and > ugly one, but it does not mean you are UCS-2. If you are referring to the value returned by len(unicode_string), it is t

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen added the comment: >> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. >> Perhaps someone could tell me why the Python documentation says it uses >> UCS-2 on a narrow build. > There's a disagreement on that point between several developers. > See an exa

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett
Matthew Barnett added the comment: You're right about starting the second search from where the first finished. Caching the position would be an advantage there. The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space than the current format. Regex isn't used as much as

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen added the comment: Antoine Pitrou wrote on Sat, 13 Aug 2011 21:09:52 -0000: > And/or a lookup table giving the byte offset of, say, every 16th > character. It gives you an O(1) lookup with a relatively reasonable > constant cost (you have to scan for less than 16 characters
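
The lookup-table idea quoted above can be sketched in pure Python. Everything here is an illustrative assumption rather than a proposal from the thread: the class name IndexedUtf8, the stride of 16, and the restriction to text without lone surrogates (so that strict UTF-8 encoding succeeds).

```python
class IndexedUtf8:
    """UTF-8 storage with a byte-offset table for every 16th character."""

    STRIDE = 16

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.offsets = []  # byte offsets of characters 0, 16, 32, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % self.STRIDE == 0:
                self.offsets.append(pos)
            pos += len(ch.encode("utf-8"))

    @staticmethod
    def _width(lead_byte):
        # Length of a UTF-8 sequence, determined by its lead byte.
        if lead_byte < 0x80:
            return 1
        if lead_byte < 0xE0:
            return 2
        if lead_byte < 0xF0:
            return 3
        return 4

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        # Jump to the nearest checkpoint, then scan at most 15 characters.
        pos = self.offsets[index // self.STRIDE]
        for _ in range(index % self.STRIDE):
            pos += self._width(self.data[pos])
        width = self._width(self.data[pos])
        return self.data[pos:pos + width].decode("utf-8")
```

The table costs one offset per 16 characters, and each lookup scans a bounded number of characters, which is where the constant-time bound comes from.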

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen added the comment: Matthew Barnett wrote on Sat, 13 Aug 2011 20:57:40 -0000: > There are occasions when you want to do string slicing, often of the form: > pos = my_str.index(x) > endpos = my_str.index(y) > substring = my_str[pos : endpos] Me, I would probably give

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Antoine Pitrou
Antoine Pitrou added the comment: > There are occasions when you want to do string slicing, often of the form: > > pos = my_str.index(x) > endpos = my_str.index(y) > substring = my_str[pos : endpos] > > To me that suggests that if UTF-8 is used then it may be worth > profiling to see whether c

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett
Matthew Barnett added the comment: There are occasions when you want to do string slicing, often of the form: pos = my_str.index(x) endpos = my_str.index(y) substring = my_str[pos : endpos] To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the las

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Antoine Pitrou
Antoine Pitrou added the comment: > Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. > Perhaps someone could tell me why the Python documentation says it uses > UCS-2 on a narrow build. There's a disagreement on that point between several developers. See an example sub-

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen added the comment: David Murray wrote: > Tom, note that nobody is arguing that what you are requesting is a bad > thing :) There looked to be some minor resistance, based on absolute backwards compatibility even if wrong, regarding changing anything *at all* in re, even thing

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread R. David Murray
R. David Murray added the comment: Tom, note that nobody is arguing that what you are requesting is a bad thing :) As far as I know, Matthew is the only one currently working on the regex support in Python. (Other developers will commit small fixes if someone proposes a patch, but no one tha

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Matthew Barnett
Matthew Barnett added the comment: In a narrow build, a codepoint in the astral plane is encoded as a surrogate pair. I could implement a workaround for it in the regex module, but I think that the proper place to fix it is in the language as a whole, perhaps by implementing PEP 393 ("Flexible
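
For reference, the surrogate-pair encoding of an astral code point follows the standard UTF-16 arithmetic. The sketch below uses a hypothetical helper name, to_surrogate_pair, and cross-checks the result against Python's own utf-16-be codec.

```python
import struct

def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000  # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogate_pair(0x1F600)

# The codec produces the same two 16-bit units, in big-endian order.
codec_units = struct.unpack(">2H", chr(0x1F600).encode("utf-16-be"))
```

A narrow build exposed these two units as separate string elements, which is why a regex character class could not treat the astral character as a single unit.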

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett: -- nosy: +mrabarnett

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Tom Christiansen
Tom Christiansen added the comment: "Terry J. Reedy" wrote on Fri, 12 Aug 2011 22:21:59 -0000: > Does the regex module handle these particular issues better? No, it currently does not. One would have to ask Matthew directly, but I believe it was because he was trying to stay compatible w

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Terry J. Reedy
Terry J. Reedy added the comment: Does the regex module handle these particular issues better? -- nosy: +terry.reedy type: behavior -> feature request versions: +Python 3.3 -Python 2.7

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis: -- nosy: +Arfrever

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-11 Thread R. David Murray
R. David Murray added the comment: This is an acknowledged problem with Python narrow builds, and applies to much more than just regex processing. -- nosy: +r.david.murray

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen : Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTF‑*". Becaus