Tom Christiansen added the comment:
>Martin v. Löwis added the comment:
>> I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
>> which fully implements tr11.
>Thanks for the pointer!
>> If you'd like, I can show you a program t
Tom Christiansen added the comment:
>Martin v. Löwis added the comment:
>> Martin, I think you meant to write "if w == 'A':".
>> Some very common characters have ambiguous widths though (e.g. the Greek
>> alphabet), so you can't just raise
Tom Christiansen added the comment:
I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
which fully implements tr11. It includes Unicode::GCString, a class
that has a columns() method to determine the print columns. This is very
fancy in the case of Asian widths, but of
Tom Christiansen added the comment:
Yes, it looks good. Thank you very much.
-tom
--
___
Python tracker
<http://bugs.python.org/issue12753>
___
___
Python-bug
Tom Christiansen added the comment:
> Martin v. Löwis added the comment:
> I think the WideCharToMultibyte approach is just incorrect.
> I'm -1 on using wcswidth, though.
Like you, I too seriously question using wcswidth() for this at all:
The wcswidth() function either
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sun, 09 Oct 2011 13:21:00 -:
> Here is a new patch that stores the names of aliases and named
> sequences in the Private Use Area.
Looks good! Thanks!
--tom
--
title: \N{...} neglects formal aliases and
Tom Christiansen added the comment:
Ezio Melotti wrote
on Mon, 03 Oct 2011 04:15:51 -:
>> But it still has to happen at compile time, of course, so I don't know
>> what you could do in Python. Is there any way to change how the compiler
>> behaves even vaguely
Tom Christiansen added the comment:
>> Really? White space makes things harder to read? I thought Pythonistas
>> believed the opposite of that.
> I was surprised at that too ;-). One person's opinion in a specific
> context. Don't generalize.
The example I init
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sun, 02 Oct 2011 06:46:26 -:
> Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
> because that's a Unicode 1 name, and nowadays these codepoints are simply
> marked as ''.
Tom Christiansen added the comment:
>> Perl does not provide the old 1.0 names at all. We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner. However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
>>
Tom Christiansen added the comment:
Martin v. Löwis wrote
on Sat, 01 Oct 2011 10:59:48 -:
>> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
> Where did you get that definition from? UTS#18 defines
> "", which is Alphabetic + U+200C + U+200D
> (i.e.
Tom Christiansen added the comment:
>Ezio Melotti added the comment:
> Leaving named sequences for unicodedata.lookup() only (and not for
> \N{}) makes sense.
There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues. If t
Tom Christiansen added the comment:
> Martin v. Löwis added the comment:
> "Split S into words. Change the first letter in a word to upper-case,
Except that I think you actually mean that the first "letter" is
changed into titlecase not uppercase.
One might also sa
Tom Christiansen added the comment:
It appears that I'm right about surrogates, but wrong about
noncharacters. I'm seeking a clarification there.
--tom
Tom Christiansen added the comment:
No good news on the Java front. They do all kinds of things wrong.
For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8
input stream, which is illegal. There's more they do wrong, including
in their documentation, but I won'
Tom Christiansen added the comment:
Ezio Melotti wrote
on Mon, 19 Sep 2011 11:11:48 -:
> We could also look at what other languages do and/or ask the
> Unicode Consortium.
I will look at what Java does a bit later on this morning, which is the
only other commonly used la
Tom Christiansen added the comment:
"Terry J. Reedy" wrote
on Thu, 08 Sep 2011 18:56:11 -:
>On 9/8/2011 4:32 AM, Ezio Melotti wrote:
>> So to summarize a bit, there are different possible levels of strictness:
>>1) all the possible encodable values,
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sat, 03 Sep 2011 00:28:03 -:
> Ezio Melotti added the comment:
> Or they are still called UTF-8 but used in combination with different error
> handlers, like surrogateescape and surrogatepass. The "plain
Tom Christiansen added the comment:
Antoine Pitrou wrote
on Mon, 29 Aug 2011 13:21:06 -:
> It's not only "typographically speaking", it's really a spelling error,
> even in hand-written text :-)
Sure, and so too is omitting an accent mark or diaeresi
Tom Christiansen added the comment:
Antoine Pitrou wrote on Sat, 27 Aug 2011 20:04:56
-:
>> Neither am I. Even in "old-style" English with ae and oe, one wrote
>> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
>> *Aesir. Simi
Tom Christiansen added the comment:
Guido van Rossum wrote
on Sat, 27 Aug 2011 16:15:33 -:
> Although personally I don't have much of an intuition for what
> titlecase means (and why it's important), perhaps because I'm not
> familiar with any language where t
Tom Christiansen added the comment:
Guido van Rossum wrote
on Fri, 26 Aug 2011 21:11:24 -:
> Would this also affect .islower() and friends?
SHORT VERSION: (7 lines)
I don't believe so, but the relationship between lower() and islower()
is not as clear to me as I wo
Tom Christiansen added the comment:
Guido van Rossum wrote
on Sat, 27 Aug 2011 03:26:21 -:
> To me, making (default) iteration deviate from indexing is anathema.
So long as there's a way to iterate through a string some other way
than by code unit, that's fine. Howe
Tom Christiansen added the comment:
Here’s my casing test suite; I thought I sent it in but the mux file here isn’t
the full thing.
It does several things, including letting you run it with regex vs re. It
also checks for the islower, etc functions. It has both simple and full (and
turkic
Tom Christiansen added the comment:
Guido van Rossum wrote
on Fri, 26 Aug 2011 21:11:24 -:
> Guido van Rossum added the comment:
> I presume this applies to builtin str methods like .lower(), right? I
> think it is a good thing to do for Python 3.3.
Yes, the full casemap
Tom Christiansen added the comment:
Guido van Rossum wrote
on Fri, 26 Aug 2011 21:55:03 -:
> I know I sound like NIH, but I'm always reluctant to add a big 3rd
> party lib like ICU to the permanent dependencies of all future Python
> distros. If people want to use IC
Tom Christiansen added the comment:
I should probably mention the importance in the design of a UCA module of
being able to specify which UCA version number you want it to behave like
in case you plan to override some of the DUCET entries. That way if you
run under a later UCA with different
Tom Christiansen added the comment:
Raymond Hettinger added the comment:
> I would like to be involved in the design of the API for a UCA module
> and its routines for loading Unicode Collation Element Tables (not
> making the mistake of using global state like the locale module d
Tom Christiansen added the comment:
Guido van Rossum wrote
on Fri, 26 Aug 2011 21:16:57 -:
> Yeah, this should be fixed in 3.3 and probably backported to 3.2
> and 2.7. (There is already no guarantee that len(s) ==
> len(s.title()), right?)
Well, *I* don't know of any
Tom Christiansen added the comment:
> Sounds like a fair feature request for Python 3.3, as long as the
> intention is that users must import some module from the standard
> library and use functions defined in that module. The operations and
> methods defined for str in
Tom Christiansen added the comment:
Matthew Barnett wrote
on Fri, 19 Aug 2011 23:36:45 -:
> For the "Line_Break" property, one of the possible values is
> "Inseparable", with 2 permitted aliases, the shorter "IN" (which
> is reasonable) and "
Tom Christiansen added the comment:
"Terry J. Reedy" wrote
on Fri, 19 Aug 2011 22:50:58 -:
> My current opinion is that adding the aliases might be done in current
> releases. It certainly would serve any user who does not know to
> misspell 'FTHORA'
Tom Christiansen added the comment:
Marc-Andre Lemburg wrote
on Tue, 16 Aug 2011 12:11:22 -:
> The reasoning behind e.g. "ISSURROGATE" is that those names originate
> from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE
> macros which in ret
Tom Christiansen added the comment:
Ezio Melotti wrote
on Tue, 16 Aug 2011 09:23:50 -:
> All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER
> and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them
> more readable though.
&
Tom Christiansen added the comment:
Antoine Pitrou wrote
on Tue, 16 Aug 2011 09:18:46 -:
>> I think the 4 macros:
>> #define _Py_UNICODE_ISSURROGATE
>> #define _Py_UNICODE_ISHIGHSURROGATE
>> #define _Py_UNICODE_ISLOWSURROGATE
>> #define _Py_UNICOD
Tom Christiansen added the comment:
I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.
I quote a few of these from http://unicode.org/faq/utf_bom.html below:
Q: How do
Tom Christiansen added the comment:
>Ezio Melotti added the comment:
>I think the 4 macros:
> #define _Py_UNICODE_ISSURROGATE
> #define _Py_UNICODE_ISHIGHSURROGATE
> #define _Py_UNICODE_ISLOWSURROGATE
> #define _Py_UNICODE_JOIN_SURROGATES
>are quite straightforward an
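For readers following the surrogate discussion, here is a rough Python sketch of the arithmetic that a macro such as _Py_UNICODE_JOIN_SURROGATES would have to perform. It is only the standard UTF-16 pairing formula; the function names are hypothetical and are not taken from the patch under review.

```python
def is_high_surrogate(code_unit):
    """True for a lead (high) surrogate, U+D800..U+DBFF."""
    return 0xD800 <= code_unit <= 0xDBFF


def is_low_surrogate(code_unit):
    """True for a trail (low) surrogate, U+DC00..U+DFFF."""
    return 0xDC00 <= code_unit <= 0xDFFF


def join_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, join_surrogates(0xD835, 0xDC9C) gives 0x1D49C, the code point of MATHEMATICAL SCRIPT CAPITAL A.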
Changes by Tom Christiansen :
Removed file: http://bugs.python.org/file22902/nametests.py
Python tracker
<http://bugs.python.org/issue12734>
Tom Christiansen added the comment:
Here’s the right test file for the right ticket.
--
Added file: http://bugs.python.org/file22903/nametests.py
Tom Christiansen added the comment:
>Terry J. Reedy added the comment:
>Adding Symbola filled in the symbols and emoticons lines.
>The gothic chars are still missing even with Alfios.
That's too bad, as the Gothic paternoster is kinda cute. :)
Hm, I wonder where I got them
Tom Christiansen added the comment:
Oh whoops, that was the long ticket. Shall I reupload to the right number?
Tom Christiansen added the comment:
Sorry I didn't include a test case. Hope this makes up for it. If not, please
tell me how to write better test cases. :(
Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical
alignment, but it really helps to make what is diff
Tom Christiansen added the comment:
>Terry J. Reedy added the comment:
> You are right, FF switched on me without notice. Bad FF. Thank you! What
> I now see makes much more sense.
>[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔
Tom Christiansen added the comment:
>Terry J. Reedy added the comment:
> My Firefox is already set at utf-8. More likely a font limitation. I
> will look again after installing one of the fonts Tom suggested.
Symbola is best for exotic glyphs, especially astral ones.
Alfios just l
New submission from Tom Christiansen :
Unicode character names share a common namespace with formal aliases and with
named sequences, but Python recognizes only the original name. That means not
everything in the namespace is accessible from Python. (If this is construed
to be an extant bug
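As a small illustration of the name lookup being discussed, the sketch below wraps unicodedata.lookup() so that an unknown name yields None instead of an exception. The helper name resolve_name is hypothetical. Whether a given formal alias or named sequence resolves depends on the Python version and its Unicode database, so the sketch makes no claim about any particular alias.

```python
import unicodedata


def resolve_name(name):
    """Return the string for a Unicode name, or None if the name is unknown.

    Hypothetical helper: it simply defers to unicodedata.lookup().
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


alpha = resolve_name("GREEK SMALL LETTER ALPHA")   # an ordinary character name
missing = resolve_name("NOT A REAL CHARACTER NAME")  # unknown name gives None
```

Trying the same helper on an alias or a named sequence is a quick way to see which part of the shared namespace a given interpreter can reach.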
Changes by Tom Christiansen :
--
nosy: +tchrist
Python tracker
<http://bugs.python.org/issue12746>
Tom Christiansen added the comment:
Ezio Melotti wrote on Mon, 15 Aug 2011 04:56:55 -:
> Another thing I noticed is that (at least on wide builds) surrogate pairs are
> not joined "on the fly":
> >>> p
> '\ud800\udc00'
> >>> len(p
Tom Christiansen added the comment:
I wrote:
>> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.
> So I'm finding. Perhaps that's why I keep getting confused. I do have a
> pretty firm
> notion of what UCS-2 and UTF-16 are, and
Tom Christiansen added the comment:
"Terry J. Reedy" wrote
on Mon, 15 Aug 2011 00:26:53 -:
> PS: The OSCON link in msg142036 currently gives me 404 not found
Sorry, I wrote
http://training.perl.com/OSCON/index.html
but meant
http://training.perl.
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sun, 14 Aug 2011 17:46:55 -:
>> I'm a bit confused on this. You no longer fix bugs in Python 2?
> We do, but it's unlikely that we will introduce major changes in behavior.
> Even if we had to get rid o
Tom Christiansen added the comment:
Ezio Melotti wrote on Sun, 14 Aug 2011 17:15:52 -:
>> You're right: my wide build is not Python3, just Python2.
> And is it failing? Here the tests pass on the wide builds, on both Python 2
> and 3.
Perhaps I am doing something
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sun, 14 Aug 2011 07:15:09 -:
> For example I don't think removing the 0x10 upper limit is going to
> happen -- even if it might be useful for other things.
I agree entirely. That's why I appended a tr
Tom Christiansen added the comment:
>Ezio Melotti added the comment:
>On wide 3.2 it passes too, so the failure is limited to narrow builds (are
>you sure that it fails on wide builds for you?).
You're right: my wide build is not Python3, just Python2. In fact,
it's
Tom Christiansen added the comment:
Ezio Melotti wrote
on Sun, 14 Aug 2011 07:15:09 -:
>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream. It's a bug to do so and pretend it's a
>> UTF-whatever.
> The UTF-8 codec
New submission from Tom Christiansen :
On neither narrow nor wide builds does this UTF8-encoded bit run without
raising an exception:
    import re

    # A character-class range over astral (non-BMP) code points.
    if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
        print("match 1 passed")
    else:
        print("match 1 failed")
Tom Christiansen added the comment:
Ezio Melotti added the comment:
>> It is simply a design error to pretend that the number of characters
>> is the number of code units instead of code points. A terrible and
>> ugly one, but it does not mean you are UCS-2.
> If you
Tom Christiansen added the comment:
>> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
>> Perhaps someone could tell me why the Python documentation says it uses
>> UCS-2 on a narrow build.
> There's a disagreement on that point betwee
Tom Christiansen added the comment:
Antoine Pitrou wrote
on Sat, 13 Aug 2011 21:09:52 -:
> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for les
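The lookup-table idea quoted above can be sketched roughly as follows for UTF-8 bytes. This is an illustration only, not code from any patch: the names STRIDE, build_offsets and char_at are hypothetical, and the input is assumed to be valid UTF-8.

```python
STRIDE = 16  # record the byte offset of every 16th code point


def build_offsets(data):
    """Return the byte offsets of code points 0, 16, 32, ... in UTF-8 bytes."""
    offsets = []
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count % STRIDE == 0:
                offsets.append(pos)
            count += 1
    return offsets


def char_at(data, offsets, index):
    """Return the code point at character position `index` (assumed in range)."""
    pos = offsets[index // STRIDE]      # jump to the nearest recorded offset
    for _ in range(index % STRIDE):     # then scan at most STRIDE - 1 characters
        pos += 1
        while data[pos] & 0xC0 == 0x80:
            pos += 1
    end = pos + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[pos:end].decode("utf-8")
```

The table costs one integer per 16 characters, and each lookup scans a bounded number of bytes, which is the trade-off the quoted comment describes.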
Tom Christiansen added the comment:
Matthew Barnett wrote
on Sat, 13 Aug 2011 20:57:40 -:
> There are occasions when you want to do string slicing, often of the form:
> pos = my_str.index(x)
> endpos = my_str.index(y)
> substring = my_str[pos : endpos]
Me, I wo
Tom Christiansen added the comment:
David Murray wrote:
> Tom, note that nobody is arguing that what you are requesting is a bad
> thing :)
There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re
Tom Christiansen added the comment:
"Terry J. Reedy" wrote
on Fri, 12 Aug 2011 23:05:27 -:
> Ouch!
> Do the rejected characters qualify as identifier characters as defined
> in Reference 2.3 Identifiers and keywords?
> http://docs.python.org/py3k/reference
Tom Christiansen added the comment:
> Terry J. Reedy added the comment:
> However desirable it would be, I do not believe there is any claim in the
> manual that the re module follows the evolving Unicode consortium r.e. stan
My from the hip thought is that if re cannot be
Tom Christiansen added the comment:
"Terry J. Reedy" wrote
on Fri, 12 Aug 2011 22:21:59 -:
> Does the regex module handle these particular issues better?
No, it currently does not. One would have to ask Matthew directly, but I
believe it was because he was trying to st
Tom Christiansen added the comment:
> Terry J. Reedy added the comment:
> I am not sure that everyone will agree that this is a bug, rather than a
> feature request, or that if a bug, that it should be changed in existing
> releases and possibly break running code. The d
Tom Christiansen added the comment:
Whoops, I meant that it appears that Python runs its identifiers through NFC.
How that gets along with a filesystem that has quasi-NFD filenames I'm not
sure, but it seems like it might be a variant of the case-insensitivity issue
in file
Tom Christiansen added the comment:
Please do not call this "utf-8-java". It is called "cesu-8" per UTR#26 at:
http://unicode.org/reports/tr26/
CESU-8 is *not* a valid Unicode Transformation Format and should not be called
UTF-8. It is a real pain in the butt, caused by pe
Tom Christiansen added the comment:
How does this work for modules that have filesystem names different from the
one used for import? The issue I'm thinking about is that the Mac HFS+
filesystem keeps its Unicode filenames in (close to) NFD form. That means that
a module named "c
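To make the NFC versus NFD concern concrete, here is a minimal sketch using unicodedata.normalize(). The helper same_name is hypothetical, and comparing names after NFC normalization is just one possible approach, not a description of what the import machinery does.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as one precomposed code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)


def same_name(a, b):
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed               # a plain comparison sees two strings
normalized_equal = same_name(composed, decomposed)
```

A plain string comparison treats the two spellings as different, which is the mismatch a filesystem storing (close to) NFD names could expose.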
Tom Christiansen added the comment:
I can attest that being able to get the columns of a grapheme cluster is very
important for printing, because you need this to do correct linebreaking.
There might be something you can steal from
http://search.cpan.org/perldoc?Unicode::GCString
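A very rough Python approximation of a columns() calculation is sketched below, using only the standard unicodedata module. It is not an implementation of tr11 or of Unicode::GCString: it counts wide and fullwidth characters as two columns, skips combining marks, treats ambiguous-width characters as one column, and ignores control characters and grapheme cluster rules. The function name columns is borrowed from the Perl method purely for illustration.

```python
import unicodedata


def columns(text):
    """Estimate the number of print columns for `text` (rough sketch only)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks add no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                    # wide and fullwidth take two columns
        else:
            width += 1                    # everything else, including ambiguous
    return width
```

Treating ambiguous-width characters as narrow is an assumption; a terminal in an East Asian locale may render them wide.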
Tom Christiansen added the comment:
I've been doing a lot of testing of Matthew's regex library against UTS#18 issues,
but only somewhat incidentally testing re. To use regex, one has to accept that
certain things will work differently than they work in re, because he is
followi
New submission from Tom Christiansen :
Python's string.title() function claims it titlecases the first letter in each
word and lowercases the rest. However, this is not true. It is not using
either of the two word detection algorithms that Unicode provides. One allows
you to use a l
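A short demonstration of the word-detection problem, using nothing but str.title(). The sample strings are made up for illustration; the point is only that title() treats any run of letters as a word, so an apostrophe starts a new one.

```python
samples = ["they're here", "hello world"]

# str.title() uppercases the first letter of every run of letters.
titled = [s.title() for s in samples]

for before, after in zip(samples, titled):
    print(repr(before), "->", repr(after))
```

The first sample comes out as "They'Re Here", which is not what a Unicode word-boundary algorithm would produce.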
New submission from Tom Christiansen :
Python's casemapping functions only use what Unicode calls simple casemaps.
These are only appropriate for functions that operate on single characters
alone, not for those that operate on strings. The reason for this is that you
get much better re
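The difference between simple and full casemaps shows up whenever a mapping changes the length of the string. The sketch below assumes Python 3.3 or later, where str.upper() applies the full mappings; on the versions this report was filed against the results differ. The helper name grows_when_uppercased is hypothetical.

```python
def grows_when_uppercased(text):
    """True if uppercasing changes the number of code points (hypothetical helper)."""
    return len(text.upper()) != len(text)


sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI

upper_sharp_s = sharp_s.upper()     # full casemap expands to two letters
upper_ligature = ligature.upper()   # full casemap expands to two letters
```

A simple (one-to-one) casemap cannot express either of these expansions, which is why it is only suitable for single-character operations.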
New submission from Tom Christiansen :
Python has no standard support for the Unicode Collation Algorithm as explained
in UTS #10. This is a request that a UCA library be added to the standard Python
distribution.
Collation underlies virtually everything we do with text, not just sorting but
any
New submission from Tom Christiansen :
Python supports no Unicode properties in its re library, making it unsuitable
for work with Unicode. This is therefore a formal request for the Python re
library to support Unicode properties.
The eleven properties required by Unicode Technical Report
New submission from Tom Christiansen :
Without proper grapheme support in the regular expression library, it is
impossible to correctly process Unicode. At the very least, one needs the \X
escape supported, which is an extended grapheme cluster per UTS#18. This escape
is supported by many
Changes by Tom Christiansen :
--
components: +Regular Expressions -Library (Lib)
type: -> behavior
Python tracker
<http://bugs.python.org/issue12728>
New submission from Tom Christiansen :
You cannot reliably use Unicode in Python identifiers because of the
narrow/wide build issue. The enclosed file is fine on wide builds but gets
compiler errors on narrow ones during compilation.
Go, Ruby, Java, and Perl all handle this situation without
New submission from Tom Christiansen :
You cannot use Python's lib re for handling Unicode regular expressions because
it violates the standard set out for the same in UTS#18 on Unicode Regular
Expressions in RL1.2a on compatibility properties. What \w is allowed to match
is cl
New submission from Tom Christiansen :
You cannot use Python's casemapping functions on Unicode data because they fail
on narrow builds. This makes it impossible to write portable code in Python
that can cope with full Unicode.
I've tried several times to submit this bug, bu
New submission from Tom Christiansen :
Python is in flagrant violation of the very most basic premises of Unicode
Technical Report #18 on Regular Expressions, which requires that a regex engine
support Unicode characters as "basic logical units independent of serialization
lik
New submission from Tom Christiansen :
The Python re library is broken in its approach to case-insensitive matches. It
erroneously attempts to compare lowercase mappings. This is wrong. You must
compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get
wrong answers. I
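The casefold versus lowercase distinction can be shown with plain strings. This sketch assumes Python 3.3 or later, where str.casefold() exists; it says nothing about how the re module compares characters internally. The helper name caseless_equal is hypothetical, and a complete caseless match would also need normalization, which is left out here.

```python
def caseless_equal(a, b):
    """Compare two strings by their Unicode casefolds (hypothetical helper)."""
    return a.casefold() == b.casefold()


left = "stra\u00dfe"   # 'straße'
right = "STRASSE"

by_lowercase = left.lower() == right.lower()   # lowercase maps leave 'ß' alone
by_casefold = caseless_equal(left, right)      # casefolding turns 'ß' into 'ss'
```

Comparing lowercase mappings reports a mismatch for this pair, while comparing casefolds reports a match.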
80 matches