On 19/08/12 19:48:06, Paul Rubin wrote:
> Terry Reedy writes:
py> s = chr(0xFFFF + 1)
>> py> a, b = s
> That looks like a 3.2- narrow build. Such builds treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but a compromise d
Steven D'Aprano:
Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only ever want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, w
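The cost Steven describes can be sketched concretely. This is a hypothetical helper, not code from the thread: reaching the n-th character of a UTF-8 buffer forces a linear scan from the start, because code points occupy 1-4 bytes each.

```python
def utf8_char_at(buf, index):
    """Return the character at `index` in a UTF-8 byte buffer.

    Hypothetical helper: finding the n-th character requires walking
    over every earlier character -- the O(N) cost in question.
    """
    def width(lead):
        # Character width, determined from its lead byte.
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

    pos = 0
    for _ in range(index):
        pos += width(buf[pos])
    return buf[pos:pos + width(buf[pos])].decode('utf-8')

s = 'abé…x'
buf = s.encode('utf-8')
assert utf8_char_at(buf, 3) == '…'   # had to skip a, b, and the 2-byte é
```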
"Blind Anagram" writes:
> This is an average slowdown by a factor of close to 2.3 on 3.3 when
> compared with 3.2.
>
> I am not posting this to perpetuate this thread but simply to ask
> whether, as you suggest, I should report this as a possible problem with
> the beta?
Being a beta release, is
Steven D'Aprano writes:
> Paul Rubin already told you about his experience using OCR to generate
> multiple terrabytes of text, and how he would not be happy if that was
> stored in UCS-4.
That particular text was stored on disk as compressed XML that had UTF-8
in the data fields, but I think R
On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
> On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
>
> > But they are not ascii pages, they are (as stated) MOSTLY ascii.
> > E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
> > a much more memory-expensive enco
On Mon, 20 Aug 2012 00:44:22 -0400, Roy Smith wrote:
> In article <5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com>,
> Steven D'Aprano wrote:
>
>> > So it may be with utf-8 someday.
>>
>> Only if you believe that people's ability to generate data will remain
>> lower than people's ability
In article <5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com>,
Steven D'Aprano wrote:
> > So it may be with utf-8 someday.
>
> Only if you believe that people's ability to generate data will remain
> lower than people's ability to install more storage.
We're not talking *data*, we're talki
On Sun, 19 Aug 2012 19:24:30 -0400, Roy Smith wrote:
> In the primordial days of computing, using 8 bits to store a character
> was a profligate waste of memory. What on earth did people need with
> TWO cases of the alphabet
That's obvious, surely? We need two cases so that we can distinguish
On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy wrote:
> On 8/19/2012 6:42 PM, Chris Angelico wrote:
>> However, Python goes a bit further by making it VERY clear that this
>> is a mere optimization, and that Unicode strings and bytes strings are
>> completely different beasts. In Pike, it's possibl
On 8/19/2012 6:42 PM, Chris Angelico wrote:
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy wrote:
Python has often copied or borrowed, with adjustments. This time it is the
first.
I should have added 'that I know of' ;-)
Maybe it wasn't consciously borrowed, but whatever innovation is done,
On Monday, August 20, 2012 1:03:34 AM UTC+8, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
> On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>
> [...]
>
> If you can consistently replicate a 100% to
In article ,
Chris Angelico wrote:
> Really, the only viable alternative to PEP 393 is a fixed 32-bit
> representation - it's the only way that's guaranteed to provide
> equivalent semantics. The new storage format is guaranteed to take no
> more memory than that, and provide equivalent function
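Chris's guarantee can be checked empirically on a PEP 393 build (CPython 3.3+). The exact `sys.getsizeof` numbers vary by version, so this sketch only asserts relative sizes:

```python
import sys

# Per-character width is chosen from the widest code point present.
ascii_s  = 'a' * 1000            # fits in 1 byte/char
bmp_s    = '\u20ac' * 1000       # euro sign: 2 bytes/char
astral_s = '\U0001F600' * 1000   # supplementary plane: 4 bytes/char

assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
# ...and even the worst case stays within a fixed-width UCS-4 layout
# plus a constant-size header.
assert sys.getsizeof(astral_s) <= 4 * 1000 + 100
```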
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy wrote:
> On 8/19/2012 4:04 AM, Paul Rubin wrote:
>> I realize the folks who designed and implemented PEP 393 are very smart
>> cookies and considered stuff carefully, while I'm just an internet user
>> posting an immediate impression of something I hadn
On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote:
Well, it seems some software producers know what they
are doing.
>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
On 8/19/2012 1:03 PM, Blind Anagram wrote:
Running Python from a Windows command prompt, I got the following on
Python 3.2.3 and 3.3 beta 2:
"python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 39.3 usec per loop
"python33\python" -m timeit "('ab…' * 1000).repl
On Sun, 19 Aug 2012 18:03:34 +0100, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
> > If you can consistently replicate a 100% to 1000% slowdown in string
> > handling, please report it as a performance bug:
> >
> > htt
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:
> On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
> wrote:
[...]
>> The PEP explicitly states that it only uses a 1-byte format for ASCII
>> strings, not Latin-1:
>
> I think you misunderstand the PEP then, because that is empirically
> fals
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:
> Terry Reedy writes:
>> I would call it O(k), where k is a selectable constant. Slowing access
>> by a factor of 100 is hardly acceptable to me.
>
> If k is constant then O(k) is the same as O(1). That is how O notation
> works.
You might
Ian Kelly writes:
> >>> print(type(bytes(range(256)).decode('latin1')))
> <class 'str'>
Thanks.
--
http://mail.python.org/mailman/listinfo/python-list
On 19/08/2012 19:11, wxjmfa...@gmail.com wrote:
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly wrote:
> Note that this only describes the structure of "compact" string
> objects, which I have to admit I do not fully understand from the PEP.
> The wording suggests that it only uses the PyASCIIObject structure,
> not the derived structures. It the
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin wrote:
> Ian Kelly writes:
> >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
> 329
>
> Please try:
>
>print (type(bytes(range(256)).decode('latin1')))
>
> to make sure that what comes back is actually a unicode string rather
> than a byte st
Ian Kelly writes:
> >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
> 329
Please try:
print (type(bytes(range(256)).decode('latin1')))
to make sure that what comes back is actually a unicode string rather
than a byte string.
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
>
> But they are not ascii pages, they are (as stated) MOSTLY ascii.
> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
> a much more memory-expensive encoding than UTF-8.
>
Imagine a US banking applicat
On 08/19/2012 01:03 PM, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
> On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>
> [...]
> If you can consistently replicate a 100% to 1000% slowdown in string
> handling, plea
wrote in message
news:5dfd1779-9442-4858-9161-8f1a06d56...@googlegroups.com...
On Sunday, 19 August 2012 19:03:34 UTC+2, Blind Anagram wrote:
"Steven D'Aprano" wrote in message
news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python. If Python is going to have the option of a 1-byte
>> representation (and
Terry Reedy writes:
>> Meanwhile, an example of the 393 approach failing:
> I am completely baffled by this, as this example is one where the 393
> approach potentially wins.
What? The 393 approach is supposed to avoid memory bloat and that
does the opposite.
>> I was involved in a project that
On Sunday, 19 August 2012 19:03:34 UTC+2, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
> On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>
> [...]
>
> If you can consistently replicate a 100% to
On 8/19/2012 4:04 AM, Paul Rubin wrote:
Meanwhile, an example of the 393 approach failing:
I am completely baffled by this, as this example is one where the 393
approach potentially wins.
I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the char
"Steven D'Aprano" wrote in message
news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
[...]
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.or
On 8/19/2012 4:54 AM, wxjmfa...@gmail.com wrote:
About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period.
Repeating a false claim over and over does not make it true. Two people
on pydev claim that 3.3 is *f
On 19/08/12 15:25, Steven D'Aprano wrote:
Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many n
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.
Pretty much. But UTF-16 with lax support for
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N) even
> with fixed-width representation, because of the char copying.
Technically, yes. But it's a straight copy of a c
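Steven's "straight copy" can be confirmed from Python itself. A CPython-specific sketch: the slice is a new object that owns its own character data, not a view into the original, so the copy is O(N) in the slice length.

```python
import sys

text = 'x' * 1_000_000
tail = text[10:]

assert tail is not text                # a new object, not a shared view
assert len(tail) == len(text) - 10
assert sys.getsizeof(tail) > 900_000   # it owns ~1 MB of its own data
```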
On 19/08/12 11:19, Chris Angelico wrote:
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
wrote:
The date stamp is different but the Python version is the same
Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.
Ah ...
I
On 19/08/2012 09:54, wxjmfa...@gmail.com wrote:
About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing an
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
wrote:
> The date stamp is different but the Python version is the same
Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.
ChrisA
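Chris's check can be sketched as follows; the two possible values of `sys.maxunicode` distinguish the builds:

```python
import sys

# Pre-3.3 narrow builds store 16-bit code units and report 0xFFFF;
# wide builds (and every PEP 393 build) report the full 0x10FFFF.
assert sys.maxunicode in (0xFFFF, 0x10FFFF)
if sys.maxunicode == 0xFFFF:
    print('narrow build: astral chars occupy two code units')
else:
    print('wide build: one code point per character')
```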
On 19/08/12 07:09, Steven D'Aprano wrote:
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".
Thank you for this excellent post,
it has certainly cleared up a few things for me
[snip]
incidentally
>
On Sunday, 19 August 2012 10:56:36 UTC+2, Steven D'Aprano wrote:
> internal implementation, and strings which fit exactly in Latin-1 will
And this is the crucial point: latin-1 is an obsolete and unusable
coding scheme (esp. for European languages).
We fall on the point I mentioned ab
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
> Steven D'Aprano wrote:
>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
>
> From
>
> Python
About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.
The real pro
Chris Angelico writes:
> And of course, taking the *entire* rest of the string isn't the only
> thing you do. What if you want to take the next six characters after
> that index? That would be constant time with a fixed-width storage
> format.
How often is this an issue in practice?
I wonder how
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin wrote:
> Steven D'Aprano writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N)
> even with fixed-width representation, because of the char copying.
>
> if it is near the end, by knowing where the string
Steven D'Aprano writes:
> result = text[end:]
if end not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.
if it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backward
Steven D'Aprano writes:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".
I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.
It
Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting t
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:
> The problem with strings containing surrogate pairs is that you could
> inadvertently slice the string in the middle of the surrogate pair.
That's the *least* of the problems with surrogate pairs. That would be
easy to fix: check the point of the
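The hazard MRAB and Steven describe can be illustrated without a narrow build by looking at the UTF-16 code units directly (an illustrative sketch, not thread code):

```python
import struct

s = '\U0001F600'                 # a supplementary-plane character
units = s.encode('utf-16-le')    # how a narrow build stores it
assert len(units) == 4           # two 16-bit code units, one character

# Cutting between the two code units leaves a lone surrogate,
# which is not a valid Unicode string on its own.
high, low = struct.unpack('<2H', units)
assert 0xD800 <= high <= 0xDBFF  # high (lead) surrogate
assert 0xDC00 <= low <= 0xDFFF   # low (trail) surrogate
```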
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
> "a" will be stored as 1 byte/codepoint.
>
> Adding "é", it will still be stored as 1 byte/codepoint.
Wrong. It will be 2 bytes, just like it already is in Python 3.2.
I don't know where people are getting this myth that PEP 393 uses Latin-1
int
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.
Just to be clear:
If you have many strings which are *mostly* BMP,
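The "mostly BMP" caveat is easy to demonstrate on a PEP 393 build: one supplementary-plane character widens the whole string to 4 bytes per code point. Exact sizes are version-dependent, so this sketch only asserts the ratio:

```python
import sys

mostly_ascii = 'a' * 10_000
with_astral  = 'a' * 10_000 + '\U0001F600'   # one extra character

# The single supplementary character forces 4 bytes per code point
# for the entire string.
assert sys.getsizeof(with_astral) > 3 * sys.getsizeof(mostly_ascii)
```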
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:
> As I understand (I think) the underlying mechanism, I can only say it is
> not a surprise that it happens.
>
> Imagine an editor: I type an "a", internally the text is saved as ascii,
> then I type an "é", the text can only be saved in at lea
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:
>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This is always the same song. Memory.
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>>
> No offense here. But t
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".
On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>>
Chris Angelico writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.
If it's pure ASCII, you can use the bytes or bytearray type.
> It's not so much 'random
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin wrote:
> Chris Angelico writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know
Chris Angelico writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.
I know some systems chop the strings into blocks of (say) a few
hund
On 8/18/2012 4:09 PM, Terry Reedy wrote:
print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0
This does not make sense to me and I will ask about it.
I did ask on pydev list and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin wrote:
> Chris Angelico writes:
> >>> "asdfqwer"[4:]
> 'qwer'
>>
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
>
> Scanning 4 chara
Chris Angelico writes:
> >>> "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.
Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin wrote:
> Chris Angelico writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?
>
> How often do you need to do that, as opposed to traversing the string by
> iteratio
Chris Angelico writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character?
How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
i
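Paul's rope-like/index suggestion can be sketched as follows (hypothetical helpers, not from any library): record the byte offset of every k-th character in one pass, and indexing then costs O(k) instead of O(N).

```python
def build_index(buf, k=64):
    """Record the byte offset of every k-th character (one O(N) pass)."""
    offsets = []
    count = 0
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:          # UTF-8 lead byte, not continuation
            if count % k == 0:
                offsets.append(pos)
            count += 1
    return offsets

def char_at(buf, index, offsets, k=64):
    """O(k) lookup: jump to the nearest recorded offset, then scan."""
    pos = offsets[index // k]
    remaining = index % k
    while remaining:
        pos += 1
        if buf[pos] & 0xC0 != 0x80:      # stepped onto the next character
            remaining -= 1
    end = pos + 1                        # advance past continuation bytes
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1
    return buf[pos:end].decode('utf-8')

s = 'ab…' * 100                          # mixed 1- and 3-byte characters
buf = s.encode('utf-8')
idx = build_index(buf)
assert char_at(buf, 273, idx) == s[273]
```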
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 fo
On 18/08/2012 21:22, wxjmfa...@gmail.com wrote:
On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
On Aug 18, 10:59 pm, Steven D'Aprano wrote:
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
Is there any reason why non ascii users are somehow penalized compared
to ascii users?
On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
> On Aug 18, 10:59 pm, Steven D'Aprano
> +comp.lang.pyt...@pearwood.info> wrote:
> > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > > Is there any reason why non ascii users are somehow penalized compared
> > > to ascii user
On 8/18/2012 12:38 PM, wxjmfa...@gmail.com wrote:
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.
You have not tried enough tests ;-).
On my Win7-64 system:
from timeit import timeit
print(timeit("
On 18/08/2012 19:40, rusi wrote:
On Aug 18, 10:59 pm, Steven D'Aprano wrote:
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
Is there any reason why non ascii users are somehow penalized compared
to ascii users?
Of course there is a reason.
If you want to represent 1114111 different ch
On 18/08/2012 19:30, wxjmfa...@gmail.com wrote:
On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character re
On 18/08/2012 19:26, Paul Rubin wrote:
Steven D'Aprano writes:
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with inval
On Aug 18, 10:59 pm, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string,
On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> >> [...]
> >> The problem with UCS-4 is that every character requires four bytes.
> >> [..
On 18/08/2012 19:05, wxjmfa...@gmail.com wrote:
On Saturday, 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:
Proof that is acceptable to everybody please, not just yourself.
I can't; I'm only facing the fact that it works slower on my
Windows platform.
As I understand (I think) the underlying
Steven D'Aprano writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two code points. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with invalid unicode string. Not good.)
.
On Saturday, 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:
> Proof that is acceptable to everybody please, not just yourself.
I can't; I'm only facing the fact that it works slower on my
Windows platform.
As I understand (I think) the underlying mechanism, I
can only say it is not a surpr
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are explaining). This
> always the
On 18/08/2012 17:38, wxjmfa...@gmail.com wrote:
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.
Proof that is acceptable to everybody please, not just yourself.
Now, the reason. I think it is due
On Sun, Aug 19, 2012 at 2:38 AM, wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.
Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace l
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.
Now, the reason. I think it is due to the "flexible representation".
Deeper reason. The "boss" does not wish to hear from a (pure)
ucs-4/utf-32 "engine" (this h
On Sat, Aug 18, 2012 at 9:07 AM, wrote:
> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This is always the same song. M
On Sun, Aug 19, 2012 at 1:07 AM, wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This is always the same song. Memory.
>
> Let me ask. Is Python an "American" product for US users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow
On 18/08/2012 16:07, wxjmfa...@gmail.com wrote:
On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]
I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.
(Resending this to the list because I previously sent it only to
Steven by mistake. Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)
On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
> wrote:
>>
>> U
On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> [...]
> The problem with UCS-4 is that every character requires four bytes.
> [...]
I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.
Let me ask. Is Python an "American" product
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
> >>> sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
> >>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> >>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> 0.8158757139801764
sys
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764
>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36)
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:
> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:
> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM, wrote:
>>
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > c
On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-
On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>
> The internal coding described in PEP 393 has nothing to do with latin-1
> encoding.
It certainly does. PEP 393 provides for Unicode strings to be represented
internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
sufficient to con
On 08/17/2012 02:45 PM, wxjmfa...@gmail.com wrote:
> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> I don't understand what any of this has to do with Python. Just
>> output your text in UTF-8 like any civilized person in the 21st
>> century, and none of that is a pr
On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
> On Fri, Aug 17, 2012 at 1:49 PM, wrote:
> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> > is one of these characters existing in the cp1252, mac-roman
> > coding schemes and not in iso-8859-1 (latin-1) and obvio
On Fri, Aug 17, 2012 at 1:49 PM, wrote:
> The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> is one of these characters existing in the cp1252, mac-roman
> coding schemes and not in iso-8859-1 (latin-1) and obviously
> not in ascii. It causes Py3.3 to work a few 100% slower
> than Py<3.3 ve
On Friday, 17 August 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
> >>>
> 8230
>
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
On Thu, 16 Aug 2012 15:09:47 -0700, Charles Jensen wrote:
> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the
> horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a
> varia
a = '…'
print(ord(a))
>>>
8230
Most things with unicode are easier in 3.x, and some are even better in
3.3. The current beta is good enough for most informal work. 3.3.0 will
be out in a month.
--
Terry Jan Reedy
On 08/16/2012 06:09 PM, Charles Jensen wrote:
> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the horizontal
> ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a
> variable?
>
> So
On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen
wrote:
> How would I use ord() to find the unicode value of a string stored in a
> variable?
>
> So the following 2 lines of code will give me the ascii value of the variable
> a. How do I specify ord to give me the unicode value of a?
>
> a
Everyone knows that the python command
ord(u'…')
will output the number 8230 which is the unicode character for the horizontal
ellipsis.
How would I use ord() to find the unicode value of a string stored in a
variable?
So the following 2 lines of code will give me the ascii value of th
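For what the original question asks, ord() already accepts a variable; a one-character string behaves the same whether it is a literal or a name (minimal sketch):

```python
a = '…'                  # HORIZONTAL ELLIPSIS, U+2026
assert ord(a) == 8230    # identical to ord('…') on the literal
assert chr(ord(a)) == a  # chr() is the inverse of ord()
```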