Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
willie wrote:
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100

Re: unicode, bytes redux

2006-09-25 Thread Walter Dörwald
Steven D'Aprano wrote:
> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
>
>> willie writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and

Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
John Machin wrote:
>> > So all he needs is a boolean result: u.willitfit(encoding, width)
>>
>> at what point in the program would that method be used ?
>
> Never, I hope. Were you taking that as a serious suggestion? Fredrik,
> perhaps your irony detector needs a little preventative maintenance :-

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Fredrik Lundh wrote:
> John Machin wrote:
>
> > Actually, what Willie was concerned about was some cockamamie DBMS
> > which required to be fed Unicode, which it encoded as UTF-8, but
> > silently truncated if it was more than the n in varchar(n) ... or
> > something like that.
> >
> > So all he n

Re: unicode, bytes redux

2006-09-25 Thread John Roth
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> #

Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
John Machin wrote:
> Actually, what Willie was concerned about was some cockamamie DBMS
> which required to be fed Unicode, which it encoded as UTF-8, but
> silently truncated if it was more than the n in varchar(n) ... or
> something like that.
>
> So all he needs is a boolean result: u.willitfit
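For readers wondering what the varchar(n) concern looks like in practice, here is a minimal sketch in Python 3 (where `str` plays the role of the 2006-era `unicode` type). The names `fits_in_bytes` and `truncate_utf8` are hypothetical helpers made up for this illustration; no `willitfit` method exists on Python strings, and the thread makes clear that name was not offered seriously. The sketch assumes the database counts UTF-8 bytes, as described in the quoted message.

```python
def fits_in_bytes(text: str, encoding: str, width: int) -> bool:
    """Return True if text, once encoded, occupies at most width bytes."""
    return len(text.encode(encoding)) <= width


def truncate_utf8(text: str, width: int) -> str:
    """Cut text so its UTF-8 form fits in width bytes.

    Slicing the encoded bytes can split a multi-byte sequence; decoding
    with errors="ignore" drops the incomplete trailing sequence, so the
    result always ends on a whole code point.
    """
    return text.encode("utf-8")[:width].decode("utf-8", errors="ignore")


victory_hand = "\u270c"  # U+270C, three bytes in UTF-8
print(fits_in_bytes(victory_hand, "utf-8", 3))
print(fits_in_bytes(victory_hand, "utf-8", 2))
print(repr(truncate_utf8("a" + victory_hand, 2)))
```

Truncating in the application before the data reaches the database at least makes the loss explicit, instead of leaving it to a silent cut that might land in the middle of a character.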

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Paul Rubin wrote:
> Leif K-Brooks writes:
> > It requires a fairly large change to code and API for a relatively
> > uncommon problem. How often do you need to know how many bytes an
> > encoded Unicode string takes up without needing the encoded string
> > itself?
>
> Shrug. I

Re: unicode, bytes redux

2006-09-25 Thread Steven D'Aprano
On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
> willie writes:
>> # U+270C
>> # 11100010 10011100 10001100
>> buf = "\xE2\x9C\x8C"
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes() -> 3
>>
>> (goes through each code point and calculates
>> the number of bytes

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Paul Rubin wrote:
> "John Machin" writes:
> > Actually, what Willie was concerned about was some cockamamie DBMS
> > which required to be fed Unicode, which it encoded as UTF-8,
>
> Yeah, I remember that.
>
> > Tell you what, why don't you and Willie get together and write a PE

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
"John Machin" <[EMAIL PROTECTED]> writes: > Actually, what Willie was concerned about was some cockamamie DBMS > which required to be fed Unicode, which it encoded as UTF-8, Yeah, I remember that. > Tell you what, why don't you and Willie get together and write a PEP? If enough people care about

Re: unicode, bytes redux

2006-09-25 Thread John Machin
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?

Where it's been is irrelevant. Where it's going to is what matters.

> So that it's feasible to calculate the numb
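The point that the source encoding is irrelevant can be illustrated with a short sketch. It is written for Python 3, where `str` stands in for the `unicode` type discussed in this thread; the variable names are illustrative only. Two different byte sequences decode to the same text, and the byte count depends only on the encoding chosen on the way out.

```python
# U+270C arriving from two different sources.
utf8_bytes = b"\xe2\x9c\x8c"   # UTF-8 encoding of U+270C
utf16_bytes = b"\x0c\x27"      # UTF-16-LE encoding of U+270C

from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-le")

# Once decoded, the two strings are indistinguishable.
same_text = from_utf8 == from_utf16 == "\u270c"

# The size in bytes is a property of the target encoding, not of the string.
sizes = {
    enc: len(from_utf8.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
print(same_text, sizes)
```

The explicit `-le` codec names are used so that no byte order mark is added to the counts.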

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
Leif K-Brooks writes:
> It requires a fairly large change to code and API for a relatively
> uncommon problem. How often do you need to know how many bytes an
> encoded Unicode string takes up without needing the encoded string
> itself?

Shrug. I don't see a real large change--

Re: unicode, bytes redux

2006-09-25 Thread Leif K-Brooks
Paul Rubin wrote:
> Duncan Booth explains why that doesn't work. But I don't see any big
> problem with a byte count function that lets you specify an encoding:
>
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes('UTF-8') -> 3
> u.bytes('UCS-4') -> 4
>
> That avoids creat
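A byte count that takes the target encoding as a parameter can be sketched without any change to the string type. The following is a Python 3 illustration, not an existing API: `u.bytes(...)` is the method proposed in the thread and does not exist, and `encoded_length` and `utf8_length` are hypothetical helper names. The first helper builds the encoded string and measures it; the second counts UTF-8 bytes per code point without building the encoded string, which is the optimisation the proposal is after. The codec name `utf-32-be` is used here as the closest Python equivalent of the 'UCS-4' in the quoted example, since it adds no byte order mark.

```python
def encoded_length(text: str, encoding: str = "utf-8") -> int:
    """Byte count obtained by actually encoding the text."""
    return len(text.encode(encoding))


def utf8_length(text: str) -> int:
    """UTF-8 byte count computed per code point, without encoding.

    Assumes text contains no lone surrogates, which UTF-8 cannot encode.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1      # ASCII range
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3      # rest of the Basic Multilingual Plane
        else:
            total += 4      # supplementary planes
    return total


u = b"\xe2\x9c\x8c".decode("utf-8")
print(encoded_length(u, "utf-8"))      # 3
print(encoded_length(u, "utf-32-be"))  # 4
print(utf8_length(u))                  # 3
```

For most programs the first form is sufficient; the second only matters when the strings are large enough that the temporary encoded copy is a real cost.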

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
willie writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Duncan Bo

Re: unicode, bytes redux

2006-09-25 Thread Duncan Booth
willie wrote:
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.

So what sort of output

Re: unicode, bytes redux

2006-09-24 Thread Robert Kern
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?

Yes. The unicode object itself is precisely the wrong place for that kind of information. Many (most?) unicode o