Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
willie wrote:
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100

Re: unicode, bytes redux

2006-09-25 Thread Walter Dörwald
Steven D'Aprano wrote:
> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
>
>> willie <[EMAIL PROTECTED]> writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and

Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
John Machin wrote:
>> > So all he needs is a boolean result: u.willitfit(encoding, width)
>>
>> at what point in the program would that method be used ?
>
> Never, I hope. Were you taking that as a serious suggestion? Fredrik,
> perhaps your irony detector needs a little preventative maintenance :-

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Fredrik Lundh wrote:
> John Machin wrote:
>
> > Actually, what Willie was concerned about was some cockamamie DBMS
> > which required to be fed Unicode, which it encoded as UTF-8, but
> > silently truncated if it was more than the n in varchar(n) ... or
> > something like that.
> >
> > So all he n

Re: unicode, bytes redux

2006-09-25 Thread John Roth
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> #

Re: unicode, bytes redux

2006-09-25 Thread Fredrik Lundh
John Machin wrote:
> Actually, what Willie was concerned about was some cockamamie DBMS
> which required to be fed Unicode, which it encoded as UTF-8, but
> silently truncated if it was more than the n in varchar(n) ... or
> something like that.
>
> So all he needs is a boolean result: u.willitfit
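
For the varchar(n) truncation case described above, a minimal sketch of a fit check and a safe truncation follows. It assumes Python 3 (the thread predates it) and a stateless encoding such as UTF-8. The names `will_it_fit` and `truncate_to_bytes` are hypothetical helpers modelled on the joking `u.willitfit(encoding, width)` suggestion; neither is an existing method of the string type.

```python
def will_it_fit(text, encoding, width):
    """Return True if `text` encodes to at most `width` bytes."""
    return len(text.encode(encoding)) <= width


def truncate_to_bytes(text, encoding, width):
    """Trim `text` so that its encoded form fits in `width` bytes.

    Slicing the encoded bytes can cut a multi-byte character in half;
    decoding with errors="ignore" drops that incomplete tail.
    """
    return text.encode(encoding)[:width].decode(encoding, errors="ignore")


victory = "\u270c"  # U+270C, three bytes in UTF-8

print(will_it_fit(victory, "utf-8", 2))               # False
print(will_it_fit(victory, "utf-8", 3))               # True
print(truncate_to_bytes("ab" + victory, "utf-8", 4))  # ab
```

Both helpers build the encoded string, which is usually acceptable because the encoded bytes are needed anyway when writing to the database.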

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Paul Rubin wrote:
> Leif K-Brooks <[EMAIL PROTECTED]> writes:
> > It requires a fairly large change to code and API for a relatively
> > uncommon problem. How often do you need to know how many bytes an
> > encoded Unicode string takes up without needing the encoded string
> > itself?
>
> Shrug. I

Re: unicode, bytes redux

2006-09-25 Thread Steven D'Aprano
On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
> willie <[EMAIL PROTECTED]> writes:
>> # U+270C
>> # 11100010 10011100 10001100
>> buf = "\xE2\x9C\x8C"
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes() -> 3
>>
>> (goes through each code point and calculates
>> the number of bytes

Re: unicode, bytes redux

2006-09-25 Thread John Machin
Paul Rubin wrote:
> "John Machin" <[EMAIL PROTECTED]> writes:
> > Actually, what Willie was concerned about was some cockamamie DBMS
> > which required to be fed Unicode, which it encoded as UTF-8,
>
> Yeah, I remember that.
>
> > Tell you what, why don't you and Willie get together and write a PE

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
"John Machin" <[EMAIL PROTECTED]> writes:
> Actually, what Willie was concerned about was some cockamamie DBMS
> which required to be fed Unicode, which it encoded as UTF-8,

Yeah, I remember that.

> Tell you what, why don't you and Willie get together and write a PEP?

If enough people care about

Re: unicode, bytes redux

2006-09-25 Thread John Machin
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?

Where it's been is irrelevant. Where it's going to is what matters.

> So that it's feasible to calculate the numb

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
Leif K-Brooks <[EMAIL PROTECTED]> writes:
> It requires a fairly large change to code and API for a relatively
> uncommon problem. How often do you need to know how many bytes an
> encoded Unicode string takes up without needing the encoded string
> itself?

Shrug. I don't see a real large change--

Re: unicode, bytes redux

2006-09-25 Thread Leif K-Brooks
Paul Rubin wrote:
> Duncan Booth explains why that doesn't work. But I don't see any big
> problem with a byte count function that lets you specify an encoding:
>
>      u = buf.decode('UTF-8')
>      # ... later ...
>      u.bytes('UTF-8') -> 3
>      u.bytes('UCS-4') -> 4
>
> That avoids creat
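
The per-encoding byte count proposed above can be sketched for UTF-8 by walking the code points, without building the encoded string. This is only an illustration, assuming Python 3, where `str` is a sequence of code points. `utf8_len` is a hypothetical name, not an existing method such as the proposed `u.bytes()`.

```python
def utf8_len(text):
    """Count the bytes `text` would occupy in UTF-8 without encoding it.

    Assumes `text` holds no lone surrogates, which strict UTF-8
    encoding in Python 3 would reject.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:         # ASCII range
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:    # rest of the Basic Multilingual Plane
            total += 3
        else:                 # supplementary planes
            total += 4
    return total


u = b"\xE2\x9C\x8C".decode("utf-8")  # U+270C

print(utf8_len(u))                   # 3
print(len(u.encode("utf-8")))        # 3, the same count via encoding
print(len(u.encode("utf-32-be")))    # 4, fixed width, no byte order mark
```

In practice `len(u.encode(encoding))` is the simpler route and works for any codec; the hand-written count only saves the temporary bytes object.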

Re: unicode, bytes redux

2006-09-25 Thread Paul Rubin
willie <[EMAIL PROTECTED]> writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Duncan Bo

Re: unicode, bytes redux

2006-09-25 Thread Duncan Booth
willie <[EMAIL PROTECTED]> wrote:
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.

So what sort of output

Re: unicode, bytes redux

2006-09-24 Thread Robert Kern
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?

Yes. The unicode object itself is precisely the wrong place for that kind of information. Many (most?) unicode o

unicode, bytes redux

2006-09-24 Thread willie
(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

# U+270C
# 11100010 10011100 10001100
buf =
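
The byte count asked for here can be obtained today by encoding explicitly, with no need for the object to remember its source encoding. The following is a rough Python 3 rendering of the example, not the poster's code; `encoded_length` is a hypothetical helper name.

```python
def encoded_length(text, encoding="utf-8"):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# U+270C
# 11100010 10011100 10001100
buf = b"\xE2\x9C\x8C"
u = buf.decode("utf-8")

print(len(u))             # 1, one code point
print(encoded_length(u))  # 3, three bytes in UTF-8
```

Passing the target encoding at the call site reflects the point made in the replies: what matters is the encoding the text is going to, not the one it came from.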