Re: UTF-8 question from Dive into Python 3

2011-01-20 Thread jmfauth
On Jan 19, 11:33 pm, Terry Reedy wrote: > On 1/19/2011 1:02 PM, Tim Harig wrote: > > > Right, but I only have to do that once.  After that, I can directly address > > any piece of the stream that I choose.  If I leave the information as a > > simple UTF-8 stream, I would have to walk the stream ag

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Terry Reedy
On 1/19/2011 1:02 PM, Tim Harig wrote: Right, but I only have to do that once. After that, I can directly address any piece of the stream that I choose. If I leave the information as a simple UTF-8 stream, I would have to walk the stream again, I would have to walk through the first byte o

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 19:18:49 + (UTC) Tim Harig wrote: > On 2011-01-19, Antoine Pitrou wrote: > > On Wed, 19 Jan 2011 18:02:22 + (UTC) > > Tim Harig wrote: > >> Converting to a fixed byte > >> representation (UTF-32/UCS-4) or separating all of the bytes for each > >> UTF-8 into 6 byte con

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou wrote: > On Wed, 19 Jan 2011 18:02:22 + (UTC) > Tim Harig wrote: >> Converting to a fixed byte >> representation (UTF-32/UCS-4) or separating all of the bytes for each >> UTF-8 into 6 byte containers both make it possible to simply index the >> letters by a const
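
The indexing trade-off quoted above can be sketched in Python. This is only an illustration, not code from the thread: the helper names `utf8_char_at` and `utf32_char_at` are hypothetical, and the UTF-32 helper assumes little-endian data without a BOM.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` by scanning UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError(index)
    seen = -1      # index of the most recent character start we passed
    start = None   # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted character was the last one


def utf32_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` by direct offset (UTF-32-LE, no BOM)."""
    if not 0 <= index < len(data) // 4:
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-le")


text = "a\u00e9\u20ac\U0001F600z"  # characters of 1, 2, 3, 4 and 1 UTF-8 bytes
utf8_data = text.encode("utf-8")
utf32_data = text.encode("utf-32-le")
```

The UTF-8 helper has to look at every byte before the requested character, while the UTF-32 helper jumps straight to offset `4 * index`. The price is size: here `utf32_data` takes 20 bytes against 11 for `utf8_data`.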

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 18:02:22 + (UTC) Tim Harig wrote: > On 2011-01-19, Antoine Pitrou wrote: > > On Wed, 19 Jan 2011 16:03:11 + (UTC) > > Tim Harig wrote: > >> > >> For many operations, it is just much faster and simpler to use a single > >> character based container opposed to having t

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou wrote: > On Wed, 19 Jan 2011 16:03:11 + (UTC) > Tim Harig wrote: >> >> For many operations, it is just much faster and simpler to use a single >> character based container opposed to having to process an entire byte >> stream to determine individual letters from

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 16:03:11 + (UTC) Tim Harig wrote: > > For many operations, it is just much faster and simpler to use a single > character based container opposed to having to process an entire byte > stream to determine individual letters from the bytes or to having > adaptive size contai

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou wrote: > On Wed, 19 Jan 2011 14:00:13 + (UTC) > Tim Harig wrote: >> UTF-8 has no apparent endianness if you only store it as a byte stream. >> It does however have a byte order. If you store it using multibytes >> (six bytes for all UTF-8 possibilities), which is

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Adam Skutt wrote: > On Jan 19, 9:00 am, Tim Harig wrote: >> That is why I say that byte streams are essentially big endian. It is >> all a matter of how you look at it. > > It is nothing of the sort. Some byte streams are in fact, little > endian: when the bytes are combined into
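
As a rough illustration of the point being argued here (byte order only matters once several bytes are combined into one larger value), the same four bytes can be read both ways in Python. The sample bytes are arbitrary and not taken from the thread.

```python
import struct

raw = bytes([0x01, 0x02, 0x03, 0x04])

# The byte sequence itself is fixed; only its interpretation as an integer differs.
as_big = int.from_bytes(raw, "big")
as_little = int.from_bytes(raw, "little")

# struct expresses the same choice with the '>' and '<' format prefixes.
(big_u32,) = struct.unpack(">I", raw)
(little_u32,) = struct.unpack("<I", raw)
```

Neither reading is implied by the bytes alone; the format that defines the stream has to say which one applies.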

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Adam Skutt
On Jan 19, 9:00 am, Tim Harig wrote: > > So, you can always assume a big-endian and things will work out correctly > while you cannot always make the same assumption as little endian > without potential issues.  The same holds true for any byte stream data. You need to spend some serious time pro

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 14:00:13 + (UTC) Tim Harig wrote: > > - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If > - yes, then can I still assume the remaining UTF-8 bytes are in big-endian > ^^ > - or

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
Considering your post contained no information or evidence for your negations, I shouldn't even bother responding. I will bite once. Hopefully next time your arguments will contain some pith. On 2011-01-19, Antoine Pitrou wrote: > On Wed, 19 Jan 2011 11:34:53 + (UTC) > Tim Harig wrote: >> Th

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 11:34:53 + (UTC) Tim Harig wrote: > That is why the FAQ I linked to > says yes to the fact that you can consider UTF-8 to always be in big-endian > order. It certainly doesn't. Read better. > Essentially all byte based data is big-endian. This is pure nonsense. -- htt

Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Tim Roberts wrote: > Tim Harig wrote: >>On 2011-01-17, carlo wrote: >> >>> 2- If that were true, can you point me to some documentation about the >>> math that, as Mark says, demonstrates this? >> >>It is true because UTF-8 is essentially an 8 bit encoding that resorts >>to the ne

Re: UTF-8 question from Dive into Python 3

2011-01-18 Thread Tim Roberts
Tim Harig wrote: >On 2011-01-17, carlo wrote: > >> 2- If that were true, can you point me to some documentation about the >> math that, as Mark says, demonstrates this? > >It is true because UTF-8 is essentially an 8 bit encoding that resorts >to the next bit once it exhausts the addressable spac

Re: UTF-8 question from Dive into Python 3

2011-01-18 Thread Raymond Hettinger
On Jan 17, 2:19 pm, carlo wrote: > Hi, > recently I had to study *seriously* Unicode and encodings for one > project in Python but I was left with a couple of doubts that arose after > reading the Unicode chapter of the Dive into Python 3 book by Mark > Pilgrim. > > 1- Mark says: > "Also (and you’ll have to t

Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread carlo
On 17 Gen, 23:34, Antoine Pitrou wrote: > On Mon, 17 Jan 2011 14:19:13 -0800 (PST) > > carlo wrote: > > Is it true UTF-8 does not have any "big-endian/little-endian" issue > > because of its encoding method? > > Yes. > > > And if it is true, why Mark (and > > everyone does) writes about UTF-8 wit

Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Antoine Pitrou
On Mon, 17 Jan 2011 14:19:13 -0800 (PST) carlo wrote: > Is it true UTF-8 does not have any "big-endian/little-endian" issue > because of its encoding method? Yes. > And if it is true, why Mark (and > everyone does) writes about UTF-8 with and without BOM some chapters > later? What would be the
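
A small Python sketch of the question and answer above, using only standard codecs. The sample string is an arbitrary choice for illustration.

```python
import codecs

text = "caf\u00e9"  # 'é' is outside ASCII, so UTF-8 needs two bytes for it

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 fixes the order of its own bytes, so there is only one encoded form.
# UTF-16 code units are 16-bit integers, so the two byte orders give different bytes.
orders_differ = utf16_le != utf16_be

# The "utf-8-sig" codec prepends the 3-byte signature EF BB BF when encoding
# and strips it again when decoding. It marks the data as UTF-8; it does not
# select a byte order.
with_sig = text.encode("utf-8-sig")
has_signature = with_sig.startswith(codecs.BOM_UTF8)
round_trip = with_sig.decode("utf-8-sig")
```

In UTF-16 the BOM tells a reader which of the two byte orders was used. In UTF-8 the same character can only act as a signature, which is why files exist both with and without it.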

Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Tim Harig
On 2011-01-17, carlo wrote: > Is it true UTF-8 does not have any "big-endian/little-endian" issue > because of its encoding method? And if it is true, why Mark (and > everyone does) writes about UTF-8 with and without BOM some chapters > later? What would be the BOM purpose then? Yes, it is true.

Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Alexander Kapps
On 17.01.2011 23:19, carlo wrote: Is it true UTF-8 does not have any "big-endian/little-endian" issue because of its encoding method? And if it is true, why Mark (and everyone does) writes about UTF-8 with and without BOM some chapters later? What would be the BOM purpose then? Can't answer yo

UTF-8 question from Dive into Python 3

2011-01-17 Thread carlo
Hi, recently I had to study *seriously* Unicode and encodings for one project in Python but I was left with a couple of doubts that arose after reading the Unicode chapter of the Dive into Python 3 book by Mark Pilgrim. 1- Mark says: "Also (and you’ll have to trust me on this, because I’m not going to show yo