On 2013-10-09, Ned Batchelder wrote:
> On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
>> and what Unicode.org does not say is that these coding schemes
>> (like any coding scheme) should be used in an exclusive way.
>
> Can you clarify what you mean by "in an exclusive way"?
Ned, pay no attention
On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; ...
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
> > http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> > encoding forms can be used to represent the full range of encoded
> > characters in the Unicode Standard; ... Each of the three Unicode
On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:
> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard
On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
In any case, "\ud800\udc01" isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think tha
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:
> The only time you should get a surrogate pair in a Unicode string is in
> a narrow build, which doesn't exist in Python 3.3 and later.
Incorrect.
py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat
4.1.2-52)
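A minimal sketch of the distinction at issue here, assuming CPython 3.3 or later, where the narrow/wide build split no longer exists. A supplementary character is a single code point, yet a str can still hold surrogate code points if they are written explicitly. The variable names are illustrative only.

```python
import sys

# On Python 3.3+ every build covers the full code point range.
print(hex(sys.maxunicode))      # 0x10ffff

astral = '\U00010001'           # one supplementary code point
pair = '\ud800\udc01'           # two surrogate code points, typed explicitly

print(len(astral))              # 1
print(len(pair))                # 2
print(astral == pair)           # False
```

So the two strings are distinct objects of different lengths, even though the pair is what UTF-16 would use to represent the single character.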
On 10/8/2013 5:47 PM, Terry Reedy wrote:
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs
of surrogates.
It says you should be able
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
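For anyone wanting to reproduce this, a minimal sketch of how the codec's error handlers treat a lone surrogate, assuming a current CPython 3. The default 'strict' handler refuses it, while 'surrogatepass' writes the three-byte form regardless (the result is not well-formed UTF-8). The variable names are illustrative.

```python
lone = '\ud800'  # a lone high-surrogate code point

# Default 'strict' handler: the UTF-8 codec rejects surrogate code points.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print(exc.reason)            # surrogates not allowed

# 'surrogatepass' encodes the surrogate anyway.
raw = lone.encode('utf-8', 'surrogatepass')
print(raw)                       # b'\xed\xa0\x80'

# Those bytes are rejected by a strict decode as well.
try:
    raw.decode('utf-8')
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# The same handler is needed to get the surrogate back.
roundtrip = raw.decode('utf-8', 'surrogatepass')
print(roundtrip == lone)         # True
```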
On 08/10/2013 16:23, Pete Forman wrote:
Steven D'Aprano writes:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode cha
On 2013-10-08, Neil Cerutti wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string. In a
> perfect world it would automatically get converted to
> '\u00010001' without intervention.
This last paragraph is erroneous. I must have had a typo in my
testing.
--
Neil Cerutti
Steven D'Aprano writes:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0:
On 2013-10-08, Steven D'Aprano wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYL
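A rough sketch of the arithmetic behind the quoted example, in case it helps: the two 16-bit values in the UTF-16 output are a high and a low surrogate, and together they stand for one code point. The helper name `combine_surrogates` is hypothetical and not part of any library.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it stands for."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a high/low surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

c = '\N{LINEAR B SYLLABLE B038 E}'
utf16 = c.encode('utf-16be')              # b'\xd8\x00\xdc\x01'
high = int.from_bytes(utf16[:2], 'big')   # 0xD800
low = int.from_bytes(utf16[2:], 'big')    # 0xDC01

code_point = combine_surrogates(high, low)
print(hex(code_point))                    # 0x10001
print(chr(code_point) == c)               # True
print(c.encode('utf-8'))                  # b'\xf0\x90\x80\x81'
```

The UTF-8 form is produced from the code point U+10001 itself, not from the two surrogate values, which is why it is four bytes rather than two three-byte sequences.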
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
Tra