Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:24, Shawn Steele via Unicode wrote: > > > For implementations that emit FFFD while handling text conversion and > > repair (ie, converting ill-formed > > UTF-8 to well-formed), it is best for interoperability if they get the same > > results, so that indices within the > >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:42, Shawn Steele via Unicode wrote: > >> And *that* is what the specification says. The whole problem here is that >> someone elevated >> one choice to the status of “best practice”, and it’s a choice that some of >> us don’t think *should* >> be considered best practice.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode wrote: > On Wed, 31 May 2017 15:12:12 +0300 > Henri Sivonen via Unicode wrote: >> I am not claiming it's too difficult to implement. I think it >> inappropriate to ask implementations, even from-scratch ones, to take >> on added comp

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > wrote: >> * As far as I can tell, there are two (maybe three) sane approaches to this >> problem: >>* Either a "maximal" emission of one U+FFFD for every byte that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
On 6/1/2017 2:32 AM, Henri Sivonen via Unicode wrote: On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode wrote: Henri Sivonen wrote: If anything, I hope this thread results in the establishment of a requirement for proposals to come with proper research about what multiple prominent i

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance, when what this really needs is a "MAY". People reading standards tend to treat "SHOULD" and "MUST" as the same thing. So, when an implementation deviates, then you get bugs (as we se

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance, when what this really needs is a "MAY". People reading standards tend to treat "SHOULD" and "MUST"

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
But those are IETF definitions. They don’t have to mean the same thing in Unicode - except that people working in this field probably expect them to. From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Thursday, June 1, 2017 11:44 AM To: unicode@unico

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 12:32:08 +0300 Henri Sivonen via Unicode wrote: > On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode > wrote: > > On Wed, 31 May 2017 15:12:12 +0300 > > Henri Sivonen via Unicode wrote: > >> I am not claiming it's too difficult to implement. I think it > >> ina

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag (c) via Unicode
On 6/1/2017 11:53 AM, Shawn Steele wrote: But those are IETF definitions. They don’t have to mean the same thing in Unicode - except that people working in this field probably expect them to. That's the thing. And even if Unicode had its own version of RFC 2119 one would considered it r

Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Doug Ewell via Unicode
Richard Wordingham wrote: > even supporting 6-byte patterns just in case 20.1 bits eventually turn > out not to be enough, Oh, gosh, here we go with this. What will we do if 31 bits turn out not to be enough? -- Doug Ewell | Thornton, CO, US | ewellic.org

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 01 Jun 2017 12:54:45 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > even supporting 6-byte patterns just in case 20.1 bits eventually > > turn out not to be enough, > > Oh, gosh, here we go with this. You were implicitly invited to argue that there was no need

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Philippe Verdy via Unicode
This is still very unlikely to occur. There is a lot of discussion about emojis, but they still don't count for much in the total. The major updates were expected for CJK sinograms, but even the rate of updates has slowed down and we will eventually have another sinographic plane, but it will not come soon a

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode
On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote: You were implicitly invited to argue that there was no need to handle 5 and 6 byte invalid sequences. Well, working from the *current* specification: FC 80 80 80 80 80 and FF FF FF FF FF FF are equal trash, uninterpretable as *anyth

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700 Ken Whistler via Unicode wrote: > On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote: > > You were implicitly invited to argue that there was no need to > > handle 5 and 6 byte invalid sequences. > > > > Well, working from the *current* specification: >

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700 Ken Whistler via Unicode wrote: > Well, working from the *current* specification: > > FC 80 80 80 80 80 > and > FF FF FF FF FF FF > > are equal trash, uninterpretable as *anything* in UTF-8. > > By definition D39b, either sequence of bytes, if encountered by a

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode
On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote: By definition D39b, either sequence of bytes, if encountered by a conformant UTF-8 conversion process, would be interpreted as a sequence of 6 maximal subparts of an ill-formed subsequence. ("D39b" is a typo for "D93b".) Sorry about
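
The claim quoted above, that FC 80 80 80 80 80 and FF FF FF FF FF FF each break down into six maximal subparts, can be checked informally against a real decoder. The following is a minimal sketch, not part of the thread: it assumes CPython 3's built-in UTF-8 decoder with errors="replace", and the helper name replacement_count is hypothetical. Other decoders may substitute U+FFFD differently, which is the interoperability point under discussion.

```python
# Illustrative sketch only: count the U+FFFD characters that Python's
# built-in UTF-8 decoder emits for the two byte sequences from the thread.
# The helper name `replacement_count` is hypothetical.

def replacement_count(data: bytes) -> int:
    """Decode as UTF-8 with replacement and count U+FFFD in the result."""
    text = data.decode("utf-8", errors="replace")
    return text.count("\ufffd")


samples = [
    bytes.fromhex("FC8080808080"),  # old-style 6-byte lead byte plus 5 trail bytes
    bytes.fromhex("FFFFFFFFFFFF"),  # bytes that can never appear in UTF-8
]

for sample in samples:
    shown = " ".join(f"{b:02X}" for b in sample)
    print(shown, "->", replacement_count(sample), "x U+FFFD")
```

On CPython 3 both sequences should come out as six U+FFFD, since neither FC, FF nor a bare 80 can start a well-formed sequence, so each byte is replaced on its own.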

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 19:19:51 -0700 Ken Whistler via Unicode wrote: > > and therefore should start a > > sequence of 6 characters. > > That is completely false, and has nothing to do with the current > definition of UTF-8. > > The current, normative definition of UTF-8, in the Unicode Standa

Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode
On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote: TUS Section 3 is like the Augean Stables. It is a complete mess as a standards document, That is a matter of editorial taste, I suppose. imputing mental states to computing processes. That, however, is false. The rhetorical turn i