>> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting
>> multiple errors there makes no sense.
>
> Changing a specification as fundamental as this is something that should not
> be undertaken lightly.
IMO, the only thing that can be agreed upon is that "something's bad
I’m not sure how the discussion of “which is better” relates to the discussion
of ill-formed UTF-8 at all.
And to the last, saying “you cannot process UTF-16 without handling surrogates”
seems to me to be the equivalent of saying “you cannot process UTF-8 without
handling lead & trail bytes”.
> Would you advocate replacing
> e0 80 80
> with
> U+FFFD U+FFFD U+FFFD (1)
> rather than
> U+FFFD (2)
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t
> want to decode it as a NUL (that was the source of previ
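For concreteness, here is a small Python sketch (illustrative only, not from any of the mails) of how the two numbered options in that quote differ for E0 80 80. The "single FFFD" repairer below is my own naive sketch of that policy, not any particular library's algorithm.

    # Over-long encoding of U+0000; ill-formed UTF-8.
    data = b"\xe0\x80\x80"

    # Option (1): one U+FFFD per ill-formed subpart.  Python 3's own decoder
    # behaves this way: E0 is rejected once the second byte falls outside
    # A0..BF, then each stray 80 is reported as its own error.
    print(data.decode("utf-8", errors="replace"))             # '\ufffd\ufffd\ufffd'

    # Option (2): one U+FFFD for the whole sequence.  A deliberately naive
    # repairer that consumes the number of bytes the lead byte claims and
    # emits a single replacement character if that chunk is ill-formed:
    def repair_single_fffd(b: bytes) -> str:
        out, i = [], 0
        while i < len(b):
            lead = b[i]
            n = 1
            if 0xC0 <= lead <= 0xDF: n = 2
            elif 0xE0 <= lead <= 0xEF: n = 3
            elif 0xF0 <= lead <= 0xF7: n = 4
            chunk = b[i:i + n]
            try:
                out.append(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                out.append("\ufffd")
            i += n
        return "".join(out)

    print(repair_single_fffd(data))                           # '\ufffd'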
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +0000, Shawn Steele wrote:
But why change a recommendation just because it “feels like”? As you said, it’s just a recommendation, so if that really annoyed someone, they could do something else (e.g., they could use a single FFFD).
If the recommendation is truly that meaningless or arbitrary, then we just get
into silly d
> Faster ok, provided this does not break other uses, notably for random access within strings…
Either way, this is a “recommendation”. I don’t see how that can provide for
not-“breaking other uses.” If it’s internal, you can do what you will, so if
you need the 1:1 seeming parity, then yo
+ the list, which somehow my reply seems to have lost.
> I may have missed something, but I think nobody actually proposed to change
> the recommendations into requirements
No thanks, that would be a breaking change for some implementations (like mine)
and force them to become non-complying or
> If the thread has made one thing clear is that there's no consensus in the
> wider community
> that one approach is obviously better. When it comes to ill-formed sequences,
> all bets are off.
> Simple as that.
> Adding a "recommendation" this late in the game is just bad standards policy.
I
So basically this came about because bugs were filed against code for not following the "recommendation."  To fix that, the recommendation will be changed.  However, that is then going to lead to bugs filed against other existing code that does not follow the new recommendation.
I totally get the forward/backward sca
> I think nobody is debating that this is *one way* to do things, and that some
> code does it.
Except that they sort of are. The premise is that the "old language was
wrong", and the "new language is right." The reason we know the old language
was wrong was that there was a bug filed against
> Which is to completely reverse the current recommendation in Unicode 9.0.
> While I agree that this might help you fend off a bug report, it would
> create chances for bug reports for Ruby, Python3, many if not all Web
> browsers,...
& Windows & .NET
Changing the behavior of the Windows /
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
> as U+002F.
Sort of, maybe. It was not legal for them to generate it though. So you could
kind of infer that it was not a legal sequence.
-Shawn
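The specific sequence referred to above was cut off in the snippet; the best-known example of an over-long '/' is the two-byte form C0 AF, which is my addition here purely for illustration. A small Python sketch of why decoders now reject it outright:

    # Over-long two-byte form of U+002F '/'.  Decoders that mapped it back to
    # '/' enabled "../" path-check bypasses, which is part of why later
    # versions of the standard reject over-long forms entirely.
    overlong_slash = b"\xc0\xaf"

    try:
        overlong_slash.decode("utf-8")                        # strict decoding
    except UnicodeDecodeError as err:
        print("rejected as ill-formed:", err.reason)          # 'invalid start byte'

    # Even with replacement, it never silently turns back into '/':
    print("/" in overlong_slash.decode("utf-8", errors="replace"))   # False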
> > In either case, the bad characters are garbage, so neither approach is
> > "better" - except that one or the other may be more conducive to the
> > requirements of the particular API/application.
> There's a potential issue with input methods that indirectly edit the backing
> store. For e
> For implementations that emit FFFD while handling text conversion and repair
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same
> results, so that indices within the
> resulting strings are consistent across implementations for all the cor
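A quick sketch (mine, not from the quoted mail) of the index-consistency point: two repair policies applied to the same bytes yield strings of different lengths, so an index computed against one repaired string does not point at the same place in the other.

    raw = b"abc\xe0\x80\x80def"

    per_subpart = raw.decode("utf-8", errors="replace")    # 'abc' + 3 x U+FFFD + 'def'
    single_fffd = "abc\ufffddef"                            # one U+FFFD for the whole sequence

    print(per_subpart.index("d"), single_fffd.index("d"))   # 6 vs 4 -- indices disagree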
> it’s more meaningful for whoever sees the output to see a single U+FFFD
> representing
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid
> lead byte and
> then another for an “unexpected” trailing byte.
I disagree. It may be more meaningful for some applications
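A tiny sketch (mine): the "illegally encoded NUL" in the quote is presumably the two-byte over-long form C0 80, given the "two U+FFFDs" it describes. A decoder that replaces per bad byte does indeed yield two replacement characters for it:

    print(b"\xc0\x80".decode("utf-8", errors="replace"))         # '\ufffd\ufffd'
    print(len(b"\xc0\x80".decode("utf-8", errors="replace")))    # 2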
> And *that* is what the specification says. The whole problem here is that
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of
> us don’t think *should*
> be considered best practice.
> Perhaps “best practice” should simply be altered to say that yo
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote:
>
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
> wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this
>> problem:
>> * Either a "
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:
I think that the (or a) key problem is that the current "best practice" is
treated as "SHOULD" in RFC parlance. W
For voice we certainly get clues about the speaker’s intent from their tone.  That tone can change the meaning of the same written word quite a bit; you don’t need video for two different readings of the exact same words to differ wildly in meaning.
Writers have always taken liberties wi
Depends on your perspective I guess ;)
-Original Message-
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Thursday, February 15, 2018 2:31 PM
To: unicode@unicode.org
Subject: Re: Why so much emoji nonsense? - Proscription

On Thu, 15 Feb 2018 21:38:19 +0000, Shawn Steele wrote:
IMO, trying to do security checks on an encoded string that will be decoded
later is pretty much guaranteed to miss cases. Particularly with ISO-2022-JP,
which has a plethora of variations in how different software/libraries/OS's
decode it and treat the invalid/edge cases.
I typically encourag
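A sketch (mine, illustrative only) of why filtering the encoded bytes is fragile for a stateful encoding like ISO-2022-JP: the same decoded text can come from byte streams that look nothing alike, so a substring check on the raw bytes can be bypassed. The particular escape sequence below is just one example.

    needle = b"admin"

    plain   = b"admin"
    escaped = b"ad\x1b(Jmin\x1b(B"    # ESC ( J switches to JIS X 0201 Roman mid-word

    print(needle in plain)            # True  -- naive byte-level check "works"
    print(needle in escaped)          # False -- same text after decoding, check bypassed

    # Safer: decode first (ideally with the same decoder the consumer will
    # use), then validate the decoded text.
    for raw in (plain, escaped):
        text = raw.decode("iso2022_jp")
        print(text, "admin" in text)  # both print: admin True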
I've been lurking on this thread a little.
This discussion has gone “all over the place”; however, I’d like to point out that part of the reason NBSP has been used as a thousands separator is that it exists in all of those legacy code pages that were mentioned, predating Unicode.
Whether
>> Keeping these applications outdated has no other benefit than providing a
>> handy lobbying tool against support of NNBSP.
I believe you’ll find that there are some French banks and other institutions
that depend on such obsolete applications (unfortunately).
Additionally, I believe you’ll fin
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed
>> for up-to-date and fully localized apps. So one hassle is off the table.
Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a little narrow-minded.
>> I didn’t look into these date interchange
I'm curious what you'd use it for?
From: Unicode On Behalf Of Slawomir Osipiuk via Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode@unicode.org
Subject: Unicode "no-op" Character?
Does Unicode include a character that does nothing at all? I'm talking about
something that can be used for
+ the list. For some reason the list's reply header is confusing.
From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk
Subject: RE: Unicode "no-op" Character?
The original comment about putting it between the base character and the
combining diacritic seems peculiar.
Assuming you were using any of those characters as "markup", how would you know
when they were intentionally in the string and not part of your marking system?
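A small sketch (mine) of one concrete reason the base-plus-diacritic placement is peculiar: anything wedged between a base and its combining mark blocks canonical composition, so the text no longer normalizes or compares the same way. ZERO WIDTH JOINER here is just a stand-in for the hypothetical "no-op" character.

    import unicodedata

    plain    = "e\u0301"          # e + COMBINING ACUTE ACCENT
    injected = "e\u200d\u0301"    # same, with ZWJ wedged in between

    print(len(unicodedata.normalize("NFC", plain)))      # 1 -- composes to U+00E9
    print(len(unicodedata.normalize("NFC", injected)))   # 3 -- composition is blocked
    print(plain == injected.replace("\u200d", ""))       # True only after stripping it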
-Original Message-
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To: unicode@unicode.org
But... it's not actually discardable.  The hypothetical "packet" architecture (using the term architecture somewhat loosely) needed the information being tunneled in by this character.  If it were actually discardable, then the "noop" character wouldn't be required, as it would be discarded.
Sinc
I think you're overstating my concern :)
I meant that those things tend to be particular to a certain context and often
aren't interesting for interchange. A text editor might find it convenient to
place word boundaries in the middle of something another part of the system
thinks is a single u
> From the point of view of Unicode, it is simpler: if the character is in use
> or has had use, it should be included somehow.
That bar, to me, seems too low. Many things are only used briefly or in a
private context that doesn't really require encoding.
The hieroglyphs discussion is interes
I'm not opposed to a sub-block for "Modern Hieroglyphs".
I confess that even though I know nothing about Hieroglyphs, I find it fascinating that such a thoroughly dead script might still be living in some way, even if it's only a little bit.
-Shawn
IMO, encodings, particularly ones depending on state such as this, may have multiple ways to output the same, or similar, sequences.  Which means that pretty much any time an encoding transforms data, any previous security or other validation-style checks are no longer valid and any security/valid
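A trivial sketch (mine) of the general point: a check that was valid before a transform need not be valid afterwards. "Contains no NUL byte" holds for the UTF-8 form of a string but not for its UTF-16 form.

    s = "abc"
    utf8  = s.encode("utf-8")
    utf16 = s.encode("utf-16-le")

    print(b"\x00" in utf8)    # False -- the check passes on the original bytes
    print(b"\x00" in utf16)   # True  -- the same check fails after re-encoding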