On Oct 20, 2024 at 7:51:34 PM, Dale Worley via Datatracker <nore...@ietf.org> wrote:
> Reviewer: Dale Worley
> Review result: Ready with Issues

First off, thanks Dale, and this is an example of the last-call process
being useful; getting a fresh pair of eyes onto the doc unsurprisingly
turns up a bunch of issues.

> Check whether "numeric values", "code points", and "characters" are
> used correctly throughout the document.  I don't have a good sense of
> the proper usage of these terms regarding Unicode, but I have a sense
> (that might be incorrect) that "code point" is a subclass of "numeric
> value", and should always be used when referring to the number
> representing a character.

This is draft-09 and has been chewed over by multiple people who I
consider to be pretty extreme Unicode pedants, and I'm comfortable with
this terminology as it stands now. Happy to hear specific objections.

> You probably want to ASCII-ize various quote symbols used in the
> document.  I'm not sure how the Editor wants to handle the "black
> heart" characters, but they are informative examples and ought to be
> retained if possible.

Ouch, yes.

> It would be useful to describe here why the newly defined subsets are
> superior to the two existing subsets.

Hmm. Something like "The new subset, 'Unicode Assignables', is reduced
in size compared to the others and excludes all the code points which
are considered 'problematic' according to this document's criteria."

> Also, this statement is incorrect; the document defines four new
> subsets, comprising one base class and three profiles.

You're correct, but the upcoming draft-10 removes the base class, so the
correct number is back to three.

> 1.1.  Notation
>
>    In the text, Unicode's standard "U+",
>    zero-padded to four places, is used.  For example, "A", decimal 65,
>    would be expressed as U+0041, and "🖤" (Black Heart), decimal
>    128,420, would be U+1F5A4.
>
> This seems awkward to me.  Perhaps:
>
>    In the text, we use Unicode's standard notation of "U+" followed by
>    four or more hexadecimal digits.  For example, "A", decimal 65,
>    is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420,
>    is U+1F5A4.

Works for me. Any objections? (See the little sketch below.)

>    The subsets are described both in ABNF and as PRECIS profiles
>    [RFC8264].
>
> This is correct, but ...  The entire document is organized as being
> within the PRECIS conceptual framework, and yet the references to
> PRECIS are all phrased as pointers to various parts of the PRECIS
> RFCs, not to the whole.  The document should "at the top level" (it
> seems like this means in section 1) state that it is part of, or
> within, the PRECIS framework, and reference the relevant PRECIS RFCs
> at that point.  The later references to PRECIS can then be omitted
> unless they are to specific sections of RFCs that are relevant to the
> particular reference.

You have a point, but I'm reluctant, for several reasons. First, I
disagree that the doc is organized per PRECIS; in fact, it makes use of
exactly zero of the considerable apparatus that PRECIS builds to support
its profile definitions. The organization is, rather, "Here are the
problems with code points, and here are three subsets that, to a varying
degree, exclude them, and here are the necessary declarations to use
these as PRECIS profiles." Second, I don't want to create the
expectation that every off-the-shelf PRECIS library will know about
"XML Characters" or "Unicode Assignables". Third, I don't want to create
the impression that a specifier must understand PRECIS to use Unichars.
PRECIS is big and complicated, really quite a heavy lift.
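
Going back to the 1.1 Notation point for a moment, here's the sketch I
mentioned: a minimal Python illustration of "U+" followed by four or
more hexadecimal digits, zero-padded to at least four places. The helper
name is mine, purely for illustration.

    def u_plus(cp: int) -> str:
        # U+ notation: hexadecimal, zero-padded to at least four places.
        return f"U+{cp:04X}"

    print(u_plus(65))       # U+0041  ("A")
    print(u_plus(128420))   # U+1F5A4 (Black Heart)
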
> 2.  Characters and Code Points
>
>    However, each Unicode character is assigned a code
>    point, used to represent the characters in computer memory and
>    storage systems and, in specifications, to specify allowed subsets.
>
> This is an awkward mix of singular and plural usages.  Inquire of
> Editor the best way to phrase this.

You are correct. We may try to address this, but throwing it to the
editor is not a bad idea.

>    Section 6.1 defines a new PRECIS base class that encompasses all
>    Unicode code points.  This base class is used for the PRECIS profiles
>    for the subsets defined in this document.
>
> Would be a little clearer as

You are right, but there are a bunch of problems with the whole base
class that came up in recent discussion; the next draft just abandons
it. It turns out you can define a PRECIS profile without needing a base
class.

> 2.1.  Transformation Formats
>
>    However, it is useful
>    to note that the "UTF-16" format represents each code point with one
>    or two 16-bit chunks, and the "UTF-8" format uses variable-length
>    byte sequences.
>
> I think the usual terminology would be "variable-length sequences of
> 8-bit chunks" or better "variable-length sequences of octets".

Really? The document is written for programmers, for whom
"variable-length byte sequences" is super-idiomatic. (There's a short
sketch below.)

> 2.2.  Problematic Code Points
>
>    [...] would benefit from careful consideration of the issues
>    described by PRECIS; [...]
>
> It seems to me this ought to specify where these issues are described.

Not sure… PRECIS mixes motivation and implementation quite a bit.
Anyone have any specific suggestions for places in there that we should
highlight?

>    Definition D10a in section 3.4 of [UNICODE] defines seven code point
>    types.  Three types of code points are assigned to entities which are
>    not actually characters or whose value as Unicode characters in text
>    fields is questionable: "Surrogate", "Control", and "Noncharacter".
>    In this document, "problematic" refers to code points whose type is
>    "Surrogate" or "Noncharacter", and to "legacy controls" as defined in
>    Section 2.2.2.2.
>
> Given that "section 3.4" at the beginning of the paragraph refers to
> [UNICODE], it might be clearer to say "as defined in Section 2.2.2.2
> of this document" or "as defined in Section 2.2.2.2 below".

Right.

> 2.2.1.  Surrogates
>
>    A total of 2,048 code points, in the range U+D800-U+DFFF, are divided
>
> Since "the range" consists of 2,048 code points, this can be said more
> exactly:
>
>    A total of 2,048 code points, the range U+D800-U+DFFF, are divided

OK.

> Also, doesn't "total" take a singular verb?  Or is that an
> Americanism?

Indeed. "The total is 12", not "the total are 12".

> 2.2.2.2.  Legacy Controls
>
>    Aside from the useful controls, the control codes are mostly obsolete
>
> I think you need to capitalize "Control Codes" here.

Right.

> 2.2.3.  Noncharacters
>
> It seems, looking at rule D15 of section 3.4 of Unicode 15.0.0 shows
> "noncharacter" as not intrinsically capitalized in Unicode usage.  But
> rule D10a shows "Noncharacter" as intrinsically capitalized.  Perhaps
> ask the Editor about this.

This is not the first inconsistency in [Unicode] language that Unichars
has turned up. Delighted to toss this one to the editor.

> 3.  Dealing With Problematic Code Points
>
>    [RFC9413], "Maintaining Robust Protocols", provides a thorough
>    discussion of strategies for dealing with issues in input data, for
>    example problematic code points.
>
> Probably better to use "including" in place of "for example".

OK.
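
Here's the short sketch promised under 2.1 above: a minimal Python
illustration of "one or two 16-bit chunks" versus "variable-length byte
sequences", with the output shown in comments.

    for ch in ("A", "\U0001F5A4"):          # U+0041 and U+1F5A4 (Black Heart)
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")      # big-endian, no byte-order mark
        print(f"U+{ord(ch):04X}: {len(utf8)} UTF-8 byte(s), "
              f"{len(utf16) // 2} UTF-16 code unit(s)")
    # U+0041: 1 UTF-8 byte(s), 1 UTF-16 code unit(s)
    # U+1F5A4: 4 UTF-8 byte(s), 2 UTF-16 code unit(s)
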
>    [...] can be
>    used in attacks based on misleading human readers of text that
>    attempt to display them [TR36].
>
> Text does not itself attempt anything.  Better is "attacks
> based on attempting to display text that includes them".

Right.

>    [...] differs in programming-language implementations [...]
>
> I would say "differs between".

Or "among".

>    Thus, in theory, if a
>    specification requires that input data be encoded with UTF-8,
>    implementors should never have to concern themselves with surrogates.
>
> This sentence doesn't make sense to me.  If a specification requires
> something, there is no "in theory" which implies that the input data
> will conform to the specification.  Perhaps something like
>
>    Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence
>    which would map to a surrogate is ill-formed.  If a specification
>    requires that input data be encoded with UTF-8, and all input were
>    well-formed, implementors would never have to concern themselves
>    with surrogates.

That is better. I'd put another "if" before "all input".

> But it's not clear to me that the second sentence adds any useful
> information.  It seems that the paragraph could just continue with the
> next sentence:
>
>    Unfortunately, industry experience teaches that problematic code
>    points, including surrogates, can and do occur in program input where
>    the source of input data is not controlled by the implementor.

I agree. But several participants in the discussion were adamant that
saying "input is UTF-8" is sufficient, and while I don't agree, I think
it's OK to leave in this text acknowledging that, really, saying "UTF-8"
*should* be enough.

> If the source of the data is controlled by the implementor, it isn't
> "input".  So it seems to me that "where the source of input data is
> not controlled by the implementor" can be omitted.

Unconvinced. I think it's worth calling out that the incoming data may
contain things that do not match the label on the tin saying, for
example, "this is UTF-8". This feels like traditional spec language
describing that situation, and I'm not sure losing it increases clarity.

>    In
>    particular, the specification of JSON allows any code point to appear
>    in object member names and string values [RFC8259]; the following is
>    a conforming JSON text:
>
> It seems like this should start a new paragraph, and be prefixed with
> "For example,".

OK.

>    Reasonable options for dealing with problematic input include, first,
>    rejecting text containing problematic code points, and second,
>    replacing them with placeholders.  (As an exception, [UNICODE] notes
>    that it may in some cases be appropriate, specifically for
>    noncharacters, to treat them as non-problematic unassigned code
>    points.)
>
> I think you can omit "As an exception", since the parenthesized
> sentence already contains "may in some cases be appropriate".

Perhaps "on the other hand" or just "however". Now I'm wondering if the
parentheses add any value. Are there three equally-plausible
alternatives, or two plus this frankly weird suggestion from [UNICODE]?
Maybe just lose the whole mention? Other suggestions?

>    Silently deleting an ill-formed part of a string is a known security
>    risk.
>
> It seems well worth referencing a discussion of the "known security
> risk".

Will look.
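
While we're on surrogates leaking in through JSON, and the
reject-or-replace options: a small Python sketch of the failure mode.
The JSON value here is mine, not the draft's example, and the
replacement helper is just an illustration of the "replace with
placeholders" option, not anything the draft specifies.

    import json

    # A conforming JSON text whose string value escapes a lone surrogate.
    doc = json.loads('{"name": "\\uD800"}')
    print(f"U+{ord(doc['name'][0]):04X}")   # U+D800, a surrogate code point

    # Such a value cannot be serialized as well-formed UTF-8.
    try:
        doc["name"].encode("utf-8")
    except UnicodeEncodeError as err:
        print("not encodable as UTF-8:", err.reason)

    # The second option discussed above: replace problematic code points
    # with U+FFFD placeholders rather than silently deleting them.
    def replace_surrogates(text: str) -> str:
        return "".join(
            "\N{REPLACEMENT CHARACTER}" if 0xD800 <= ord(c) <= 0xDFFF else c
            for c in text)

    print(replace_surrogates(doc["name"]) == "\uFFFD")   # True
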
>    [RFC9413] emphasizes that when encountering problematic input,
>    software should consider the field as a whole, not individual code
>    points or bytes.
>
> This needs to be clarified; RFC 9413 does not contain the word
> "field", and only one instance of "as a whole" (in the phrase
> "protocol as a whole").

This is embarrassing. In commentary on an earlier draft, a person I tend
to believe (I forget who) said "Of course, RFC9413 says…", and that
language sounded wise and we included it without checking. In fact, 9413
says no such thing. I still think it's a sensible idea and, while I
would like to have an actually-accurate citation, would also like to
retain the suggestion even if we can't.

> 4.1.  Unicode Scalars
>
>    This subset is called the UnicodeScalarsClass for use in PRECIS.
>
> This is awkward.  Why not:
>
>    This subset is the PRECIS profile UnicodeScalarsClass.

Not sure. We're saying "while we call this 'Unicode Scalars', its name
in the PRECIS context is…"

> 4.2.  XML Characters
>
>    [...] surrogates, legacy C0 Controls, and the noncharacters U+FFFE [...]
>
> The phrase "legacy C0 Controls" is not defined.  I think you mean "C0
> Controls".

The phrase "Legacy Controls which are C0 Controls" relies only on
defined words. I think it might be forgivable to include the "C0" in the
middle of the term defined in 2.2.2.2, as being more readable?

> 4.3.  Unicode Assignables
>
>    This subset comprises
>    all code points that are currently assigned, or might in future be
>    assigned, to characters that are not legacy control codes.
>
> This is awkward because it seems to be careful to exclude "code points
> that might in future be assigned to characters that are legacy control
> codes", and of course there are none of those.  Probably better:
>
>    This subset comprises
>    all code points that are currently assigned,
>    excluding legacy control codes, or that might in future be
>    assigned.

Right.

> 5.  Using Subsets
>
>    These formats specify default subsets.
>
> This is unclear.  Do you mean
>
>    These specifications specify default subsets of Unicode for use in
>    their protocols.

Right.

>    Note that escaping techniques such as those in the JSON example in
>    Section 3 cannot be used to circumvent this sort of restriction,
>    which applies to data content, not textual representation in
>    packaging formats.
>
> This could be clarified.  Perhaps
>
>    A restriction placed on the contents of a name or value would not
>    be circumventable by an escaping technique (such as those in the
>    JSON example in Section 3) because the restriction applies to the
>    data content, not the textual representation of the content.

The two read almost identically to me. Anyone else have an opinion?

> 6.1.  Addition to the PRECIS Base Classes Registry
>
>    Reference:  Section 2 of this RFC
>
> This isn't flagged explicitly for Editor/IANA attention.  That may be
> OK, but usually these items are marked explicitly.  See also other
> occurrences of "this RFC".

Anyhow, the base class is going away.

> 6.2.3.  Unicode Assignables Profile
>
>    Applicability:  Protocols that want to allow all Unicode code points
>       that are currently assigned, or might be assigned in the future, to
>       characters that are not "legacy controls" as defined in
>       Section 2.2.2.2
>
> It seems like this should be "section 2.2.2.2 of [this RFC]".

Right.

> Also, see the comment for section 4.3.
>
> 7.  Security Considerations
>
> It might be worth pointing to section 3 here, as that section contains
> some security considerations, and points to security considerations
> documented elsewhere.

Right.
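
While we're on the three subsets (4.1 through 4.3): a rough Python
sketch of what the membership tests amount to. The surrogate and
noncharacter ranges follow [UNICODE]; the legacy-controls predicate is
my assumption about what Section 2.2.2.2 boils down to (C0 and C1
controls plus DEL, minus the useful tab, line feed, and carriage
return), since that section isn't quoted in this review. I've left out
4.2, whose definition is elided in the excerpt above.

    def is_surrogate(cp: int) -> bool:
        return 0xD800 <= cp <= 0xDFFF

    def is_noncharacter(cp: int) -> bool:
        # U+FDD0..U+FDEF, plus the last two code points of every plane.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    def is_legacy_control(cp: int) -> bool:
        # Assumption: controls other than the "useful" tab, LF, and CR.
        useful = {0x09, 0x0A, 0x0D}
        return (cp <= 0x1F or 0x7F <= cp <= 0x9F) and cp not in useful

    def is_unicode_scalar(cp: int) -> bool:        # 4.1, Unicode Scalars
        return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)

    def is_unicode_assignable(cp: int) -> bool:    # 4.3, Unicode Assignables
        return (is_unicode_scalar(cp)
                and not is_noncharacter(cp)
                and not is_legacy_control(cp))
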
>    Note that the Unicode-character subsets specified in this document
>    include a successively-decreasing number of problematic code points,
>    [...]
>
> It might be worth explicitly saying "problematic code points (as
> defined in section 2.2)" so section 7 can be read correctly by someone
> who hasn't read the rest of the document.

Is that really a goal?

> 8.  Normative References
>
>    [UNICODE]  The Unicode Consortium, "The Unicode Standard",
>               <http://www.unicode.org/versions/latest/>.  Note that this
>               reference is to the latest version of Unicode, rather than
>               to a specific release.  It is not expected that future
>               changes in the Unicode Standard will affect the referenced
>               definitions.
>
> It isn't your problem, but currently the URL
> <http://www.unicode.org/versions/latest/> goes to a page titled
> "Unicode(R) 16.0.0", but that page gives only a summary of changes,
> not the contents of Unicode 16.  You have to go to
> e.g. <https://www.unicode.org/versions/Unicode15.0.0/> to see the
> standard.

Yes, and I've been working to try to find the right Unicode person to
yell at about this. This is awful because plenty of specifications, in
the IETF and elsewhere, rely on that URL, which in my view has been
gratuitously broken in the latest revision.

> [END]