[Gen-art] Re: Genart early review of draft-bray-unichars-09

Dale R. Worley Wed, 30 Oct 2024 19:26:23 -0700

Of course, I leave all of this to the authors and ultimately the working
group.  But a few comments:

Tim Bray <tb...@textuality.com> writes:
> You have a point, but I’m reluctant, for several reasons. First, I disagree
> that the doc is organized per PRECIS; in fact, it makes use of exactly zero
> of the considerable apparatus that PRECIS builds to support its profile
> definitions. First, the organization is, rather, “Here the problems with
> code points, and here are three subsets that, to a varying degree, exclude
> them, and here are the necessary declarations to use these as PRECIS
> profiles.”  Second, I don’t want to create the expectation that every
> off-the-shelf PRECIS library will know about “XML characters” or “Unicode
> assignable”. Third, I don’t want to create the impression that a specifier
> must understand PRECIS to use Unichars. PRECIS is a big and complicated,
> really quite a heavy lift.

Hmmmm.  Is it possible to get that point (those points!) across briefly
in section 1?  When I was writing the review, I had the feeling that
this I-D is somehow founded on PRECIS.  I mean, all of the IANA actions
are additions to PRECIS tables.  And literally 4% of the words in the
Abstract are "PRECIS".  Clarifying how Unichars relates to PRECIS would
be helpful.

>> This is an awkward mix of singular and plural usages.  Inquire of
>> Editor the best way to phrase this.

I've gradually come to believe that when one is talking of generic
entities, it's easier to make the sentences work if you use singulars
whenever possible.

>> I think the usual terminology would be "variable-length sequences of
>> 8-bit chunks" or better "variable-length sequences of octets".
>
> Really? The document is written for programmers for whom “variable-length
> byte sequences” is super-idiomatic.

Yes, it is idiomatic.  OTOH, RFCs seem to have labored long and hard to
not use "byte". ... OTOOH, about as many RFCs use octet (3490) as byte
(3531).  Wikipedia says

    The octet is a unit of digital information in computing and
    telecommunications that consists of eight bits. The term is often
    used when the term byte might be ambiguous, as the byte has
    historically been used for storage units of a variety of sizes. 

Historically, I would expect the distinction to be driven by the use of
the PDP-10.  I suspect my personal attitude is driven by the fact that
the SIP RFCs consistently use octets or characters, as needed, and not
bytes.

And the matter hasn't been settled recently; RFC 9659 and RFC 9661 are
the most recent contrasting pair.

Really, the Editor should have an opinion about this.

>>   [RFC9413] emphasizes that when encountering problematic input,
>>   software should consider the field as a whole, not individual code
>>   points or bytes.
>>
>> This needs to be clarified; RFC 9413 does not contain the word
>> "field", and only one instance of "as a whole" (in the phrase
>> "protocol as a whole").
>
> This is embarrassing. In commentary on an earlier draft, a person I tend to
> believe (I forget who) said “Of course, RFC9413 says…” and that language
> sounded wise and we included it without checking. In fact, 9413 says no
> such thing.  I still think it’s a sensible idea and, while I would like to
> have an actually-accurate citation, would also like to retain the
> suggestion even if we can’t.

Oh, yes, I didn't mean to omit the guidance, rather to make sure there's
a good pointer to the discussion of *why* just e.g. dropping individual
bogus bytes may not be a good strategy, given that it is obvious (and
simple to implement).
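
To illustrate the concern, here is a minimal Python sketch of one way that
dropping individual bogus characters can go wrong.  It is not taken from
the draft or from RFC 9413; the function names, the reserved name "admin",
and the choice of which controls to strip are all assumptions made up for
the example.  A value that passes a check in its raw form turns into the
forbidden value once the "cleaning" step deletes a stray NUL.

```python
def drop_controls(text: str) -> str:
    """The 'obvious' strategy: silently delete C0 controls other than
    tab, LF and CR.  (Hypothetical helper, for illustration only.)"""
    return "".join(ch for ch in text if ord(ch) >= 0x20 or ch in "\t\n\r")


def is_blocked(name: str) -> bool:
    """Hypothetical filter that rejects a single reserved name."""
    return name == "admin"


def reject_whole_field(text: str) -> str:
    """Alternative: treat the whole field as invalid if it contains any
    of the controls that drop_controls() would have deleted."""
    if any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in text):
        raise ValueError("field contains control characters")
    return text


raw = "ad\x00min"                    # NUL hidden inside the value
print(is_blocked(raw))               # False: the raw value passes the filter
print(is_blocked(drop_controls(raw)))  # True: cleaning produced "admin"
```

Rejecting (or replacing) the field as a unit avoids having a downstream
component see a different value than the one that was checked.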

>>   [...] surrogates, legacy C0 Controls, and the noncharacters U+FFFE [...]
>>
>> The phrase "legacy C0 Controls" is not defined.  I think you mean "C0
>> Controls".
>
> The phrase “Legacy Controls which are C0 Controls” relies only on defined
> words. I think it might be forgivable to include the “C0” in the middle of
> the term defined in 2.2.2.2 as more readable?

My reflex is to disagree, but I am picky about using defined terms
exactly.  I mean, exactly what does "legacy C0 controls" mean if the
term is used nowhere else in this document?

The underlying problem is that while everybody else sensibly considers
the C1 controls pretty much like the C0 controls (as in "legacy
controls"), XML for some reason treats the C1 controls like normal
characters.
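
The asymmetry can be seen directly from the Char production in XML 1.0.
The following Python sketch encodes that production as I understand it
(#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF])
and counts how many C0 controls (U+0000 to U+001F) and C1 controls
(U+0080 to U+009F) it admits.  The function name is made up for the
example, and this is not the draft's own definition of its subsets.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


C0 = range(0x00, 0x20)   # U+0000..U+001F
C1 = range(0x80, 0xA0)   # U+0080..U+009F

c0_allowed = [cp for cp in C0 if is_xml10_char(cp)]
c1_allowed = [cp for cp in C1 if is_xml10_char(cp)]

print([hex(cp) for cp in c0_allowed])  # only tab, LF and CR survive
print(len(c1_allowed), "of", len(C1), "C1 controls are allowed")
```

Only three of the 32 C0 controls pass, while all 32 C1 controls do.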

>> It isn't your problem, but currently the URL
>> <http://www.unicode.org/versions/latest/> goes to a page titled
>> "Unicode(R) 16.0.0", but that page gives only a summary of changes,
>> not the contents of Unicode 16.  You have to go to
>> e.g. <https://www.unicode.org/versions/Unicode15.0.0/> to see the
>> standard.
>
> Yes, and I’ve been working to try to find the right Unicode person to yell
> at about this.

Heh!

Dale
