On 28/08/18 03:00, michael.norr...@data61.csiro.au wrote:
> 
> - there is evidently a bug in the position information when the leading 
> delimiter is a Unicode quote mark (the number of bytes gets counted rather 
> than the number of "characters");

With Unicode text it is inherently difficult to say how "characters" are
counted, and thus to say what is right or wrong. Whatever you do is
likely to be wrong in some sense.

Many people think of Unicode characters in the sense of Java or
JavaScript, which used to be UCS-2, but is now UTF-16 (with its 16 or 32
bit chars).

The website https://utf8everywhere.org provides very useful links and
explanations on the many confusions around Unicode, and hints to avoid
most of them.


For internal counting, I would prefer the raw byte address, as we have
it already in our fine SML strings that are undiluted by Unicode. It
allows quick access to large text chunks without recoding back and forth.

For external purposes, e.g. error messages with character positions, it
is hard to tell. It depends on the "consumers" that you have in mind: a
Java front-end is likely to expect 16-bit Char addresses. Unix tools are
likely to expect UTF-8 characters in the sense of codepoints. None of
these is the smallest textual unit, if you allow official Unicode in its
full complexity (but nobody has implemented that correctly anyway).
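
To make the difference concrete, here is a minimal sketch in Python (not
SML, and not how HOL does it) that keeps a raw byte offset internally and
translates it only when reporting. The function name external_offsets and
the sample term are hypothetical, and it assumes the source text is valid
UTF-8 and that the offset does not split a multi-byte sequence.

```python
def external_offsets(data: bytes, byte_offset: int) -> dict:
    """Translate a raw byte offset into UTF-8 data into consumer-facing offsets."""
    # Decoding only the prefix raises UnicodeDecodeError if the offset
    # falls in the middle of a multi-byte sequence.
    prefix = data[:byte_offset].decode("utf-8")
    return {
        "bytes": byte_offset,                                 # internal address
        "codepoints": len(prefix),                            # typical Unix tools
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,  # Java/JavaScript
    }


# A term delimited by Unicode quote marks; U+201C takes 3 bytes in UTF-8,
# so the first token after the leading delimiter starts at byte offset 3.
source = "\u201cp \u2227 q\u201d".encode("utf-8")
print(external_offsets(source, 3))

# A codepoint outside the BMP (U+1D54F) takes 4 bytes in UTF-8 and
# 2 code units in UTF-16, so all three numbers differ.
astral = "\U0001d54f".encode("utf-8")
print(external_offsets(astral, 4))
```

For the leading quote mark the byte offset is 3, while both the codepoint
and the UTF-16 offset are 1. For the non-BMP codepoint the three answers
are 4, 1 and 2, so the "right" column depends on who reads the message.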


        Makarius

_______________________________________________
hol-info mailing list
hol-info@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hol-info