On 5/9/18, 23:14, "Makarius" <makar...@sketis.net> wrote:

    On 28/08/18 03:00, michael.norr...@data61.csiro.au wrote:
    >> 
    >> - there is evidently a bug in the position information when the leading
    >> delimiter is a Unicode quote mark (the number of bytes gets counted rather
    >> than the number of "characters");
    
    > With Unicode text it is inherently difficult to say how "characters" are
    > counted, and thus to say what is right or wrong. Whatever you do is
    > likely to be wrong in some sense.
    
    > For internal counting, I would prefer the raw byte address, as we have
    > it already in our fine SML strings that are undiluted by Unicode. It
    > allows to access large text chunks quickly without recoding back and
    > forth.

We have committed to UTF-8 everywhere internally, and expect to read it from
and write it to files and streams, so we don't need to do any recoding per se.
This does mean that we are utterly incompatible with environments that are not
UTF-8, and that we occasionally have to take care not to break strings badly
(String.substring cannot be blindly applied to byte sequences that are
"secretly" in UTF-8, for example).
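
To make the substring hazard concrete, here is a rough sketch in Python (not
SML, and not the HOL sources; the names is_continuation, safe_prefix and
naive_prefix_is_valid are made up for illustration). It assumes the input is
well-formed UTF-8, and shows a byte-indexed cut landing inside a multi-byte
quote mark, plus one way of backing up to a code point boundary.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def safe_prefix(raw: bytes, n: int) -> bytes:
    """Return at most n bytes of raw, never cutting inside a code point.

    Hypothetical helper: assumes raw is well-formed UTF-8.
    """
    n = max(0, min(n, len(raw)))
    # If the byte just after the cut is a continuation byte, the cut is in
    # the middle of a code point, so move it left to the lead byte.
    while 0 < n < len(raw) and is_continuation(raw[n]):
        n -= 1
    return raw[:n]


def naive_prefix_is_valid(raw: bytes, n: int) -> bool:
    """Check whether a blind byte slice still decodes as UTF-8."""
    try:
        raw[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+201C and U+201D (curly double quotes) each take three bytes in UTF-8.
quoted = "\u201cx\u201d".encode("utf-8")

print(len(quoted))                        # total bytes in the encoding
print(naive_prefix_is_valid(quoted, 2))   # cut lands inside the first quote
print(safe_prefix(quoted, 5).decode("utf-8"))
```

The same boundary test (top two bits equal to 10) would carry over to SML
strings by looking at Char.ord of the character at the cut position.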
    
    > For external purposes, e.g. error messages with characters positions, it
    > is hard to tell. It depends on the "consumers" that you have in mind: a
    > Java front-end is likely to expect 16-bit Char addresses. Unix tools are
    > likely to expect UTF-8 characters in the sense of codepoints. Nothing of
    > this is smallest textual unit, if you allow official Unicode in its full
    > complexity (but nobody has implemented that correctly anyway).
    
As I'm willing to ignore combining characters etc., my desired behaviour (not
yet realised, as the above bug demonstrates) is indeed to take the "Unix"
attitude of reporting codepoint offsets.

Michael
    

_______________________________________________
hol-info mailing list
hol-info@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hol-info
