On 5/9/18, 23:14, "Makarius" <makar...@sketis.net> wrote:
> On 28/08/18 03:00, michael.norr...@data61.csiro.au wrote:
>>
>> - there is evidently a bug in the position information when the
>>   leading delimiter is a Unicode quote mark (the number of bytes gets
>>   counted rather than the number of "characters");
>
> With Unicode text it is inherently difficult to say how "characters" are
> counted, and thus to say what is right or wrong. Whatever you do is
> likely to be wrong in some sense.
>
> For internal counting, I would prefer the raw byte address, as we have
> it already in our fine SML strings that are undiluted by Unicode. It
> allows to access large text chunks quickly without recoding back and forth.

We have committed to UTF-8 everywhere internally, and expect to consume
and produce it from and to files and streams, so we don't need to do any
recoding per se. This does mean that we are utterly incompatible with
environments that are not UTF-8, and occasionally have to take care not
to break strings badly (String.substring cannot be blindly applied to
byte sequences that are "secretly" in UTF-8, for example).

> For external purposes, e.g. error messages with characters positions, it
> is hard to tell. It depends on the "consumers" that you have in mind: a
> Java front-end is likely to expect 16-bit Char addresses. Unix tools are
> likely to expect UTF-8 characters in the sense of codepoints. Nothing of
> this is smallest textual unit, if you allow official Unicode in its full
> complexity (but nobody has implemented that correctly anyway).

Since I'm willing to ignore combining characters etc., my desired
behaviour (not yet realised, as the bug above demonstrates) is indeed to
take the "Unix" attitude of reporting codepoint offsets.

Michael
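
For illustration only, the three candidate offsets discussed above (raw
UTF-8 byte, codepoint, and 16-bit Char) can be compared in a small Python
sketch. This is not HOL or Isabelle code: the function names and the
sample string are hypothetical, and combining characters are ignored, as
in the message.

```python
def codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Count the codepoints that start before byte_offset in UTF-8 data."""
    # A UTF-8 continuation byte has the bit pattern 10xxxxxx; every other
    # byte starts a new codepoint, so counting those gives the offset.
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


def utf16_offset(data: bytes, byte_offset: int) -> int:
    """Count the 16-bit code units before byte_offset.

    Assumes byte_offset falls on a codepoint boundary; otherwise the
    decode step raises UnicodeDecodeError.
    """
    prefix = data[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# Hypothetical input: a Unicode left quote as leading delimiter, one
# character outside the Basic Multilingual Plane, a right quote, then " x".
text = "\u201c\U0001d465\u201d x"
data = text.encode("utf-8")

byte_off = data.index(b"x")              # raw byte address of the ASCII "x"
cp_off = codepoint_offset(data, byte_off)
u16_off = utf16_offset(data, byte_off)

print(byte_off, cp_off, u16_off)         # prints: 11 4 5
```

The same position is reported as 11, 4, or 5 depending on the consumer,
which is the discrepancy behind the quote-mark bug. Note that a blind
byte slice such as data[:1] cuts the opening quote in half and is no
longer valid UTF-8, the hazard mentioned for String.substring.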
_______________________________________________
hol-info mailing list
hol-info@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hol-info