On 2018-11-13 06:35, Geoff Canyon via use-livecode wrote:
I didn't realize until now that offset() simply fails with some unicode
strings:

put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0

On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon <gcan...@gmail.com> wrote:

A few things:

1. It seems codepointOffset can only find a single character? So it
won't work for any search for a multi-character string?
2: codepointOffset seems to work differently for multi-byte characters and
regular characters:

put codepointoffset("e","↘ndatestest",6) -- puts 3
put codepointoffset("e","andatestest",6) -- puts 9

3: It seems that when multi-byte characters are involved, codepointOffset suffers from the same sort of slow-down as offset does. For example, in a
145K string with about 20K hits for a single character, a simple
codepointOffset routine (below) takes over 10 seconds, while the item-based
routine takes about 0.3 seconds for the same results.

There is something 'funky' going on with the *offset functions - even taking into account that codeunitOffset/codepointOffset/byteOffset return an absolute position rather than a relative one. I noticed something similar the other day, so I'll endeavour to take a look.

Regardless of any gremlins, the speed difference (although I've not seen the 'item-based' routine, so I don't know exactly what it does) comes down to the fact that comparing text correctly requires a lot of processing.

Unicode is a multi-code representation of text *regardless* of the encoding: a single (what humans understand to be) character can be composed of multiple codes. For example, e-acute can be represented as [e-acute] or [e, combining-acute]. Indeed, Unicode allows an arbitrary sequence of combining marks to be attached to a base character (whether your text rendering system can deal with such things is another matter). Some languages rely entirely on multi-code sequences - e.g. the Indic languages - and, more recently, emoji use various 'variation selectors' to provide variants of the base set of emoji, which are thus composed of multiple codes.
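
For example, a minimal sketch in LiveCode script of the two spellings of e-acute (assuming an engine with codepoint chunks, i.e. 7.0 or later):

   -- "é" as a single precomposed code vs. a base letter plus a combining acute
   put numToCodepoint(233) into tComposed            -- U+00E9, [e-acute]
   put "e" & numToCodepoint(769) into tDecomposed    -- U+0301 is the combining acute
   put the number of codepoints in tComposed    -- 1
   put the number of codepoints in tDecomposed  -- 2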

Therefore, the only real difference between UTF-8, UTF-16 and UTF-32 as a chosen representation is how densely they store text in a given language (there's a rough byte-count sketch after this list) - processing-wise they all have to do the same 'yoga' to 'work' correctly when you are doing text processing:

- UTF-8: preserves the NUL byte (important for C systems); ASCII is 1 byte per code, European/Roman languages are generally at most 2 bytes per code, and every other language is 3-4 bytes per code (i.e. it heavily penalises Asian languages).

- UTF-16: does not preserve the NUL byte (as it's a sequence of 16-bit unsigned integers); most languages in common use today (including Indic and CJK) are almost entirely encoded as 2 bytes per code, interspersed with occasional 4-byte codes (via the 'surrogate pair' mechanism).

- UTF-32: does not preserve the NUL byte (as it's a sequence of 32-bit unsigned integers); all languages require 4 bytes per code, so it wastes on average around 15 bits per code (as almost all Unicode codes currently defined need at most 17 bits).
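
To make the density point concrete, here is a rough sketch using textEncode (the byte counts assume no BOM is prepended; the exact figures obviously depend on the text):

   put "déjà vu" into tSample   -- 7 codes, 2 of them non-ASCII
   put the number of bytes in textEncode(tSample, "UTF-8")    -- 9 (the accented letters take 2 bytes each)
   put the number of bytes in textEncode(tSample, "UTF-16")   -- 14 (7 codes x 2 bytes)
   put the number of bytes in textEncode(tSample, "UTF-32")   -- 28 (7 codes x 4 bytes)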

Indeed, in a fully correct *general* implementation of Unicode processing, UTF-32 will perform (on average) half as fast as an implementation based on UTF-16, and UTF-8 will only perform better than UTF-16 *if* the majority of your texts are ASCII with occasional non-ASCII. (The latter is generally the case for European/Roman languages, but if you involve the Asian languages then UTF-8 will, on average, be around a third slower.)

[ This observation is based on the fact that all Unicode text processing - if implemented in the most efficient way - ends up being memory bound on modern processors, and as such average speed is determined by how many bytes the input and output strings take up. ]

So, the reason 'offset' appears slow is that it is doing a lot more work than you think. There are two principal processes which need to be applied to Unicode text in order to do any sort of comparison:

   - normalization
   - case-folding

Case-folding applies if you want to do caseless rather than case-sensitive comparison. It isn't *quite* upper-casing or lower-casing, but rather a mapping from char->upper_char->lower_char so that differences in case are removed. Case-folding is a little more complex than just code->code mappings - for example, ß in German maps to SS in full generality. There are also some hairy edge cases with composed Unicode characters (particularly, but not exclusively, when considering polytonic Greek).
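
Case-folding is what the engine does for you whenever caseSensitive is false; a minimal sketch (plain ASCII for simplicity - ß and polytonic Greek are exactly where a simple per-code mapping falls short):

   set the caseSensitive to false   -- the default: the engine case-folds before comparing
   put ("Offset" is "OFFSET")       -- true
   set the caseSensitive to true    -- codes must now match exactly
   put ("Offset" is "OFFSET")       -- false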

Normalization is the process of ensuring that two canonically equivalent sequences of codes are mapped to the same sequence of codes - it is *mostly* orthogonal to case-folding (at least if you normalize to the decomposed form). Basically, normalization discards the difference between sequences such as [e, combining-acute] and [e-acute]... However, again it isn't quite as simple as directly mapping sequences, as combining marks have a priority which means their order after the base character doesn't matter. Similarly, you have things like the Hangul encoding used for Korean, whose syllables compose and decompose in a specialized way (best done algorithmically, as that's generally faster than using a lookup table, which would be huge).
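
A hedged sketch of that in LiveCode script - after NFD normalization the two spellings of e-acute become the same code sequence, so even an exact (formSensitive) comparison matches:

   put normalizeText(numToCodepoint(233), "NFD") into tA         -- from [e-acute]
   put normalizeText("e" & numToCodepoint(769), "NFD") into tB   -- from [e, combining-acute]
   set the formSensitive to true   -- compare the code sequences exactly
   put (tA is tB)   -- true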

Anyway, the upshot: offset has to do a lot of work when done in isolation, as the engine doesn't know you are using it repeatedly. However, you can reduce its workload significantly by normalizing and case-folding the inputs to it first, and then setting caseSensitive and formSensitive to true:

   put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
   put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
   set the caseSensitive to true
   set the formSensitive to true

Here the caseSensitive property controls whether the engine case-folds strings before comparison, and formSensitive controls whether the engine normalizes strings before comparison. They can be set independently (although I must confess I'd have to double check the utility of caseSensitive false but formSensitive true - for reasons alluded to above!).

Note that here I've used "NFD", which is the fully decomposed normalization form (meaning no composed characters such as e-acute will appear), and upper-casing as the folding - which should cover both the issues with doing caseless comparison on composed sequences and edge cases like ß. There are a couple of engine additions which would be useful here - one giving direct access to Unicode's 'case-folding', and one returning a string pre-processed according to the caseSensitive/formSensitive settings.

With the above (modulo bugs in offset and friends - which appear to be rearing their heads based on Geoff's comments), offset should run much quicker.
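
Putting that together, a hedged sketch of a 'find every offset' routine along those lines (note the returned positions are positions in the normalized, upper-cased text, which can differ from positions in the original wherever normalization or case mapping changes the number of codes):

   function allOffsets pNeedle, pHaystack
      local tSkip, tFound, tOffsets
      -- fold and normalize once, up front, so offset doesn't redo it on every call
      put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
      put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
      set the caseSensitive to true
      set the formSensitive to true
      put 0 into tSkip
      repeat forever
         put offset(pNeedle, pHaystack, tSkip) into tFound
         if tFound is 0 then exit repeat
         put tFound + tSkip & comma after tOffsets
         put tFound + tSkip into tSkip
      end repeat
      return char 1 to -2 of tOffsets   -- drop the trailing comma
   end function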

Warmest Regards,

Mark.

P.S. I did notice an issue with toUpper/toLower recently: when the input string is Unicode it applies the 'full' Unicode rules, including the German ß mapping; if the input is native, however, it does not. For backwards compatibility's sake (and because most people don't expect upper-casing/lower-casing to change the number of characters in a string), it should probably only use the simple rules unless you explicitly ask otherwise. Monte did some work last week to start adding a second 'locale' parameter to control the precise (complex) mappings - casing behaviour is actually defined by language/region, not by character. For example, Turkish and Azeri both have a dotted and a dotless I, and what happens when one or the other is upper-cased varies depending on which language is being represented.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
