On 2018-11-13 06:35, Geoff Canyon via use-livecode wrote:
I didn't realize until now that offset() simply fails with some unicode
strings:

put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0

On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon <gcan...@gmail.com> wrote:

A few things:

1. It seems codepointOffset can only find a single character? So it
won't work for any search for a multi-character string?
2: codepointOffset seems to work differently for multi-byte characters and
regular characters:

put codepointoffset("e","↘ndatestest",6) -- puts 3
put codepointoffset("e","andatestest",6) -- puts 9

3: It seems that when multi-byte characters are involved, codepointOffset suffers from the same sort of slow-down as offset does. For example, in a
145K string with about 20K hits for a single character, a simple
codepointOffset routine (below) takes over 10 seconds, while the item-based
routine takes about 0.3 seconds for the same results.

There is something 'funky' going on with the *offset functions - even taking into account that codeunitOffset/codepointOffset/byteOffset return an absolute position rather than a relative one. I noticed something similar the other day, so I'll endeavour to take a look.

Regardless of any gremlins, the speed difference (although I've not seen the 'item-based' routine, so I don't know exactly what it does) comes down to the fact that comparing text correctly requires a lot of processing.

Unicode is a multi-code representation of text *regardless* of the encoding: a single (what humans understand to be) character can be composed of multiple codes. For example, e-acute can be represented as [e-acute] or [e, combining-acute]. Indeed, Unicode allows an arbitrary sequence of combining marks to be attached to a base character (whether your text rendering system can deal with such things is another matter). Some languages rely entirely on multi-code sequences - e.g. the Indic languages - and, more recently, emoji use various 'variation selectors' to provide variants of the base set of emoji, which are thus composed of multiple codes.
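
For example, a minimal sketch in LiveCode script of the two spellings of e-acute (assuming an engine with codepoint chunks, i.e. 7.0 or later):

   -- "é" as a single precomposed code vs. a base letter plus a combining acute
   put numToCodepoint(233) into tComposed            -- U+00E9, [e-acute]
   put "e" & numToCodepoint(769) into tDecomposed    -- U+0301 is the combining acute
   put the number of codepoints in tComposed    -- 1
   put the number of codepoints in tDecomposed  -- 2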

Therefore, the only real difference between UTF-8, UTF-16 and UTF-32 as a chosen representation is how densely they store text in a given language (there's a rough byte-count sketch after this list) - processing-wise they all have to do the same 'yoga' to 'work' correctly when you are doing text processing:

- UTF-8: preserves the NUL byte (important for C systems); ASCII is 1 byte per code, European/Roman languages are generally at most 2 bytes per code, and every other language is 3-4 bytes per code (i.e. it heavily penalises Asian languages).

- UTF-16: does not preserve the NUL byte (as it's a sequence of 16-bit unsigned integers); most languages in common use today (including Indic and CJK) are almost entirely encoded as 2 bytes per code, interspersed with occasional 4-byte codes (via the 'surrogate pair' mechanism).

- UTF-32: does not preserve the NUL byte (as it's a sequence of 32-bit unsigned integers); all languages require 4 bytes per code, so it wastes on average around 15 bits per code (as almost all Unicode codes currently defined need at most 17 bits).
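
To make the density point concrete, here is a rough sketch using textEncode (the byte counts assume no BOM is prepended; the exact figures obviously depend on the text):

   put "déjà vu" into tSample   -- 7 codes, 2 of them non-ASCII
   put the number of bytes in textEncode(tSample, "UTF-8")    -- 9 (the accented letters take 2 bytes each)
   put the number of bytes in textEncode(tSample, "UTF-16")   -- 14 (7 codes x 2 bytes)
   put the number of bytes in textEncode(tSample, "UTF-32")   -- 28 (7 codes x 4 bytes)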

Indeed, in a fully correct *general* implementation of Unicode processing, UTF-32 will perform (on average) half as fast as an implementation based on UTF-16, and UTF-8 will only perform better than UTF-16 *if* the majority of your texts are ASCII with occasional non-ASCII. (The latter is generally the case for European/Roman languages, but if you involve the Asian languages then UTF-8 will, on average, be around a third slower.)

[ This observation is based on the fact that all Unicode text processing - if implemented in the most efficient way - ends up being memory bound on modern processors, and as such average speed is determined by how many bytes the input and output strings take up. ]

So, the reason 'offset' appears slow is that it is doing a lot more work than you think. There are two principal processes which need to be applied to Unicode text in order to do any sort of comparison:

   - normalization
   - case-folding

Case-folding applies if you want to do caseless rather than case-sensitive comparison. It isn't *quite* upper-casing or lower-casing, but rather a mapping from char->upper_char->lower_char so that differences in case are removed. Case-folding is a little more complex than just code->code mappings - for example, ß in German maps to SS in full generality. There are also some hairy edge cases with composed Unicode characters (particularly, but not exclusively, when considering polytonic Greek).
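
Case-folding is what the engine does for you whenever caseSensitive is false; a minimal sketch (plain ASCII for simplicity - ß and polytonic Greek are exactly where a simple per-code mapping falls short):

   set the caseSensitive to false   -- the default: the engine case-folds before comparing
   put ("Offset" is "OFFSET")       -- true
   set the caseSensitive to true    -- codes must now match exactly
   put ("Offset" is "OFFSET")       -- false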

Normalization is the process of ensuring that two canonically equivalent sequences of codes are mapped to the same sequence of codes - it is *mostly* orthogonal to case-folding (at least if you normalize to the decomposed form). Basically, normalization discards the difference between sequences such as [e, combining-acute] and [e-acute]... However, again it isn't quite as simple as directly mapping sequences, as combining marks have a priority which means their order after the base character doesn't matter. Similarly, you have things like the Hangul encoding used for Korean, whose syllables compose and decompose in a specialized way (best done algorithmically, as that's generally faster than using a lookup table, which would be huge).
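
A hedged sketch of that in LiveCode script - after NFD normalization the two spellings of e-acute become the same code sequence, so even an exact (formSensitive) comparison matches:

   put normalizeText(numToCodepoint(233), "NFD") into tA         -- from [e-acute]
   put normalizeText("e" & numToCodepoint(769), "NFD") into tB   -- from [e, combining-acute]
   set the formSensitive to true   -- compare the code sequences exactly
   put (tA is tB)   -- true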

Anyway, the upshot: offset has to do a lot of work when done in isolation, as the engine doesn't know you are using it repeatedly. However, you can reduce its workload significantly by normalizing and case-folding the inputs to it first, and then setting caseSensitive and formSensitive to true:

   put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
   put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
   set the caseSensitive to true
   set the formSensitive to true

Here the caseSensitive property controls whether the engine case-folds strings before comparison, and formSensitive controls whether the engine normalizes strings before comparison. They can be set independently (although I must confess I'd have to double check the utility of caseSensitive false but formSensitive true - for reasons alluded to above!).

Note that here I've used "NFD", which is the fully decomposed normalization form (meaning no composed characters such as e-acute will appear), and upper-casing as the folding - which should cover both the issues with doing caseless comparison on composed sequences and edge cases like ß. There are a couple of engine additions which would be useful here - one giving direct access to Unicode's 'case-folding', and one returning a string pre-processed according to the caseSensitive/formSensitive settings.

With the above (modulo bugs in offset and friends - which appear to be rearing their heads based on Geoff's comments), offset should run much quicker.
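
Putting that together, a hedged sketch of a 'find every offset' routine along those lines (note the returned positions are positions in the normalized, upper-cased text, which can differ from positions in the original wherever normalization or case mapping changes the number of codes):

   function allOffsets pNeedle, pHaystack
      local tSkip, tFound, tOffsets
      -- fold and normalize once, up front, so offset doesn't redo it on every call
      put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
      put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
      set the caseSensitive to true
      set the formSensitive to true
      put 0 into tSkip
      repeat forever
         put offset(pNeedle, pHaystack, tSkip) into tFound
         if tFound is 0 then exit repeat
         put tFound + tSkip & comma after tOffsets
         put tFound + tSkip into tSkip
      end repeat
      return char 1 to -2 of tOffsets   -- drop the trailing comma
   end function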

Warmest Regards,

Mark.

P.S. I did notice an issue with toUpper/toLower recently: when the input string is Unicode it applies the 'full' Unicode rules, including the German ß mapping; if the input is native, however, it does not. For backwards compatibility's sake (and because most people don't expect upper-casing/lower-casing to change the number of characters in a string), it should probably only use the simple rules unless you explicitly ask otherwise. Monte did some work last week to start adding a second 'locale' parameter to control the precise (complex) mappings - casing behaviour is actually defined by language/region, not by character. For example, Turkish and Azeri both have a dotted and a dotless I, and what happens when one or the other is upper-cased varies depending on which language is being represented.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
