On 2024-06-29 08:53, Neville Smythe via use-livecode wrote:
Is it not the case that the processing time for looping over the number of lines and getting the k-th line in each iteration, for some arbitrary k, is going to be

O(N^2) * C * T

where

N = the number of lines in the text

C = the (average) number of codepoints in each line

T = the (average) time for processing each codepoint to check for a return character

Now N and C are the same whether the text is ascii or unicode

Largely - yes - although for stuff like this you need to think in terms of bytes not codepoints (as memory throughput becomes 'a thing' when the strings are anything longer than a few characters) - so unicode is 2*ascii in this regard

[ It's actually more than 2x for longer strings, but how much more depends on CPU/memory architecture - CPUs can only read from their level 1 cache, there's a cost to a cache miss, and you get 2x as many cache misses with unicode data as with native data, assuming the data is larger than a single level 1 cache line. ]
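Concretely, the loop shape being analyzed is something like this (handler and variable names here are illustrative, not from Neville's script) - each 'line k of' has to scan from the start of the string, so the loop touches 1 + 2 + ... + N = N(N+1)/2 lines in total, which is where the N^2 comes from:

  on testLineAccess fff
     local tLine
     repeat with k = 1 to the number of lines in fff
        -- 'line k of fff' re-scans from the start of the string on every
        -- iteration, so the whole loop performs O(N^2) line scans
        put line k of fff into tLine
     end repeat
  end testLineAccess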

Test 1
If I get rid of the red herring by replacing the matchChunk call with a simple
...
Which would appear to mean that processing unicode codepoints for the apparently simple task of checking for return takes 100 times as much time as processing ascii. That seems excessive, to put it mildly!

It's certainly a lot slower, but then searching unicode text for a string is (in the general case) a lot more complex than searching native/ascii text for a string.

Test 2
With the original code using matchChunk - which of course is going to have its own internal loop over codepoints, so multiply by another 8 (it only searches the first few characters), and will not always return true, so more lines must be processed - but again these are the same multipliers whether ascii or unicode
...
Plain ascii takes 0.07 seconds
Unicode takes 19.9 seconds, a multiplier of nearly 300. I can easily believe matchChunk takes 3 times as long to process unicode as ascii; this is the sort of factor I would have expected in Test 1.

So 'Test 2' is slightly misleading - as it still suggests matchChunk is causing a slowdown - which it isn't.

The difference here is that Test 2 is doing more work, as it isn't always exiting early. If you test:

  get line k of fff
  put true into tFound

I suspect you'll find that the time to process your data, if it contains unicode, is pretty similar to the time when matchChunk is also called.

In my quick test (which is 32 index lines, 200 fff lines) I get about 10ms (no unicode) vs 1400ms (unicode).
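For reference, my guess at the shape of the test being discussed (the original script isn't quoted here, so the handler name, variable names and nesting are assumptions on my part):

  on testNestedScan tIndexLines, fff
     local tFound
     repeat for each line tIndex in tIndexLines
        put false into tFound
        repeat with k = 1 to the number of lines in fff
           -- matchChunk treats tIndex as a regex pattern here
           if matchChunk(line k of fff, tIndex) then
              put true into tFound
              -- a successful match exits early and skips the rest of fff;
              -- failed matches mean more 'line k of' scans, not more
              -- matchChunk cost
              exit repeat
           end if
        end repeat
     end repeat
  end testNestedScan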

OK Mark, hit me again, I am a glutton for punishment, what is wrong with this analysis?

Nothing in particular - apart from thinking that matchChunk is actually a relevant factor here ;)

This delimiter search operation on unicode strings is so much slower than native for two reasons:
  1) We (well, I) heavily optimized the core native/ascii string operations in 2015 to make sure they were as fast as possible.
  2) Searching a unicode string for another string (which is what is going on here) is much more complex than doing the same for native/ascii.

Native/ascii strings have some very pleasant properties:
  - one byte (codeunit) is one character - always
  - each character has only one representation - its byte value
  - casing is a simple mapping between lower and upper case characters - and only about 25% of characters are affected

Unicode has none of these properties:
  - a unicode character (grapheme) can be arbitrarily many codeunits (2-byte quantities) long
  - characters can have multiple representations - e.g. e-acute vs e,combining-acute
  - casing is not (in general) a simple mapping of one codeunit to another

Currently the unicode operations in the engine are largely unoptimized - they assume the general case in all things, so even searching a string for LF (which is the case here) is still done under the assumption that it might need that (very hefty) extra processing.

Of course it would be nice to have highly optimized low-level unicode string operations, but you have to pick your battles (particularly when the battles are incredibly technical ones!). The reality is that this (admittedly large!) speed difference is only really noticeable 'at scale' - and when scale is involved, there's pretty much always an algorithmic change which can make those 'low-level' performance differences irrelevant.

The case here is a really good example.
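For instance, a minimal sketch of the array-based approach measured below (handler and variable names are illustrative) - split the text into a numerically-keyed array once, an O(N) pass, so each subsequent lookup is O(1) rather than a scan from the start of the string:

  on testArrayAccess fff
     local tLines, tFound
     put fff into tLines
     -- one O(N) pass builds an array keyed 1..N
     split tLines by return
     repeat with k = 1 to the number of elements in tLines
        -- array indexing is O(1), so the loop is O(N) overall - for
        -- native and unicode text alike
        get tLines[k]
        put true into tFound
     end repeat
  end testArrayAccess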

The 'line k of' based code gives (no matchChunk / with matchChunk):

  ASCII 300 lines - 13ms / 22ms
  ASCII 3000 lines - 986ms / 1104ms
  ASCII 10000 lines - 10804ms / 11213ms

The array based code gives (no matchChunk / with matchChunk):

  ASCII 300 lines - 2ms / 11ms
  ASCII 3000 lines - 19ms / 101ms
  ASCII 10000 lines - 69ms / 336ms

  UNICODE 300 lines - 7ms / 12ms
  UNICODE 3000 lines - 52ms / 108ms
  UNICODE 10000 lines - 170ms / 359ms

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things
