On 2024-06-29 08:53, Neville Smythe via use-livecode wrote:
> Is it not the case that the processing time for looping over the number
> of lines and getting the k-th line in each iteration, for some
> arbitrary k, is going to be
>
>     O(N^2) * C * T
>
> where
>     N = the number of lines in the text
>     C = the number of codepoints in each line
>     T = the (average) time for processing each codepoint to check for a
>         return character
>
> Now N and C are the same whether the text is ascii or unicode.
Largely, yes - although for stuff like this you need to think in terms
of bytes, not codepoints (memory throughput becomes 'a thing' when the
strings are anything longer than a few characters) - so unicode is
2x ascii in this regard.
[ It's actually more than 2x for longer strings, but how much more depends
on the CPU/memory architecture - CPUs can only read from their level 1
cache, there's a cost to a cache miss, and you get twice as many cache
misses with unicode data as with native data, assuming the data is larger
than a single level 1 cache line. ]
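To make the bytes-vs-codepoints point concrete, here is a quick sketch in Python (rather than LiveCode), assuming native text is stored at one byte per character and unicode at two bytes per codeunit (UTF-16):

```python
# The same text occupies twice as many bytes in a 2-byte-per-codeunit
# encoding such as UTF-16, so a delimiter scan touches twice as much
# memory - and suffers roughly twice as many cache misses.
text = "line one\nline two\nline three\n" * 1000

native_bytes = text.encode("latin-1")    # 1 byte per character (native text)
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per codeunit (unicode)

assert len(utf16_bytes) == 2 * len(native_bytes)
```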
> Test 1
>
> If I get rid of the red herring by replacing the matchChunk call with a
> simple
> ...
> Which would appear to mean that processing unicode codepoints for the
> apparently simple task of checking for return takes 100 times as much
> time as processing ascii. That seems excessive, to put it mildly!
It's a lot slower, certainly - but then searching unicode text for a
string is (in the general case) a lot more complex than searching
native/ascii text for one.
> Test 2
>
> With the original code using matchChunk - which of course is going to
> have its own internal loop over codepoints, so multiply by another 8 (it
> only searches the first few characters), and it will not always return
> true, so more lines must be processed - but again these are the same
> multipliers whether ascii or unicode.
> ...
> Plain ascii takes 0.07 seconds.
> Unicode takes 19.9 seconds, a multiplier of nearly 300. I can easily
> believe matchChunk takes 3 times as long to process unicode as ascii;
> this is the sort of factor I would have expected in Test 1.
So 'Test 2' is slightly misleading, as it still suggests matchChunk is
causing a slowdown - which it isn't.
The difference is that Test 2 does more work because it isn't always
exiting early. If you test:
    get line k of fff
    put true into tFound
I suspect you'll find the time to process your data, if it contains
unicode, is pretty similar to the time when matchChunk is also called.
In my quick test (32 index lines, 200 lines in fff) I get about
10ms (no unicode) vs 1400ms (unicode).
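The early-exit effect can be sketched in Python (rather than LiveCode) with a hypothetical line search; the names and sizes here are illustrative, not from the tests above. The cost of a line-by-line search is dominated by how many lines are scanned before the loop exits, not by the per-line test itself:

```python
# Build a list of lines; searching stops at the first match.
lines = [f"item {i}" for i in range(10_000)]

def lines_scanned(lines, needle):
    """Scan until the first match, returning how many lines were touched."""
    scanned = 0
    for line in lines:
        scanned += 1
        if needle in line:
            break
    return scanned

# A hit near the front touches few lines; a miss touches every line -
# the same per-line test, wildly different total work.
assert lines_scanned(lines, "item 3") == 4          # matches the 4th line
assert lines_scanned(lines, "no such line") == 10_000
```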
> OK Mark, hit me again, I am a glutton for punishment - what is wrong
> with this analysis?
Nothing in particular - apart from thinking that matchChunk is actually
a relevant factor here ;)
This delimiter search operation is so much slower on unicode strings
than on native ones for two reasons:
1) We (well, I) heavily optimized the core native/ascii string
operations in 2015 to make sure they were as fast as possible
2) Searching a unicode string for another string (which is what is
going on here) is much more complex than doing the same for native/ascii
Native/ascii strings have some very pleasant properties:
- one byte (codeunit) is one character - always.
- each character has only one representation - its byte value
- casing is a simple mapping between lower and upper case characters -
and only about 25% of characters are affected
Unicode has none of these properties
- a unicode character (grapheme) can be arbitrarily many codeunits (2
byte quantities) long
- characters can have multiple representations - e.g. e-acute vs
e,combining-acute
- casing is not (in general) a simple mapping of one codeunit to
another
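Both lists can be made concrete with a small sketch in Python (rather than LiveCode) for illustration:

```python
import unicodedata

# Native/ascii casing really is a trivial mapping: upper and lower case
# ascii letters differ by a single bit (0x20).
assert ord("a") ^ 0x20 == ord("A")

# In unicode, the same grapheme can have multiple codeunit sequences:
composed = "\u00e9"       # e-acute as one precomposed codepoint
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

assert composed != decomposed                        # naive comparison fails
assert unicodedata.normalize("NFC", decomposed) == composed

# And unicode casing is not a 1:1 codeunit mapping: German sharp s
# uppercases to two characters.
assert "stra\u00dfe".upper() == "STRASSE"
```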
Currently the unicode operations in the engine are largely unoptimized -
they assume the general case in all things, so even searching a string
for LF (which is all that is happening here) is done under the assumption
that it might need that (very hefty) extra processing.
Of course it would be nice to have highly optimized low-level unicode
string operations, but you have to pick your battles (particularly when
the battles are incredibly technical ones!). The reality is that this
(admittedly large!) speed difference is only really noticeable 'at
scale' - and when scale is involved, there's pretty much always an
algorithmic change which can make those low-level performance
differences irrelevant.
The case here is a really good example.
The line X based code gives (no matchChunk / with matchChunk):

    ASCII 300 lines     -    13ms /    22ms
    ASCII 3000 lines    -   986ms /  1104ms
    ASCII 10000 lines   - 10804ms / 11213ms

The array based code gives (no matchChunk / with matchChunk):

    ASCII 300 lines     -   2ms /  11ms
    ASCII 3000 lines    -  19ms / 101ms
    ASCII 10000 lines   -  69ms / 336ms
    UNICODE 300 lines   -   7ms /  12ms
    UNICODE 3000 lines  -  52ms / 108ms
    UNICODE 10000 lines - 170ms / 359ms
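The algorithmic change behind the array based numbers can be sketched in Python (rather than LiveCode); the helper below is illustrative, mimicking how fetching "line k" by value must re-scan the text for delimiters on every call:

```python
# Quadratic vs linear line access: "line k of text" re-scans the text from
# the start each time, while splitting once into a list makes each
# subsequent lookup O(1).
text = "\n".join(f"line {i}" for i in range(1000))

def line_k_by_scan(text, k):
    """Mimics 'line k of text': scans for delimiters on every call."""
    return text.split("\n")[k - 1]   # each call re-splits the whole text

# Quadratic overall: one full scan per lookup.
slow = [line_k_by_scan(text, k) for k in range(1, 1001)]

# Linear setup plus O(1) lookups: split once, then index.
lines = text.split("\n")
fast = [lines[k - 1] for k in range(1, 1001)]

assert slow == fast == [f"line {i}" for i in range(1000)]
```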
Warmest Regards,
Mark.
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things