On 2024-06-28 11:15, Neville Smythe via use-livecode wrote:
In my last epistle I mentioned the repeat loop had only 32 iterations.
Much more relevant of course is the inner loop on the number of lines
of the data variable fff. In this case fff had 1760 lines. So the total
possible number of iterations was around 30000 to 50000, getting up
there but still well within LC capability.
I tried operating the loop on just the first 100 lines of fff. The
repeats took 0.006 seconds (impressive).
Then just the first 300 lines. Timing was 0.016 seconds, approximately
linear increase as expected
Are you sure a linear increase is expected?
Then the first 500 lines. Timing is now 6.43 seconds. Something very
odd there, that’s beyond exponential increase.
And on the last 500 lines, timing was 0.135 seconds (Aha !!!)
I take it you tested this by doing the loop where fff was just line 1 to
300, then just line 1 to 500 and then just -500 to -1?
My guess here is that the first 300 lines do not have a unicode
character, there is somewhere in the next 200, and there are none in the
last 500.
This would seem to point to matchChunk having indigestion over
something in the middle of the text data. The data is 98% plain ascii,
but quite likely it has a few Unicode characters: I have taken a whole
lot of time to convert my legacy app to accept Asian and European
Unicode personal names.
All regex matches in LC are Unicode (under the hood) - so thinking this
is regex related is a red-herring. Running a regex on Unicode text, is
no different from on native text - its just its dealing with 16-bit
units rather than 8-bit (regex is generally a linear operation - it
takes time proportional, roughly to length of pattern * length of string
- well, as long as you don't use any backtracking type features).
The issue here is the assumption that your code is doing something
linear...
It *looks* linear because your code is only doing two nested repeat
loops - so from the point of view of lines of script its iterating the
central loop roughly 'the number of lines of indexList * the number of
lines of fff' - however the engine is doing a lot more work.
The central loop is doing (paraphrased);
repeat with x = 1 to N
get line x of jjj
end repeat
This means the engine is searching for a line delimiter not N times, but
sum(1, ..., N) times (which is N*(N-1)/2 - i.e. N^2 roughly).
Processing 300 lines will take the engine about 45000 steps, processing
500 lines will take the engine 125000 steps, processing 1760 lines will
take the engine > 1,500,000.
This means that a small change in how long it takes to search for a line
delimiter (which is the fundamental operation here) makes a big
difference to how long doing 1000's of them will take - and searching a
string containing unicode is a fair bit slower than searching a string
which contains only native characters.
Fortunately there is an easy solution here - use an array so you get
random access to the lines of fff:
split fff by return
repeat with i=2 to the number of lines of indexList
put line (i+1) of indexList into which
put "(^[0-9-]*\t" & which & ")" into regX
put false into found
repeat with k=lastFound+1 to the number of elements of fff
put matchChunk(fff[k],regX,pos1,pos2) into found
if found then exit repeat
end repeat
if found then
put k into lastFound
end if
put lastFound into item i of mylineNumbers
end repeat
This should do exactly the same thing but SUBSTANTIALLY faster whether
your fff variable contains native or unicode characters.
Warmest Regards,
Mark.
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode