On 2024-06-28 11:15, Neville Smythe via use-livecode wrote:
In my last epistle I mentioned the repeat loop had only 32 iterations. Much more relevant of course is the inner loop on the number of lines of the data variable fff. In this case fff had 1760 lines. So the total possible number of iterations was around 30000 to 50000, getting up there but still well within LC capability.

I tried operating the loop on just the first 100 lines of fff. The repeats took 0.006 seconds (impressive).

Then just the first 300 lines. Timing was 0.016 seconds, approximately linear increase as expected

Are you sure a linear increase is expected?

Then the first 500 lines. Timing is now 6.43 seconds. Something very odd there, that’s beyond exponential increase.

And on the last 500 lines, timing was 0.135 seconds (Aha !!!)

I take it you tested this by doing the loop where fff was just line 1 to 300, then just line 1 to 500 and then just -500 to -1?

My guess here is that the first 300 lines do not have a unicode character, there is somewhere in the next 200, and there are none in the last 500.

This would seem to point to matchChunk having indigestion over something in the middle of the text data. The data is 98% plain ascii, but quite likely it has a few Unicode characters: I have taken a whole lot of time to convert my legacy app to accept Asian and European Unicode personal names.

All regex matches in LC are Unicode (under the hood) - so thinking this is regex related is a red-herring. Running a regex on Unicode text, is no different from on native text - its just its dealing with 16-bit units rather than 8-bit (regex is generally a linear operation - it takes time proportional, roughly to length of pattern * length of string - well, as long as you don't use any backtracking type features).

The issue here is the assumption that your code is doing something linear...

It *looks* linear because your code is only doing two nested repeat loops - so from the point of view of lines of script its iterating the central loop roughly 'the number of lines of indexList * the number of lines of fff' - however the engine is doing a lot more work.

The central loop is doing (paraphrased);

   repeat with x = 1 to N
     get line x of jjj
   end repeat

This means the engine is searching for a line delimiter not N times, but sum(1, ..., N) times (which is N*(N-1)/2 - i.e. N^2 roughly).

Processing 300 lines will take the engine about 45000 steps, processing 500 lines will take the engine 125000 steps, processing 1760 lines will take the engine > 1,500,000.

This means that a small change in how long it takes to search for a line delimiter (which is the fundamental operation here) makes a big difference to how long doing 1000's of them will take - and searching a string containing unicode is a fair bit slower than searching a string which contains only native characters.

Fortunately there is an easy solution here - use an array so you get random access to the lines of fff:

   split fff by return
   repeat with i=2 to the number of lines of indexList
      put line (i+1) of indexList into which

      put "(^[0-9-]*\t" & which & ")" into regX

      put false into found
      repeat with k=lastFound+1 to the number of elements of fff
         put matchChunk(fff[k],regX,pos1,pos2) into found

         if found then exit repeat
      end repeat

      if found then
         put k into lastFound
      end if

      put lastFound into item i of mylineNumbers
   end repeat

This should do exactly the same thing but SUBSTANTIALLY faster whether your fff variable contains native or unicode characters.

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Reply via email to