Filtering unicode text

Neville Smythe via use-livecode Sat, 27 Jul 2024 17:33:43 -0700

David Glasgow wrote

> I have an app I haven't touched for a while that makes heavy use of filter of 
> string variables up to 1,000,000 lines (but often only hundreds to tens of 
> thousands of lines).  In my case finding all lines containing the to-be-found 
> string is a benefit.
> 
> I have long intended to see if I can speed things up a bit.  Should I go back 
> and look at converting string lists to arrays, then using filter, and finally 
> converting back to a variable?  I suppose I could do this contingent upon 
> number of lines just in case time penalties and benefits are not linear?


I can’t say how Mark’s technique would scale up to hundreds of thousands or 
millions of lines, but in my case of around 1500 lines I got a 10x 
speedup, from an unacceptable 20 minutes to 2 minutes, processing a text file 
which had suddenly acquired a single character requiring Unicode.

There is a caveat… I said line 1 of the keys is the first found line. That is 
not correct. Since arrays are stored in an internally determined way, the 
lines will be reported in an unpredictable order, so you may need to add the 
overhead of a sort. Sort is still fast even for Unicode text, though how it 
scales to millions of lines I don’t know; hopefully the number of found lines 
would be small, so that wouldn’t be a problem. Just don’t search for “the”.
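
For anyone wanting to try this, here is a minimal sketch of how I understand 
the array approach, with the sort added. The handler name filterLinesViaArray 
and the parameters pText and pNeedle are made up for illustration, and this is 
my reading of the technique rather than Mark's actual code. It assumes 
LiveCode 8 or later (for "filter elements of") and that pNeedle contains no 
wildcard characters such as * or ?.

```livecode
-- Hypothetical helper: return the lines of pText that contain pNeedle,
-- in their original order, without repeatedly indexing "line k of pText".
function filterLinesViaArray pText, pNeedle
   local tLines, tKeys, tResult

   -- One pass over the text: tLines[1] .. tLines[n] hold the lines
   put pText into tLines
   split tLines by return

   -- Keep only the elements matching the wildcard pattern;
   -- the surviving keys are the original line numbers
   filter elements of tLines with ("*" & pNeedle & "*")

   -- The keys come back in no guaranteed order, so sort them numerically
   put the keys of tLines into tKeys
   sort lines of tKeys ascending numeric

   -- Rebuild the result by random access into the array
   repeat for each line tKey in tKeys
      put tLines[tKey] & return after tResult
   end repeat
   delete the last char of tResult -- drop the trailing return

   return tResult
end filterLinesViaArray
```

If the order of the found lines doesn't matter, the sort can be dropped, and 
the sorted keys on their own give you the matching line numbers.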

The take-away lesson is to avoid anything which involves repeatedly finding 
line endings in Unicode text, even if implicitly. [A note to Mark W.: I still 
think the algorithm for “line k of tText” would be worth making more efficient. 
As I read your comments, it uses a general-case search processor for Unicode 
which has to take account of a large number of possible variant 
representations of characters.]
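
To make the point concrete, here is a rough sketch of a loop that never asks 
for “line k of tText”. The handler name matchingLineNumbers and its parameters 
are hypothetical. The idea is that "repeat for each line" walks the text once, 
whereas "repeat with k = 1 to n" combined with "line k of tText" has to locate 
line k again on every iteration.

```livecode
-- Hypothetical helper: return the numbers of the lines of pText that
-- contain pNeedle, one per line, in ascending order.
function matchingLineNumbers pText, pNeedle
   local tLineNumber, tHits
   put 0 into tLineNumber

   -- Single pass: the engine hands us each line in turn,
   -- so no line has to be searched for by its index
   repeat for each line tLine in pText
      add 1 to tLineNumber
      if tLine contains pNeedle then
         put tLineNumber & return after tHits
      end if
   end repeat
   delete the last char of tHits -- drop the trailing return

   return tHits
end matchingLineNumbers
```

Note that "contains" follows the caseSensitive property, which is false by 
default, so this matches regardless of case unless you set it otherwise.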

If your text is plain ASCII or native (or if you could process an ASCII version 
of the text for finding strings) the benefits of converting to an array may be 
less striking. But since the speed-up comes from the random access to the found 
lines, I wouldn’t be surprised to find an advantage even there. The 
implementation of arrays seems to be extraordinarily efficient. [An OT thought 
just struck me - is that why NoSQL databases as used in AWC work? I know 
nothing about NoSQL or AWC but if I go on board with Create I may have to 
learn.]

 
Neville Smythe




