Re: Filtering unicode text

David V Glasgow via use-livecode Mon, 29 Jul 2024 00:36:35 -0700

Thanks.  As ever, interesting and helpful.

Cheers


David G

> On 28 Jul 2024, at 1:31 am, Neville Smythe via use-livecode 
> <use-livecode@lists.runrev.com> wrote:
> 
> David Glasgow wrote
> 
>> I have an app I haven't touched for a while that makes heavy use of filter 
>> of string variables up to 1,000,000 lines (but often only hundreds to tens 
>> of thousands of lines).  In my case, finding all lines containing the 
>> to-be-found string is a benefit.
>> 
>> I have long intended to see if I can speed things up a bit.  Should I go 
>> back and look at converting string lists to arrays, then using filter, and 
>> finally converting back to a variable?  I suppose I could do this contingent 
>> upon number of lines just in case time penalties and benefits are not 
>> linear? 
> 
> I can’t say how Mark's technique would scale up to millions or hundreds of 
> thousands of lines, but certainly in my case of around 1500 lines I got a 10x 
> speedup, from an unacceptable 20 minutes to 2 minutes, processing a text file 
> which had suddenly acquired a single character requiring Unicode.
> 
> There is a caveat… I said line 1 of the keys is the first found line. That is 
> not correct. Since arrays are stored in an internally determined way, the 
> lines will be reported in an unpredictable order. So you may need to add a 
> sort overhead. Sort is still fast even for Unicode text, though scaling to 
> millions of lines…I don’t know; hopefully the number of found lines would be 
> small, so that wouldn’t be a problem. Just don’t search for “the”.
> 
> The take-away lesson is to avoid anything which involves repeatedly finding 
> line-endings in Unicode text, even if implicitly [A note to Mark W.: I still 
> think the algorithm for “line k of tText” would be worth making more 
> efficient - as I read your comments it uses a general case search processor 
> for Unicode which has to take account of a large number of possible variants 
> of representations of characters.]
> 
> If your text is plain ASCII or native (or if you could process an ASCII 
> version of the text for finding strings) the benefits of converting to an 
> array may be less striking. But since the speed-up comes from the random 
> access to the found lines I wouldn’t be surprised to find an advantage even 
> there. The implementation of arrays seems to be extraordinarily efficient. 
> [An OT thought just struck me - is that why NoSQL databases as used in AWC 
> work? I know nothing about NoSQL or AWC but if I go on board with Create I 
> may have to learn.]
> 
> 
> Neville Smythe
> 
> 
> 
> 
> 
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription 
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
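
The caveat quoted above (split the text into an array once, filter it, then sort the keys because array order is not guaranteed) can be sketched roughly as follows. This is an illustrative Python sketch, not Mark's actual LiveCode technique: the function name find_lines and the sample text are hypothetical. Python dicts happen to keep insertion order, so the sort is redundant here and is included only to mirror the step a LiveCode array would need.

```python
def find_lines(text, needle):
    """Return (line_number, line) pairs for lines containing needle, in document order."""
    # Split once so each line is addressable by number, instead of
    # re-scanning the text for line endings on every lookup.
    by_number = {n: line for n, line in enumerate(text.split("\n"), start=1)}
    # Keep only the entries whose text contains the search string.
    found = {n: line for n, line in by_number.items() if needle in line}
    # Sort the numeric keys to restore document order (the "sort overhead").
    return [(n, found[n]) for n in sorted(found)]


# Hypothetical sample text containing one non-ASCII character.
sample = "alpha\nthe café\nbeta\nthe end"
hits = find_lines(sample, "the")
```

The cost of the sort depends on the number of found lines rather than the size of the whole text, which is why a very common search string such as "the" is the unfavourable case.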


