Thanks. As ever, interesting and helpful. Cheers
David G

> On 28 Jul 2024, at 1:31 am, Neville Smythe via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
> David Glasgow wrote
>
>> I have an app I haven’t touched for a while that makes heavy use of filter
>> of string variables up to 1,000,000 lines (but often only hundreds to tens
>> of thousands of lines). In my case finding all lines containing the
>> to-be-found string is a benefit
>>
>> I have long intended to see if I can speed things up a bit. Should I go
>> back and look at converting string lists to arrays, then using filter, and
>> finally converting back to a variable? I suppose I could do this contingent
>> upon number of lines just in case time penalties and benefits are not
>> linear?
>
> I can’t say how Mark's technique would scale up to millions or hundreds of
> thousands of lines, but certainly in my case of around 1500 lines I got a 10x
> speedup, from an unacceptable 20 minutes to 2 minutes, processing a text file
> which had suddenly acquired a single character requiring Unicode.
>
> There is a caveat… I said line 1 of the keys is the first found line. That is
> not correct. Since arrays are stored in an internally determined way, the
> lines will be reported in an unpredictable order. So you may need to add a
> Sort overhead. Sort is still fast even for Unicode text, though scaling to
> millions of lines…I don’t know; hopefully the number of found lines would be
> small, so that wouldn’t be a problem. Just don’t search for “the”.
>
> The take-away lesson is to avoid anything which involves recursively finding
> line-endings in Unicode text, even if implicitly [A note to Mark W.: I still
> think the algorithm for “line k of tText” would be worth making more
> efficient - as I read your comments it uses a general case search processor
> for Unicode which has to take account of a large number of possible variants
> of representations of characters.]
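
The cost difference described above can be sketched in a few lines. This is Python rather than LiveCode, and it only illustrates the access pattern: fetching "line k" by scanning from the start of the text on every call, versus splitting once and looking lines up by key. The names line_by_scan, index_lines and find_lines are made up for the illustration, and the sketch does not model whatever extra work the LiveCode engine does for Unicode text.

```python
from typing import Dict, List


def line_by_scan(text: str, k: int) -> str:
    """Return line k (1-based) by walking past k-1 line endings.

    Every call starts again from the beginning of the text, so a loop
    that asks for each line in turn does quadratic work overall.
    """
    start = 0
    for _ in range(k - 1):
        nl = text.find("\n", start)
        if nl == -1:
            return ""  # fewer than k lines
        start = nl + 1
    end = text.find("\n", start)
    return text[start:] if end == -1 else text[start:end]


def index_lines(text: str) -> Dict[int, str]:
    """Split the text once and key each line by its 1-based line number."""
    return {i: line for i, line in enumerate(text.split("\n"), start=1)}


def find_lines(lines: Dict[int, str], needle: str) -> List[int]:
    """Line numbers of all lines containing needle.

    The explicit sort stands in for the Sort step mentioned above,
    for the case where keys come back in no particular order.
    """
    return sorted(k for k, line in lines.items() if needle in line)
```

With the lines indexed once, each lookup such as lines[k] is a direct access, which is where the speed-up is said to come from; the sort is only needed over the (hopefully small) set of found line numbers.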
>
> If your text is plain ascii or native (or if you could process an ascii
> version of the text for finding strings) the benefits of converting to an
> array may be less striking. But since the speed-up comes from the random
> access to the found lines I wouldn’t be surprised to find an advantage even
> there. The implementation of arrays seems to be extraordinarily efficient.
> [An OT thought just struck me - is that why NoSQL databases as used in AWC
> work? I know nothing about NoSQL or AWC but if I go on board with Create I
> may have to learn.]
>
>
> Neville Smythe
>
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode