I just can't resist a good benchmark / coding challenge.

Summary - for small data (10s of thousand of lines), they're all fairly quick; for large data sets, there are big performance differences and how far through the data set the "start-with-numeric" lines begin.

I used 10 million lines of data, with 'numeric' staring at either line 10,000 or 9,999,000

simple repeat loop - clear, easy to understand, relatively slow but times vary widely filter without ... - fairly clear, generally a little bit quicker, consistent times binary search using character offset - hard to understand, MUCH faster, consistent times

specifically

repeat       between       4 and  3809 ms
filter          between 2047 and 3602
binary        between      0  and   230

For the binary search, the coding is slightly tricky. You can avoid some of the edge cases by making an assumption that there is a max length for any line (I assumed 5,000 characters, see kBrute) - but if you can't make any such assumption, then you can handle it - just need to handle a couple of special cases. I've left that as an exercise :-)

Here's the code I used   (the data is in global gData)

global gData
on mouseup
   local t1, t2, K

   put the millisecs into t1
   put f1(gData) into t2
   put t2 &&  the millisecs - t1 & CR  after msg

   put the millisecs into t1
   put f2(gData) into t2
   put t2 &&  the millisecs - t1 & CR  after msg

   put the millisecs into t1
   put f3(gData) into t2
   put t2 &&  the millisecs - t1 & CR  after msg

end mouseup

function f1 @pD
   local r
   repeat for each line L in pD
      if char 1 of L is a number then exit repeat
      add 1 to r
   end repeat
   return r + 1
end f1

function f2 pD
   filter pD without "[0-9]*"
   return the number of lines in pD + 1
end f2

function f3 @pD
   local r, tLower, tUpper, tMid, c, t, temp
   constant kBrute = 10000

   put 1 into tLower
   put (the number of chars in pD) + 1 into tUpper

   repeat 10000 times -- more than enough :-)
      if tUpper - tLower < kBrute then exit repeat
      put (tUpper + tlower ) / 2 into tMid
      put char 1 of line 2 of (char tMid to tUpper of pD) into c
      if c is a number then
         put tMid into tUpper
      else
         put tMid into tLower
      end if
   end repeat

   put char tLower to tUpper of pD into temp
   filter temp without "[0-9]*"
   if temp is empty then
      -- no lines start with numeric
      return "No lines"
   else
      put the number of lines in (char  1 to tLower of pD) into t
      add the number of lines in temp to t
      return t
   end if
end f3

-- Alex.




_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Reply via email to