Geoff Canyon wrote:

> I opened my mouth at work when I shouldn't, and now I'm writing a
> function to process server log files: multi-gigabytes of data, and
> tens of millions of rows of data. Speed optimization will be key...

What sort of processing are you doing on those logs? What are you looking for that you can't get with Google Analytics? And how big is the resulting data, and where does it go: to a DB, another text file, or piped to something else?

I had a similar task a while back and wrote a command that reads in chunks according to a specified buffer size, parsing the buffer by a specified delimiter. It dispatches a callback for each element, so I could use it as a sort of ersatz MapReduce, keeping the element parsing separate from the processing and allowing me to use it in different contexts without having to rewrite the buffering stuff each time.
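
In rough terms the command looks like this (a simplified sketch, not the actual handler; the names are invented and it assumes LF-delimited elements in a single-byte encoding):

on readInChunks pPath, pBufferSize, pCallback
   local tCarry, tBlock, tElement, tDone
   open file pPath for binary read
   repeat
      read from file pPath for pBufferSize
      put (the result is "eof") into tDone
      put tCarry & it into tBlock
      if tDone or (the last char of tBlock is LF) then
         -- everything in the buffer is complete; nothing to carry over
         put empty into tCarry
      else
         -- hold back the partial trailing element for the next pass
         put the last line of tBlock into tCarry
         delete the last line of tBlock
      end if
      -- hand each complete element off to the caller's handler
      repeat for each line tElement in tBlock
         dispatch pCallback to me with tElement
      end repeat
      if tDone then exit repeat
   end repeat
   close file pPath
end readInChunks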

While dispatch is measurably a little faster than send, it still eats more time than processing in-line. For my needs it was a reasonably efficient trade-off: in a test case where the processing callback merely obtains the second item from the element passed to it and appends it to a list, with a read buffer of just 128k it churns through an 845 MB file containing 758,721 elements in under 10 seconds.
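
The test callback amounts to something like this (again a sketch; the field delimiter and file path are stand-ins):

local sSecondItems

on collectSecondItem pElement
   set the itemDelimiter to tab   -- assumes tab-separated log fields
   put item 2 of pElement & LF after sSecondItems
end collectSecondItem

on runTest
   put empty into sSecondItems
   readInChunks "/path/to/logfile", 131072, "collectSecondItem"   -- 128k buffer
end runTest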

But on a collection as large as yours, the benefits of a generalized callback-based approach may be outweighed by the time consumed by dispatch.

Interestingly, I seem to have stumbled across a somewhat counter-intuitive relationship between buffer size and performance. I had expected that the largest possible buffer size would always be fastest, since it makes fewer disk accesses, but apparently the overhead within LC of allocating large blocks of memory offsets some of that gain, with the following results:

Buffer size           Total time
-------------------   ----------
2097152 bytes  (2MB)  10.444 seconds
1048576 bytes  (1MB)  10.284 seconds
 524288 bytes (512k)  10.256 seconds
 262144 bytes (256k)   9.384 seconds
 131072 bytes (128k)   9.274 seconds
  65536 bytes  (64k)   9.312 seconds

These are inexact timings, but the trend is interesting if it's repeatable (these were one-off tests, and I haven't tried them on other machines).
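
If anyone wants to try reproducing this, a timing loop along these lines (using the hypothetical handlers sketched above) is all it takes:

on benchmarkBufferSizes pPath
   local tStart, tSize
   repeat for each item tSize in "2097152,1048576,524288,262144,131072,65536"
      put empty into sSecondItems   -- reset the collected list between runs
      put the milliseconds into tStart
      readInChunks pPath, tSize, "collectSecondItem"
      put tSize && "bytes:" && (the milliseconds - tStart) / 1000 && "seconds" & LF after msg
   end repeat
end benchmarkBufferSizes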

Given your background, you've probably already decided to avoid using "read until cr" for parsing, since of course that requires the engine to examine every character in the stream for the delimiter.

But if you haven't yet done much benchmarking on this, reading as binary is often much faster than reading as text for similar reasons, since binary mode is a raw scrape from disk while text mode translates NULLs and line endings on the fly. In a quick test of my file read handler, using text mode added nearly 30% to the overall time.
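
To put both of those points in concrete terms:

-- delimiter scanning on every read:
open file tPath for text read
read from file tPath until cr

-- vs. fixed-size binary chunks (what the sketch above does):
open file tPath for binary read
read from file tPath for 131072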

A much smaller benefit comes from reading in chunk sizes that are a multiple of the file system's block size. On HFS+, NTFS, and ext4 the default is 4k; many DB engines use multiples of 4k for their internal blocks for this very reason, aligning them with the host file I/O. While the speed difference from aligning to the file system block size in a scripting language is minimal, with a collection as large as yours it may add up.
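
So when choosing a buffer size, something like:

put 32 * 4096 into tBufferSize   -- 128k, an even multiple of the 4k block size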

--
 Richard Gaskin
 Fourth World
 LiveCode training and consulting: http://www.fourthworld.com
 Webzine for LiveCode developers: http://www.LiveCodeJournal.com
 Follow me on Twitter:  http://twitter.com/FourthWorldSys

