Geoff Canyon wrote:
> I opened my mouth at work when I shouldn't have, and now I'm writing a
> function to process server log files: multiple gigabytes of data, and
> tens of millions of rows. Speed optimization will be key...
What sort of processing are you doing on those logs? What are you
looking for that you can't get with Google Analytics? And how big is
the resulting data, and where does it go: to a DB, to another text file,
or piped into something else?
I had a similar task a while back, and wrote a command that reads in
chunks according to a specified buffer size, parsing the buffer by a
specified delimiter. It dispatches a callback for each element, so I can
use it as a sort of ersatz MapReduce, keeping the element parsing
separate from the processing and letting me reuse it in different
contexts without having to rewrite the buffering code each time.
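The shape of it is roughly like this, a simplified sketch rather than the
actual command, with illustrative names and the assumption that elements
are delimited by the default lineDelimiter (LF):

command processLogChunks pFilePath, pBufferSize, pTarget, pCallback
   local tChunk, tCarry, tAtEOF, tLast
   open file pFilePath for binary read
   repeat forever
      read from file pFilePath for pBufferSize chars
      put (the result is "eof") into tAtEOF
      -- prepend any partial element held over from the previous pass
      put tCarry & it into tChunk
      put empty into tCarry
      if tChunk is empty then exit repeat
      if not tAtEOF and the last char of tChunk is not linefeed then
         -- the buffer ends mid-element; hold the fragment for the next pass
         put the number of lines of tChunk into tLast
         put line tLast of tChunk into tCarry
         delete line tLast of tChunk
      end if
      -- hand each complete element to the caller's callback
      repeat for each line tElement in tChunk
         dispatch pCallback to pTarget with tElement
      end repeat
      if tAtEOF then exit repeat
   end repeat
   close file pFilePath
end processLogChunks

Called with something like:

   processLogChunks tPath, 131072, the long id of me, "handleLogElement"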
While dispatch is measurably a little faster than send, it still eats
more time than processing in-line. For my needs it was a reasonably
efficient trade-off: in a test case where the processing callback merely
obtains the second item from the element passed to it and appends it to
a list, with a read buffer size of just 128k it churns through an 845 MB
file containing 758,721 elements in under 10 seconds.
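That test callback amounts to little more than something like this
(hypothetical names again, and assuming tab-delimited fields; set the
itemDelimiter to whatever your log format actually uses):

local sCollected

on handleLogElement pElement
   -- grab the second field and append it to a running list
   set the itemDelimiter to tab
   put item 2 of pElement & linefeed after sCollected
end handleLogElement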
But on a collection as large as your files, the benefits of a
generalized approach using callbacks may be outweighed by the time
consumed by dispatch.
Interestingly, I seem to have stumbled across a bit of a
counter-intuitive relationship between buffer size and performance. I
had expected that using the largest-possible buffer size would always be
faster since it's making fewer disk accesses. But apparently the
overhead within LC to allocate blocks of memory somewhat mitigates that,
with the following results:
Buffer size            Total time
-------------------    --------------
2097152 bytes (2MB)    10.444 seconds
1048576 bytes (1MB)    10.284 seconds
524288 bytes (512k)    10.256 seconds
262144 bytes (256k)     9.384 seconds
131072 bytes (128k)     9.274 seconds
65536 bytes (64k)       9.312 seconds
These are inexact timings, but the trend is interesting if it holds up
(these were one-off tests, and I haven't tried them on multiple
machines).
Given your background, you've probably already decided to avoid using
"read until cr" for parsing, since of course that requires the engine to
examine every character in the stream for the delimiter.
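For comparison, the two forms side by side (tPath standing in for your
log path):

   read from file tPath until cr          -- scans each char for the delimiter
   read from file tPath for 131072 chars  -- grabs a fixed-size block, no scanning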
But in case you haven't yet done much benchmarking on this: reading as
binary is often much faster than reading as text for a similar reason,
since binary mode is a raw scrape from disk while text mode translates
NULLs and line endings on the fly. In a quick test of my file read
handler, using text mode adds nearly 30% to the overall time.
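In code the difference is just the open mode:

   open file tPath for read         -- text mode: translates line endings on the way in
   open file tPath for binary read  -- binary mode: a raw scrape straight from disk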
A much smaller benefit is reading in chunk sizes that are a multiple of
the file system's block size. On HFS+, NTFS, and ext4 the default is 4k;
many DB engines use multiples of 4k for their internal blocks for this
reason, aligning them with the host's file I/O. While the speed
difference from aligning to the file system block size in a scripting
language is minimal, with a collection as large as yours it may add up.
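In script that just means picking a buffer size that's a whole multiple
of 4096, e.g. (reusing the hypothetical names from the sketch above):

   put 32 * 4096 into tBufferSize   -- 128k, a whole multiple of the 4k block size
   processLogChunks tPath, tBufferSize, the long id of me, "handleLogElement"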
--
Richard Gaskin
Fourth World
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
Follow me on Twitter: http://twitter.com/FourthWorldSys