I'm using Python to parse metrics out of logfiles. The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.
The metrics are being sent to an InfluxDB database, so it's better if I can send multiple metrics in a batch, rather than sending them individually. Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks", and I then send the collected points to the database:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics, so I'm batching on the number of input lines rather than on the items I send to the database.

My question is: would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off once it reaches a certain size? E.g.:

    BATCH_SIZE = 1000

    with open(args.input_file, 'r') as f:
        json_points = []
        for line_number, line in enumerate(f):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []

Also, I originally used grouper because I thought it would be better to process lines in batches rather than individually. However, is there actually any throughput advantage to doing it this way in Python? Or is there a better way of getting higher throughput? We can assume for now that the CPU load of the processing is fairly light (mainly string splitting and date parsing).
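
For completeness, here's a fuller sketch of that second approach as I'm imagining it (parse_line() and write_points() are just stand-ins for my actual processing and InfluxDB client calls); the only addition is a final flush after the loop so any leftover points still get written at end of file:

    BATCH_SIZE = 1000

    def process_file(path, client, logger):
        json_points = []
        lines_seen = 0
        with open(path, 'r') as f:
            for lines_seen, line in enumerate(f, start=1):
                point = parse_line(line)   # string splitting, date parsing, etc.
                if point is None:          # not every line yields a metric
                    continue
                json_points.append(point)
                if len(json_points) >= BATCH_SIZE:
                    write_points(logger, client, json_points, lines_seen)
                    json_points = []
        # flush whatever is left over once the file is exhausted
        if json_points:
            write_points(logger, client, json_points, lines_seen)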