I'm using Python to parse metrics out of logfiles. The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.
The metrics are being sent to an InfluxDB database, so it's better if I can send multiple metrics in a batch, rather than sending them individually. Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks", and I then send the collected points to the database:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics, so I'm batching on the number of input lines rather than on the items I send to the database.

My question is: would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off once it reaches a certain size? E.g.:

    BATCH_SIZE = 1000

    with open(args.input_file, 'r') as f:
        json_points = []
        for line_number, line in enumerate(f):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []

Also, I originally used grouper because I thought it would be better to process lines in batches rather than individually. However, is there actually any throughput advantage to doing it this way in Python? Or is there a better way of getting higher throughput? We can assume for now that the CPU load of the processing is fairly light (mainly string splitting and date parsing).
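
For completeness, here's a fuller sketch of that second approach as I'm imagining it (parse_line() and write_points() are just stand-ins for my actual processing and InfluxDB client calls); the only addition is a final flush after the loop so any leftover points still get written at end of file:

    BATCH_SIZE = 1000

    def process_file(path, client, logger):
        json_points = []
        lines_seen = 0
        with open(path, 'r') as f:
            for lines_seen, line in enumerate(f, start=1):
                point = parse_line(line)   # string splitting, date parsing, etc.
                if point is None:          # not every line yields a metric
                    continue
                json_points.append(point)
                if len(json_points) >= BATCH_SIZE:
                    write_points(logger, client, json_points, lines_seen)
                    json_points = []
        # flush whatever is left over once the file is exhausted
        if json_points:
            write_points(logger, client, json_points, lines_seen)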