On Jan 18, 9:15 am, David Sanders <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am processing large files of numerical data.  Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data generation
> step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs in
> the data, taking into account the repetitions).  My current code is
> as follows:
>
> -------------------
> #!/usr/bin/env python
> # Counts the occurrences of integers in a file and makes a histogram
> # of them
> # Allows for a second field which gives the number of counts of each
> # datum
>
> import sys
> args = sys.argv
> num_args = len(args)
>
> if num_args < 2:
>     print "Syntaxis: count.py archivo"
>     sys.exit()
>
> name = args[1]
> file = open(name, "r")
>
> hist = {}   # dictionary for histogram
> num = 0
>
> for line in file:
>     data = line.split()
>     first = int(data[0])
>
>     if len(data) == 1:
>         count = 1
>     else:
>         count = int(data[1])   # more than one repetition
>
>     if first in hist:   # add the information to the histogram
>         hist[first] += count
>     else:
>         hist[first] = count
>
>     num += count
>
> keys = hist.keys()
> keys.sort()
>
> print "# i  fraction  hist[i]"
> for i in keys:
>     print i, float(hist[i])/num, hist[i]
> ---------------------
>
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
>
> Am I doing something very inefficient?  (Any general comments on my
> pythonic (or otherwise) style are also appreciated!)  Is
> "line.split()" efficient, for example?
>
> Is a dictionary the right way to do this?  In any given file, there
> is an upper bound on the data, so it seems to me that some kind of
> array (numpy?) would be more efficient, but the upper bound changes
> in each file.
My first suggestion is to wrap your code in a function.  Functions run
much faster in Python than module-level code, so that will give you a
speed-up right away.

My second suggestion is to look into using defaultdict for your
histogram.  A dictionary is a very appropriate way to store this data.
There has been some mention of a bag type, which would do exactly what
you need, but unfortunately there is no built-in bag type (yet).

I would write it something like this:

from collections import defaultdict

def get_hist(file_name):
    hist = defaultdict(int)   # missing keys start at 0
    f = open(file_name, "r")
    for line in f:
        vals = line.split()
        val = int(vals[0])
        try:
            # don't look to see if you will cause an error,
            # just cause it and then deal with it
            cnt = int(vals[1])
        except IndexError:
            cnt = 1
        hist[val] += cnt
    f.close()
    return hist

HTH,
Matt
--
http://mail.python.org/mailman/listinfo/python-list
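P.S.  As a quick sanity check of the approach, here is the same loop
exercised on a few in-memory lines (using StringIO in place of a real
file; the sample data is made up for illustration):

```python
from collections import defaultdict
from io import StringIO

def get_hist(f):
    # histogram: value -> total count; missing keys default to 0
    hist = defaultdict(int)
    for line in f:
        vals = line.split()
        val = int(vals[0])
        try:
            # a second field, if present, is a repetition count;
            # fall back to 1 when the line has only one field
            cnt = int(vals[1])
        except IndexError:
            cnt = 1
        hist[val] += cnt
    return hist

# "7" appears once, "3 5" means 3 repeated five times,
# "7 2" adds two more occurrences of 7
sample = StringIO("7\n3 5\n7 2\n")
hist = get_hist(sample)
print(sorted(hist.items()))  # [(3, 5), (7, 3)]
```

Taking a file object (or any iterable of lines) as the argument instead
of a file name also makes the function easy to test without touching
the disk.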