On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am processing large files of numerical data.  Each line is either
> a single (positive) integer, or a pair of positive integers, where
> the second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data
> generation step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs
> in the data, taking into account the repetitions).  My current code
> is as follows:
> ...
>
> The data files are large (~100 million lines), and this code takes
> a long time to run (compared to just doing wc -l, for example).

I don't know if you are in control of the *generation* of the data,
but I think it's often better and more convenient to pipe the raw
data through 'gzip -c' (i.e. gzip-compress it before it hits the
disk) than to figure out a smart application-specific compression
scheme.

If you didn't have a homegrown file format, there might be ready-made
histogram utilities you could use -- or at least a good reason to
spend the time writing an optimized C version.

/Jorgen

-- 
  // Jorgen Grahn <grahn@       Ph'nglui mglw'nafh Cthulhu
\X/   snipabacken.se>           R'lyeh wgah'nagl fhtagn!
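For what it's worth, a rough sketch of the combination I mean: the
generation step writes through 'gzip -c' (e.g. './generate | gzip -c
> data.gz'), and the histogram pass reads the compressed stream back
with Python's gzip module, tallying into a collections.Counter.
This is Python 3; the names './generate' and 'data.gz' are made up,
and the handling of the optional second field is just my reading of
the format described above -- it is not the original poster's
(elided) code:

import gzip
from collections import Counter

def histogram(filename):
    """Tally how often each integer occurs, honouring the optional
    repetition count in the second column."""
    counts = Counter()
    with gzip.open(filename, 'rt') as f:   # text mode over gzip
        for line in f:
            fields = line.split()
            if not fields:                 # skip blank lines
                continue
            value = int(fields[0])
            # A second field, if present, is the repetition count.
            repeat = int(fields[1]) if len(fields) > 1 else 1
            counts[value] += repeat
    return counts

if __name__ == '__main__':
    for value, count in sorted(histogram('data.gz').items()):
        print(value, count)

With ~100 million lines the per-line split() and int() work will
still dominate the run time, so this mostly buys smaller files and a
standard format; the optimized C version mentioned above is probably
where the real speedup would come from.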