Re: Efficient processing of large nuumeric data file

2008-01-20 Thread Jorgen Grahn
On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
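The task described in the quoted post (each line holds one integer, or an integer plus a repeat count) can be sketched as a small histogram builder. The function name `get_hist` and the sample data below are illustrative, not taken from the thread:

```python
from collections import defaultdict

def get_hist(lines):
    """Build a histogram from lines holding either a single positive
    integer, or a 'value count' pair (count = repetitions of value)."""
    hist = defaultdict(int)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        value = int(parts[0])
        # A second field, when present, is the repeat count.
        count = int(parts[1]) if len(parts) > 1 else 1
        hist[value] += count
    return hist

sample = ["3", "5 2", "3 4"]
# value 3 contributes 1 + 4 occurrences, value 5 contributes 2
print(dict(get_hist(sample)))
```

Iterating line by line keeps memory use flat no matter how large the input file is, which matters at the ~100 million lines mentioned later in the thread.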

Re: Efficient processing of large nuumeric data file

2008-01-19 Thread David Sanders
On Jan 18, 11:15 am, David Sanders <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the dat

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread bearophileHUGS
...and just for fun this D code is about 3.2 times faster than the Psyco version for the same dataset (30% lines with a space):

import std.stdio, std.conv, std.string, std.stream;

int[int] get_hist(string file_name) {
    int[int] hist;
    foreach(string line; new BufferedFile(file_name)) {

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread bearophileHUGS
Matt:
> from collections import defaultdict
>
> def get_hist(file_name):
>     hist = defaultdict(int)
>     f = open(filename,"r")
>     for line in f:
>         vals = line.split()
>         val = int(vals[0])
>         try: # don't look to see if you will cause an error,
>              # just c

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Steven D'Aprano
On Fri, 18 Jan 2008 09:58:57 -0800, Paul Rubin wrote:
> David Sanders <[EMAIL PROTECTED]> writes:
>> The data files are large (~100 million lines), and this code takes a
>> long time to run (compared to just doing wc -l, for example).
>
> wc is written in carefully optimized C and will almost cer

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Steven D'Aprano
On Fri, 18 Jan 2008 12:06:56 -0600, Tim Chase wrote:
> I don't know how efficient len() is (if it's internally linearly
> counting the items in data, or if it's caching the length as data is
> created/assigned/modifed)

It depends on what argument you pass to len(). Lists, tuples and dicts (and m
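Steven's point, that built-in containers store their length rather than counting items, can be checked with a rough timing sketch (the exact timings are machine-dependent; only the rough equality matters):

```python
import timeit

small = list(range(10))
large = list(range(1_000_000))

# CPython keeps the size in a field of the list object (ob_size),
# so len() is O(1): both calls should take about the same time
# despite the 100,000x difference in list length.
t_small = timeit.timeit(lambda: len(small), number=100_000)
t_large = timeit.timeit(lambda: len(large), number=100_000)

print(t_small, t_large)
```

The same holds for tuples, dicts, sets, and strings; only objects that implement `__len__` by actual counting would behave differently.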

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Paul Rubin
Tim Chase <[EMAIL PROTECTED]> writes:
> first = int(data[0])
> try:
>     count = int(data[1])
> except:
>     count = 0

By the time you're down to this kind of thing making a difference, it's probably more important to compile with pyrex or psyco.
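One detail worth noting about the quoted snippet: the bare `except:` also swallows errors such as `ValueError` from malformed input. A sketch of the narrower form, wrapped in an illustrative helper (`parse_line` is not from the thread), keeping the snippet's `count = 0` default:

```python
def parse_line(line):
    """Parse 'value' or 'value count'; count defaults to 0,
    as in the quoted snippet."""
    data = line.split()
    first = int(data[0])
    try:
        count = int(data[1])
    except IndexError:  # only "second field missing", nothing else
        count = 0
    return first, count

print(parse_line("7"))    # (7, 0)
print(parse_line("7 3"))  # (7, 3)
```

With `except IndexError`, a line like `7 x` still raises a visible `ValueError` instead of being silently treated as an unrepeated value.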

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Tim Chase
> for line in file:

The first thing I would try is just doing a

    for line in file:
        pass

to see how much time is consumed merely by iterating over the file. This should give you a baseline from which you can base your timings.

> data = line.split()
> first = int(data[0])
>
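Tim's baseline suggestion can be packaged as a small timing harness; the function name and the use of `time.perf_counter` are illustrative choices, not part of the original advice:

```python
import time

def iteration_baseline(file_name):
    """Time how long it takes merely to iterate over the file,
    doing no per-line work. Returns (line_count, seconds)."""
    start = time.perf_counter()
    n = 0
    with open(file_name) as f:
        for line in f:
            n += 1  # `pass` would do; counting lines costs almost nothing
    return n, time.perf_counter() - start
```

Comparing this number against the full run shows how much of the total is raw I/O and iteration, i.e. the part no amount of per-line tuning can remove.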

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Paul Rubin
David Sanders <[EMAIL PROTECTED]> writes:
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).

wc is written in carefully optimized C and will almost certainly run faster than any python program.

> Am I doing someth

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread Matimus
On Jan 18, 9:15 am, David Sanders <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data

Re: Efficient processing of large nuumeric data file

2008-01-18 Thread George Sakkis
On Jan 18, 12:15 pm, David Sanders <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the da