On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders <[EMAIL PROTECTED]>
wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data.
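To make that format concrete (an illustration, not from the original
post): a file containing the three lines

    5
    5 3
    7

would mean 5 occurs four times (once from the bare line, three times
from the pair) and 7 occurs once, i.e. a histogram of {5: 4, 7: 1}.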
...and just for fun this D code is about 3.2 times faster than the
Psyco version for the same dataset (30% lines with a space):

import std.stdio, std.conv, std.string, std.stream;

int[int] get_hist(string file_name) {
    int[int] hist;
    foreach (string line; new BufferedFile(file_name)) {
        // reconstructed tally (the archived snippet breaks off here):
        // a bare number counts once, a "value count" pair adds count
        string[] p = split(line);
        hist[to!(int)(p[0])] += (p.length > 1) ? to!(int)(p[1]) : 1;
    }
    return hist;
}
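(A side note, my addition: int[int] is D's built-in associative array,
and op-assign on a missing key inserts it at int.init, i.e. 0, so it
plays the same role as Python's defaultdict(int) in Matt's version
quoted below.)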
Matt:
> from collections import defaultdict
>
> def get_hist(file_name):
>     hist = defaultdict(int)
>     f = open(file_name, "r")
>     for line in f:
>         vals = line.split()
>         val = int(vals[0])
>         try:   # don't look to see if you will cause an error,
>                # just catch the exception if it happens
>             hist[val] += int(vals[1])
>         except IndexError:
>             hist[val] += 1
>     return hist
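The archive cuts Matt's snippet off mid-comment; the lines from the
except clause down are a reconstruction. Under that assumption, a
minimal usage sketch ("data.txt" is a made-up path):

    hist = get_hist("data.txt")
    for value in sorted(hist):
        print(value, hist[value])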
On Fri, 18 Jan 2008 09:58:57 -0800, Paul Rubin wrote:
> David Sanders <[EMAIL PROTECTED]> writes:
>> The data files are large (~100 million lines), and this code takes a
>> long time to run (compared to just doing wc -l, for example).
>
> wc is written in carefully optimized C and will almost certainly
> run faster than any Python program.
On Fri, 18 Jan 2008 12:06:56 -0600, Tim Chase wrote:
> I don't know how efficient len() is (if it's internally linearly
> counting the items in data, or if it's caching the length as data is
> created/assigned/modified)

It depends on what argument you pass to len(). Lists, tuples and dicts
(and most other built-in containers) store their own length, so calling
len() on them is a constant-time lookup, not a linear count.
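A quick way to check that empirically (my sketch, not from the thread):

    import timeit

    # if len() counted linearly, the million-element list would be
    # ~1000x slower per call; in practice both times come out the same
    for n in (10**3, 10**6):
        t = timeit.timeit("len(data)",
                          setup="data = list(range(%d))" % n,
                          number=10**6)
        print(n, t)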
Tim Chase <[EMAIL PROTECTED]> writes:
> first = int(data[0])
> try:
>     count = int(data[1])
> except:
>     count = 0
By the time you're down to this kind of thing making a difference,
it's probably more important to compile with Pyrex or Psyco.
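For reference, Psyco's usual invocation was a two-liner (a sketch;
Psyco is Python 2 only and long unmaintained):

    import psyco
    psyco.full()            # JIT-compile everything, or instead:
    # psyco.bind(get_hist)  # compile just the hot function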
> for line in file:

The first thing I would try is just doing a

    for line in file:
        pass

to see how much time is consumed merely by iterating over the file.
This should give you a baseline for your timings.

> data = line.split()
> first = int(data[0])
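A minimal way to run that baseline (my sketch; point it at whichever
file you are testing):

    import time

    def iterate_only(file_name):
        # read every line and do nothing else: pure I/O cost
        start = time.time()
        for line in open(file_name):
            pass
        return time.time() - start

    # subtract this from the full histogram run time to see how much
    # is spent in split()/int() rather than in reading the file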
David Sanders <[EMAIL PROTECTED]> writes:
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).

wc is written in carefully optimized C and will almost certainly
run faster than any Python program.

> Am I doing something wrong?