On Apr 21, 4:32 pm, Jon Clements <jon...@googlemail.com> wrote:
> On Apr 21, 5:40 pm, nn <prueba...@latinmail.com> wrote:
>
> > time head -1000000 myfile >/dev/null
> >
> > real 0m4.57s
> > user 0m3.81s
> > sys 0m0.74s
> >
> > time ./repnullsalt.py '|' myfile
> > 0 1 Null columns:
> > 11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68
> >
> > real 1m28.94s
> > user 1m28.11s
> > sys 0m0.72s
> >
> > import sys
> > def main():
> >     with open(sys.argv[2],'rb') as inf:
> >         limit = sys.argv[3] if len(sys.argv)>3 else 1
> >         dlm = sys.argv[1].encode('latin1')
> >         nulls = [x==b'' for x in next(inf)[:-1].split(dlm)]
> >         enum = enumerate
> >         split = bytes.split
> >         out = sys.stdout
> >         prn = print
> >         for j, r in enum(inf):
> >             if j%1000000==0:
> >                 prn(j//1000000,end=' ')
> >                 out.flush()
> >                 if j//1000000>=limit:
> >                     break
> >             for i, cur in enum(split(r[:-1],dlm)):
> >                 nulls[i] |= cur==b''
> >     print('Null columns:')
> >     print(', '.join(str(i+1) for i,val in enumerate(nulls) if val))
> >
> > if not (len(sys.argv)>2):
> >     sys.exit("Usage: "+sys.argv[0]+
> >              " <delimiter> <filename> <limit>")
> >
> > main()
>
> What's with the aliasing enumerate and print??? And on heavy disk IO I
> can hardly see that name lookups are going to be any problem at all?
> And why the time stats with /dev/null ???
>
> I'd probably go for something like:
>
> import csv
>
> with open('somefile') as fin:
>     nulls = set()
>     for row in csv.reader(fin, delimiter='|'):
>         nulls.update(idx for idx,val in enumerate(row, start=1) if not val)
> print 'nulls =', sorted(nulls)
>
> hth
> Jon

Thanks, Jon.

Aliasing is a common method to avoid extra name lookups. The timing for
head gives the pure I/O time: of the roughly 88 seconds the Python
program takes, only about 5 seconds are due to I/O, so there is quite a
bit of overhead.

I ended up with the version below. It is not super fast, so I probably
won't be running it against all 350 million rows of my file, but it is
faster than before:

time head -1000000 myfile |./repnulls.py
nulls = [11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68]

real 0m49.95s
user 0m53.13s
sys 0m2.21s

import sys

def main():
    # Read raw bytes from stdin so no time is spent decoding text.
    fin = sys.stdin.buffer
    # Delimiter is the optional first argument; defaults to '|'.
    dlm = sys.argv[1].encode('latin1') if len(sys.argv)>1 else b'|'
    nulls = set()
    # Collect the 1-based index of every column that is empty in any row.
    # row[:-1] drops the trailing newline before splitting.
    nulls.update(i for row in fin
                 for i, val in enumerate(row[:-1].split(dlm), start=1)
                 if not val)
    print('nulls =', sorted(nulls))

main()
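
For anyone who wants to try the same scan on a small in-memory sample first, here is a minimal sketch of it as a reusable function. The name null_columns and the sample data are made up for illustration and are not part of the script above. It also strips line endings with rstrip instead of row[:-1]. That is a deliberate variation, so that a final line without a trailing newline does not lose its last character.

```python
import io

def null_columns(lines, dlm=b'|'):
    """Return the sorted 1-based indexes of columns that are empty in any row.

    `lines` is any iterable of bytes lines, e.g. a file opened in 'rb' mode
    or sys.stdin.buffer.
    """
    nulls = set()
    for row in lines:
        # Strip only line-ending characters, then split on the delimiter.
        fields = row.rstrip(b'\r\n').split(dlm)
        nulls.update(i for i, val in enumerate(fields, start=1) if not val)
    return sorted(nulls)

# Hypothetical sample: three rows of four pipe-delimited columns.
sample = io.BytesIO(b"a||c|d\nb|x||d\nc|y|z|d")
print('nulls =', null_columns(sample))   # prints: nulls = [2, 3]
```

Passing sys.stdin.buffer as `lines` should behave like the piped version above, apart from the line-ending handling.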