Elaina Ann Hyde wrote:
> So, Python question of the day: I have 2 files that I could normally just
> read in with asciitable. The first file is a 12-column, 8000-row table
> that I have read in via asciitable and manipulated. The second file is
> enormous: over 50,000 rows and about 20 columns. What I want to do is
> find the best match for (file 1, columns 1 and 2) with (file 2, columns 4
> and 5), return all matching rows from the huge file, join them together,
> and save the whole mess as a file with 8000 rows (assuming the smaller
> table finds one match per row) and 32 = 12 + 20 columns. My read code so
> far is as follows:
> -------------------------------------------------
> import sys
> import asciitable
> import matplotlib
> import scipy
> import numpy as np
> from numpy import *
> import math
> import pylab
> import random
> from pylab import *
> import astropysics
> import astropysics.obstools
> import astropysics.coords
>
> x = small_file
> # cannot read blank values (string!); if blank, insert -999.99
> dat = asciitable.read(x, Reader=asciitable.CommentedHeader,
>                       fill_values=['', '-999.99'])
> y = large_file
> fopen2 = open('cfile2match.list', 'w')
> dat2 = asciitable.read(y, Reader=asciitable.CommentedHeader,
>                        fill_values=['', '-999.99'])
> # here are the 2 values for the small file
> Radeg = dat['ra-drad'] * 180. / math.pi
> Decdeg = dat['dec-drad'] * 180. / math.pi
>
> # here are the 2 values for the large file
> Radeg2 = dat2['ra-drad'] * 180. / math.pi
> Decdeg2 = dat2['dec-drad'] * 180. / math.pi
>
> for i in xrange(len(Radeg)):
>     for j in xrange(len(Radeg2)):
>         # select the value if it is very, very, very close
>         if (Radeg[i] <= (Radeg2[j] + 0.000001) and
>                 Radeg[i] >= (Radeg2[j] - 0.000001) and
>                 Decdeg[i] <= (Decdeg2[j] + 0.000001) and
>                 Decdeg[i] >= (Decdeg2[j] - 0.000001)):
>             fopen2.write(" ".join([str(k) for k in list(dat[i])]) + " " +
>                          " ".join([str(k) for k in list(dat2[j])]) + "\n")
> -------------------------------------------
> Now this is where I had to stop; this is way, way too long and messy. I
> did a similar approach with smaller files (9000 lines each) and it
> worked, but took a while. The problem here is that I am going to have to
> play with the match range to return the best result and give only one
> (1!) match per row for my smaller file, i.e. row 1 of the small file must
> match only 1 row of the large file... then I just need to return them
> both. However, it isn't clear to me that this is the best way forward. I
> have been changing the xrange to low values to play with the matching,
> but I would appreciate any ideas. Thanks
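For reference, the four chained comparisons in the quoted loop collapse to two abs() checks. Here is a minimal, self-contained sketch of that threshold match; the coordinate pairs are made-up stand-ins for the real (Radeg, Decdeg) and (Radeg2, Decdeg2) arrays, which would come from the asciitable reads:

```python
# Made-up stand-in coordinates (ra, dec) in degrees; in the real
# script these come from the small and large asciitable tables.
small = [(10.0, -5.0), (20.5, 3.2)]
large = [(10.0000005, -5.0000002), (99.0, 1.0), (20.5000003, 3.2000001)]

eps = 0.000001  # matching tolerance in degrees

pairs = []
for i, (ra1, dec1) in enumerate(small):
    for j, (ra2, dec2) in enumerate(large):
        # abs() replaces the four <= / >= comparisons in the original
        if abs(ra1 - ra2) <= eps and abs(dec1 - dec2) <= eps:
            pairs.append((i, j))

print(pairs)  # index pairs (small row, large row) that fall within eps
```

Note this still allows zero or several matches per small row; picking exactly one best match needs a distance, as below.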
If you calculate the distance instead of checking whether it's under a
certain threshold, you are guaranteed to get (one of the) best matches.
Pseudo-code:

    from functools import partial

    big_rows = read_big_file_into_memory()

    def distance(small_row, big_row):
        ...

    for small_row in read_small_file():
        best_match = min(big_rows, key=partial(distance, small_row))
        write_to_result_file(best_match)

As to the actual implementation of the distance() function, I don't
understand your problem description (two columns in the first, three in
the second -- how does that work?), but generally

    a, c = extract_columns_from_small_row(small_row)
    b, d = extract_columns_from_big_row(big_row)
    if (a <= b + eps) and (c <= d + eps):
        # it's good

would typically become

    def distance(small_row, big_row):
        a, c = extract_columns_from_small_row(small_row)
        b, d = extract_columns_from_big_row(big_row)
        x = a - b
        y = c - d
        return math.sqrt(x*x + y*y)

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
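P.S. Fleshing the pseudo-code above out into something runnable might look like this; the row tuples are made-up stand-ins for the real table rows, and the "columns" extracted are simply the two coordinates:

```python
import math
from functools import partial

# Made-up stand-in rows; in the real script these would hold the
# (ra, dec) values pulled out of the small and large tables.
small_rows = [(10.0, -5.0), (20.5, 3.2)]
big_rows = [(10.0000005, -5.0000002), (99.0, 1.0), (20.5000003, 3.2000001)]

def distance(small_row, big_row):
    # Euclidean distance in the two matched columns.
    a, c = small_row
    b, d = big_row
    return math.hypot(a - b, c - d)

matches = []
for small_row in small_rows:
    # min() with a key function guarantees exactly one best match
    # per small row, which is the "one (1!) match" requirement.
    best_match = min(big_rows, key=partial(distance, small_row))
    matches.append(best_match)

print(matches)
```

This is O(len(small) * len(big)) like the nested loop, but with no threshold to tune; each small row simply takes the nearest big row.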