On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:

> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point

OK, well first of all take a step back and look at the problem.

You have n exemplars, each from a known author. You analyse each exemplar and determine some statistics for it. You then take your unknown sample and determine the same statistics for it. Finally, you compare each exemplar's stats with the sample's stats to try to find a best match.

So, perhaps you want a dictionary of { author: statistics }, and a function to analyse a piece of text, which might call other functions to get e.g. avg words / sentence, avg letters / sentence, avg word length, and the sd in each, and the short word ratio (words <= 3 chars vs words >= 4 chars) and some other statistics.

Given the statistics for each exemplar, you might store these in your dictionary as a tuple.

This isn't Python, it's a description of an algorithm; it just looks a bit pythonic:

    # tuple of weightings applied to different stats
    stat_weightings = ( 1.0, 1.3, 0.85, ...... )

    def get_some_stat( t ):
        # calculate some numerical statistic on a block of text
        # return it

    def analyse( f ):
        text = read_file( f )
        return ( get_some_stat( text ), ...... )

    # exemplar_files holds ( author, file ) pairs
    exemplar_data = {}
    for author, exemplar_file in exemplar_files:
        exemplar_data[ author ] = analyse( exemplar_file )

    sample_data = analyse( sample_file )

    # score for a piece of work is sum of ( abs diff of stat * weighting )
    # for all the stats, lower score = closer match
    scores = {}
    for author in keys( exemplar_data ):
        tmp = 0    # start each author's score from zero
        for i in range( len( exemplar_data[ author ] ) ):
            tmp = tmp + abs( exemplar_data[ author ][ i ] - sample_data[ i ] ) * stat_weightings[ i ]
        scores[ author ] = tmp

    # collect the author(s) with the lowest score
    x = min( values( scores ) )
    names = []
    for author in keys( scores ):
        if scores[ author ] == x:
            names.append( author )

    print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and the magic coefficients to use in the stat_weightings.

-- 
Denis McMahon, denismfmcma...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list
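
P.S. For anyone who wants something they can actually run, here is a rough Python 3 sketch of the outline above. Treat it as one possible reading, not a finished tool: the three statistics, the weights, the toy texts, and the names analyse_text, score and best_matches are all my own illustrative assumptions, and the "short word ratio" here is simplified to the fraction of words with 3 or fewer characters. It scores each author by the weighted sum of absolute differences, so lower means closer.

```python
import re
import statistics

# Hypothetical weights, one per statistic returned by analyse_text().
STAT_WEIGHTINGS = (1.0, 1.3, 0.85)

WORD_RE = r"[A-Za-z']+"


def analyse_text(text):
    """Return (avg words per sentence, avg word length, short word fraction)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD_RE, text)
    if not sentences or not words:
        raise ValueError("text has no sentences or words to analyse")
    words_per_sentence = [len(re.findall(WORD_RE, s)) for s in sentences]
    short = sum(1 for w in words if len(w) <= 3)
    return (
        statistics.mean(words_per_sentence),
        statistics.mean(len(w) for w in words),
        short / len(words),
    )


def score(exemplar_stats, sample_stats, weightings=STAT_WEIGHTINGS):
    """Weighted sum of absolute differences; lower means a closer match."""
    return sum(
        abs(e - s) * w
        for e, s, w in zip(exemplar_stats, sample_stats, weightings)
    )


def best_matches(exemplar_texts, sample_text):
    """Return (sorted list of best-matching authors, dict of all scores)."""
    sample_stats = analyse_text(sample_text)
    scores = {
        author: score(analyse_text(text), sample_stats)
        for author, text in exemplar_texts.items()
    }
    lowest = min(scores.values())
    names = sorted(a for a, sc in scores.items() if sc == lowest)
    return names, scores


# Toy data, made up purely for illustration.
exemplars = {
    "author_a": "I came. I saw. I won the day.",
    "author_b": "Notwithstanding considerable difficulties, "
                "the expedition continued onwards.",
}
sample = "He ran. He hid. He got away."

names, scores = best_matches(exemplars, sample)
print("the best matching author(s) is/are:", names)
```

With these toy texts author_a comes out as the closer match, since both it and the sample use short words in short sentences. Real use would need far longer exemplars, more statistics, and weights tuned by experiment.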