I was wondering if anyone has done a chi-squared test in Python. I wrote two functions that do it (I think... see below), but I do not understand how to interpret the results.

I'm doing an experiment to implement ent in Python. ent tests the randomness of files, and chi squared is probably the best test for this purpose compared to the others. Many of the statistical tests are easy (like Arithmetic Mean, etc.) and I have no problems interpreting the results from those, but chi squared has stumped me.

Here are my two simple functions; run them if you like to better understand the output:

import os
import os.path

def observed(f):
    # argument f is a filepath/filename
    #
    # Return a list of observed characters in decimal ord(char).
    # Decimal value of characters may be 0 through 255.
    # [43, 54, 0, 255, 4, etc.]
    chars = []
    #print f
    fd = open(f, 'rb')
    bytes = fd.read(13312)
    fd.close()
    for byte in bytes:
        chars.append(ord(byte))
    #print chars
    if len(chars) != 13312:
        print "Wait... chars does not equal 13312 in observed!!!"
        return None
    else:
        return chars

def chi(char_list):
    # Expected frequency of characters. I arrived at this like so:
    # expected = number of observations/number of possibilities
    # 52 = 13312/256
    expected = 52.0
    print "observed\texpected\tx2"
    # 0 - 255
    for x in range(0,256):
        observed = 0
        for char in char_list:
            if x == char:
                observed +=1
        # The three chi squared calculations
        # one = observed - expected
        # two = one squared
        # x2 = two/expected
        #
        #      (observed - expected) squared
        # x2 = -----------------------------
        #                expected
        one = observed - expected
        two = one * one
        x2 = two/expected
        print observed, "\t", expected, "\t", x2

chi(observed("filepath"))

The output looks similar to this:

observed    expected    x2
62          52.0        1.92307692308
46          52.0        0.692307692308
60          52.0        1.23076923077
68          52.0        4.92307692308

I know this is a bit off-topic here; I'm just hoping someone could help me interpret the x2 variable. After that, I'll be OK. I need to sum things up to get an overall x2 for the bytes I've read, but before doing that, I wanted to post this note. Please feel free to comment on any aspect of this. If I've got something entirely wrong, let me know.

BTW, I selected 13KB (13,312 bytes) as it seems to be efficient and a decent size to test. The data could be any amount above this, up to and including the whole file.

Thanks,
Tiff
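
P.S. For the summing step I mentioned, here is a minimal sketch of what I have in mind. It assumes the overall x2 is simply the sum of the per-value x2 terms, and it works out expected from the length of the list instead of hard-coding 52.0. The names chi_squared_total and possibilities are made up for this sketch. My understanding, which may well be wrong, is that this total is the figure that gets checked against a chi-squared distribution with 255 degrees of freedom (256 possible values minus one); interpreting it is still the part I need help with.

```python
from collections import Counter

def chi_squared_total(values, possibilities=256):
    # values: a list of ints in range(possibilities), e.g. the list
    # returned by observed() above.
    # chi_squared_total and possibilities are hypothetical names.
    #
    # expected = number of observations / number of possibilities
    expected = float(len(values)) / possibilities
    # Count every value in one pass instead of rescanning the list
    # once per possible value.
    counts = Counter(values)
    total = 0.0
    for x in range(possibilities):
        # Counter gives 0 for values that never occurred.
        diff = counts[x] - expected
        total += diff * diff / expected
    return total
```

It would be called as chi_squared_total(observed("filepath")). A list in which every value 0 through 255 occurs equally often gives a total of 0.0, and the total grows as the counts drift away from expected.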