John Machin wrote:
> Against that background, please explain to me how I can use
> "results from previous tables as priors".
>
> Cheers,
> John
It depends on how you want to model your probabilities, but, as an example, you might find the following frequencies of columns in all tables you have parsed from this organization: 35% Strings, 25% Floats, 20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have also used prior counting statistics to find that there is a 2% error rate in the columns (2% of the values of a typical Float column fail to cast to Float, 2% of values in Int columns fail to cast to Int, and so on, though these need not all be equal). Let's also say that for non-Int columns, 1% of randomly selected cells cast to Int.

These percentages could be converted to probabilities, and these probabilities could be used as priors in a Bayesian scheme to determine a column type. Let's say you take one cell randomly and it can be cast to an Int. What is the probability that the column is an Int? (See <http://tinyurl.com/2bdn38>.)

Here H is "the column is an Int column", H' is "it is not", and D is "the sampled cell casts to Int", so that

    P(H|D) = P(D|H) P(H) / (P(D|H) P(H) + P(D|H') (1 - P(H)))

    P_1(H)   = 0.20       --> Prior (20% of prior columns are Int columns)
    P(D|H)   = 0.98
    P(D|H')  = 0.01
    P_1(H|D) = 0.9607843  --> Posterior & new prior "P_2(H)"

Now with one test positive for Int, you are getting pretty certain you have an Int column. Now we take a second cell randomly from the same column and find that it too casts to Int.

    P_2(H)   = 0.9607843  --> Confidence it's an Int column from round 1
    P(D|H)   = 0.98
    P(D|H')  = 0.01
    P_2(H|D) = 0.9995836

Yikes! But I'm still not convinced it's an Int because I haven't even had to wait a millisecond to get the answer. Let's burn some more clock cycles.

Let's say we really have an Int column and get "lucky" with our tests (P = 0.98**4 = 92% chance that all four cells cast) and find two more random cells successfully cast to Int:

    P_4(H)   = 0.9999957
    P(D|H)   = 0.98
    P(D|H')  = 0.01
    P_4(H|D) = 0.9999999

I don't know about you, but after only four positives, my calculator ran out of significant digits, and so I am at least 99.99999% convinced it's an Int column and I'm going to stop wasting CPU cycles and move on to test the next column.

How do you know it's not a Float?
Well, given Floats with only one decimal place, you would expect only about one value in ten to cast to Int (were the tenths place to vary randomly). You could generate a similar statistical model to convince yourself with vanishing uncertainty that a column that tests positive for Int four times in a random sample is not actually a Float (even with only one decimal place known).

James
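
For anyone who wants to replay the arithmetic, here is a minimal Python sketch of the sequential updating described above. The function names `update` and `posterior_after` are made up for illustration. The Int-versus-Float comparison at the end makes an extra assumption that is not in the figures above: it treats the column as either Int or one-decimal Float only, renormalizes the 20% and 25% priors to those two cases, and uses 0.10 as the chance that a Float cell casts to Int.

```python
def update(prior, p_pos_if_h, p_pos_if_not_h):
    """One Bayes update after a sampled cell casts to Int (a positive test)."""
    numerator = p_pos_if_h * prior
    return numerator / (numerator + p_pos_if_not_h * (1.0 - prior))


def posterior_after(n_positives, prior, p_pos_if_h=0.98, p_pos_if_not_h=0.01):
    """Apply update() n_positives times, feeding each posterior back in as the next prior."""
    p = prior
    for _ in range(n_positives):
        p = update(p, p_pos_if_h, p_pos_if_not_h)
    return p


if __name__ == "__main__":
    # Int column versus "anything else": prior 0.20, 1% false-positive rate.
    for n in range(1, 5):
        print("Int vs not-Int, %d positives: %.7f" % (n, posterior_after(n, 0.20)))

    # Assumption: only Int (20%) or one-decimal Float (25%) are in play,
    # so the Int prior is renormalized to 0.20 / (0.20 + 0.25).
    int_vs_float_prior = 0.20 / (0.20 + 0.25)
    for n in range(1, 5):
        p = posterior_after(n, int_vs_float_prior, 0.98, 0.10)
        print("Int vs Float,   %d positives: %.7f" % (n, p))
```

With the 1% false-positive rate, the first two rounds reproduce the 0.9607843 and 0.9995836 figures. Under the two-hypothesis Float assumption, each positive is weaker evidence (a likelihood ratio of 9.8 rather than 98), yet four positives in a row still push the Int posterior above 0.999.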