John Machin wrote:
> The model would have to be a lot more complicated than that. There is a
> base number of required columns. The kind suppliers of the data randomly
> add extra columns, randomly permute the order in which the columns
> appear, and, for date columns randomly choose the day-month-year order,
> how much punctuation to sprinkle between the digits, and whether to
> append some bonus extra bytes like " 00:00:00".

I'm going to ignore this because these things have absolutely no effect on the analysis whatsoever. Random order of columns? How could this influence any statistics--counting, Bayesian, or otherwise? And I absolutely do not understand how bonus bytes or any of the above would selectively, adversely affect any single type of statistics: if your converter doesn't recognize a value, then it doesn't recognize it, and it will fail under every circumstance and influence any and all statistical analyses equally. Under such conditions I want very robust analysis--probably more robust than simple counting statistics--and I definitely want something more efficient.

> Past stats on failure to cast are no guide to the future

Not true when using Bayesian statistics (or any type of inference, for that matter). For example, where did you get your 90% cutoff? From experience? I thought past stats were no guide to future expectations?

> ... a sudden change in the failure rate can be caused by the kind folk
> introducing a new null designator i.e. outside the list ['', 'NULL',
> 'NA', 'N/A', '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]

Using the rough model, and having no idea that they threw in a few weird designators--so that you might suspect a 20% failure rate instead of the 2% I modeled previously--the *low probability of false positives* (say, 5% of the non-Int columns evaluate to integer, after you've eliminated dates because you remembered to test the more restrictive types first) would still *drive the statistics*. Remember, the posteriors become priors after the first test.

P_1(H)  = 0.2  (Just a guess; it'll wash out after about 3 tests.)
P(D|H)  = 0.8  (Are you sure they have it together enough to pay you?)
P(D|H') = 0.05 (5% of the names, salaries, etc., evaluate to integer?)
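To make "posteriors become priors" concrete, here is the first update worked out by hand (my arithmetic, using the numbers above). After one value passes the int cast:

P_1(H|D) = P(D|H) P_1(H) / [P(D|H) P_1(H) + P(D|H') P_1(H')]
         = (0.8)(0.2) / [(0.8)(0.2) + (0.05)(0.8)]
         = 0.16 / 0.20
         = 0.8

and that 0.8 becomes the prior for the next value.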
Let's also model failures, since the companies you work with have bad typists. We have to reverse the probabilities for this:

Pf_1(H)  = 0.2  (Only if this is round 1.)
Pf(D|H)  = 0.2  (We *guess* a 20% chance that any Int value fails to cast.)
Pf(D|H') = 0.80 (80% of non-Int values fail--carpal tunnel, ennui, etc.)

You might take issue with Pf(D|H) = 0.2. I encourage you to try a range of values here to see what the posteriors look like. You'll find that it is not as important as the *low false-positive rate*.

For example, let's not stop until we are 99.9% sure one way or the other. With this cutoff, let's suppose this deplorable display of typing integers: pass-fail-fail-pass-pass-pass, which might be expected from the above very pessimistic priors (maybe you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or the _Put_a_Quote_Around_Every_5th_Integer_League_):

P_1(H|D) = 0.800     (pass)
P_2(H|D) = 0.500     (fail)
P_3(H|D) = 0.200     (fail--don't stop, not 99.9% sure)
P_4(H|D) = 0.800     (pass)
P_5(H|D) = 0.9846153 (pass--not there yet)
P_6(H|D) = 0.9990243 (pass--got it!)

Now this is with 5% of all salaries, names of people, addresses, favorite colors, etc., evaluating to integers. (Pausing while I remember fondly Uncle 41572--such a nice guy... funny name, though.)

> There is also the problem of first-time-participating organisations --
> in police parlance, they have no priors :-)

Yes, because they teleported from Alpha Centauri, where organizations are fundamentally different from those here on Earth and we cannot make any reasonable assumptions about them--like that they will indeed cough up money when the time comes, or that they speak a dialect of an Earth language, or that they even generate spreadsheets for us to parse.

James
-- 
http://mail.python.org/mailman/listinfo/python-list
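P.S. For anyone who wants to replay the arithmetic, here is a quick Python sketch (my own code, not from the thread; `update`, `PASS`, and `FAIL` are names I made up) that runs the pass-fail-fail-pass-pass-pass sequence with the priors above:

```python
# Sequential Bayesian updating for H = "this column is Int", where each
# observed value either passes or fails the int cast.

def update(prior, p_d_given_h, p_d_given_not_h):
    # Bayes' rule: P(H|D) = P(D|H)P(H) / (P(D|H)P(H) + P(D|H')P(H'))
    num = p_d_given_h * prior
    return num / (num + p_d_given_not_h * (1.0 - prior))

# Likelihoods from the post: a pass uses P(D|H) = 0.8, P(D|H') = 0.05;
# a fail reverses them to Pf(D|H) = 0.2, Pf(D|H') = 0.80.
PASS = (0.8, 0.05)
FAIL = (0.2, 0.80)

posteriors = []
p = 0.2  # P_1(H): the initial guess; it washes out after a few tests
for outcome in (PASS, FAIL, FAIL, PASS, PASS, PASS):
    p = update(p, *outcome)  # yesterday's posterior is today's prior
    posteriors.append(p)
    print(round(p, 7))
```

The 99.9% cutoff is then just a stopping condition around the loop, e.g. break out once `p >= 0.999` (Int) or `p <= 0.001` (not Int).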