John Machin wrote:
> The approach that I've adopted is to test the values in a column for all 
> types, and choose the non-text type that has the highest success rate 
> (provided the rate is greater than some threshold e.g. 90%, otherwise 
> it's text).
> 
> For large files, taking a 1/N sample can save a lot of time with little 
> chance of misdiagnosis.
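
For anyone following along, the approach quoted above might be sketched roughly as follows. This is only an illustration, not John's actual code: the function name `guess_column_type`, the `CHECKS` table, the three candidate types, the ISO date format and the `step` parameter (the N in "1/N sample") are all assumptions made for the example.

```python
from datetime import datetime


def _parses(convert, value):
    """Return True if convert(value) succeeds without a ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


# Hypothetical set of non-text types to test; a real tool would have more.
CHECKS = {
    "int": lambda s: _parses(int, s),
    "float": lambda s: _parses(float, s),
    "date": lambda s: _parses(lambda v: datetime.strptime(v, "%Y-%m-%d"), s),
}


def guess_column_type(values, threshold=0.9, step=1):
    """Pick the non-text type with the highest success rate, else 'text'.

    values    -- list of strings from one column
    threshold -- the winning rate must be greater than this
    step      -- inspect every step-th value, i.e. a 1/N sample with N = step
    """
    sample = values[::step]
    if not sample:
        return "text"
    rates = {
        name: sum(check(v) for v in sample) / len(sample)
        for name, check in CHECKS.items()
    }
    # On a tie, max() keeps the first entry in CHECKS, so an all-integer
    # column is reported as "int" rather than "float".
    best = max(rates, key=rates.get)
    return best if rates[best] > threshold else "text"
```

Something like `guess_column_type(column, step=100)` would then look at only one value in every hundred.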


Why stop there? You could shrink the 1/N sample even further by a 
straightforward application of Bayesian statistics, using results from 
previous tables as priors.
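
A minimal sketch of what that could look like, under loudly stated assumptions: only two hypotheses per column ("typed" versus "text"), a prior taken from how often earlier tables had a typed column in that position, and made-up likelihoods (a typed column yields a parseable value 95% of the time, a text column 5% of the time). The names `prior_from_history` and `inspect_until_confident` are hypothetical. Values are inspected one at a time and sampling stops as soon as the posterior is confident either way.

```python
def prior_from_history(typed_seen, tables_seen):
    """Prior P(column is typed) from previous tables, with add-one smoothing."""
    return (typed_seen + 1) / (tables_seen + 2)


def inspect_until_confident(values, check, prior_typed,
                            p_ok_typed=0.95, p_ok_text=0.05,
                            confidence=0.99):
    """Update P(typed) by Bayes' rule per value; stop early when confident.

    Returns (decision, number_of_values_inspected).
    The two likelihoods are illustrative assumptions, not measured figures.
    """
    p = prior_typed
    n = 0
    for n, value in enumerate(values, start=1):
        ok = check(value)
        like_typed = p_ok_typed if ok else 1 - p_ok_typed
        like_text = p_ok_text if ok else 1 - p_ok_text
        # Posterior = prior * likelihood, normalised over both hypotheses.
        p = p * like_typed / (p * like_typed + (1 - p) * like_text)
        if p >= confidence:
            return "typed", n
        if p <= 1 - confidence:
            return "text", n
    # Ran out of values before reaching the confidence level.
    return ("typed" if p >= 0.5 else "text"), n
```

With these numbers, a column of clean integers checked with `str.isdigit` is settled after two values from a flat prior of 0.5, but after a single value if the prior is 0.9, which is the sense in which a good prior lets you get away with a smaller sample.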


James
-- 
http://mail.python.org/mailman/listinfo/python-list
