On May 20, 2:16 am, John Machin <[EMAIL PROTECTED]> wrote:
> On 19/05/2007 3:14 PM, Paddy wrote:
> > On May 19, 12:07 am, py_genetic <[EMAIL PROTECTED]> wrote:
> >> Hello,
> >
> >> I'm importing large text files of data using csv. I would like to add
> >> some more auto sensing abilities. I'm considing sampling the data
> >> file and doing some fuzzy logic scoring on the attributes (colls in a
> >> data base/ csv file, eg. height weight income etc.) to determine the
> >> most efficient 'type' to convert the attribute coll into for further
> >> processing and efficient storage...
> >
> >> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
> >> there' '100,000,000,000'], [next row...] ....]
> >
> >> Aside from a missing attribute designator, we can assume that the same
> >> type of data continues through a coll. For example, a string, int8,
> >> int16, float etc.
> >
> >> 1. What is the most efficient way in python to test weather a string
> >> can be converted into a given numeric type, or left alone if its
> >> really a string like 'A' or 'hello'? Speed is key? Any thoughts?
> >
> >> 2. Is there anything out there already which deals with this issue?
> >
> >> Thanks,
> >> Conor
> >
> > You might try investigating what can generate your data. With luck,
> > it could turn out that the data generator is methodical and column
> > data-types are consistent and easily determined by testing the
> > first or second row. At worst, you will get to know how much you
> > must check for human errors.
>
> Here you go, Paddy, the following has been generated very methodically;
> what data type is the first column? What is the value in the first
> column of the 6th row likely to be?
>
> "$39,082.00","$123,456.78"
> "$39,113.00","$124,218.10"
> "$39,141.00","$124,973.76"
> "$39,172.00","$125,806.92"
> "$39,202.00","$126,593.21"
>
> N.B. I've kindly given you five lines instead of one or two :-)
>
> Cheers,
> John

John,

I've had cases where some investigation of the source of the data has completely removed any ambiguity. I've found that the data was generated by one or two sources, and I have been able to work out every field's type just by examining one field that I had determined would tell me which program generated the data.

I have also found that the flow generating some data is subject to hand editing, so I have had to put extra checks in my reader and, on some occasions, create specific editors that replace hand edits with checked, assisted hand edits.

I stand by my statement: "Know the source of your data". It's less likely to bite!

- Paddy.

-- 
http://mail.python.org/mailman/listinfo/python-list
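
For anyone who wants something concrete, here is a minimal sketch of the approach described above: one field identifies the generating program, that picks a known set of column types, and the reader checks every row in case it was hand edited. Everything specific in it is an assumption made up for illustration: the source names "ledger" and "survey", their column layouts, the `money` helper, and the idea that the first field names the source. It is not the reader I actually used.

```python
import csv
import io
from decimal import Decimal


def money(text):
    """Parse a currency-formatted string such as '$39,082.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


# Hypothetical layouts: source program name -> one converter per remaining column.
LAYOUTS = {
    "ledger": (money, money),
    "survey": (int, float, str),
}


def read_typed(lines):
    """Yield (source, values) for each row, raising ValueError on rows that fail a check."""
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines
        source, fields = row[0], row[1:]
        converters = LAYOUTS.get(source)
        if converters is None:
            raise ValueError(f"row {row_number}: unknown source {source!r}")
        if len(fields) != len(converters):
            raise ValueError(
                f"row {row_number}: expected {len(converters)} fields, got {len(fields)}"
            )
        try:
            values = [convert(field) for convert, field in zip(converters, fields)]
        except (ValueError, ArithmeticError) as exc:
            # Decimal raises InvalidOperation (an ArithmeticError); int/float raise ValueError.
            raise ValueError(f"row {row_number}: bad field in {fields!r}") from exc
        yield source, values


sample = io.StringIO(
    'ledger,"$39,082.00","$123,456.78"\n'
    "survey,8,2.33,hello there\n"
)
rows = list(read_typed(sample))
```

Because the types come from a lookup on the source field rather than from guessing per value, a hand-edited row shows up as an error with its row number instead of silently becoming a string column. Note that this only tells you how to parse a field, not what it means; that still comes from knowing the source.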