On May 21, 2:04 am, Paddy <[EMAIL PROTECTED]> wrote:
> On May 20, 1:12 pm, John Machin <[EMAIL PROTECTED]> wrote:
> > On 20/05/2007 8:52 PM, Paddy wrote:
> > > On May 20, 2:16 am, John Machin <[EMAIL PROTECTED]> wrote:
> > >> On 19/05/2007 3:14 PM, Paddy wrote:
> > >>> On May 19, 12:07 am, py_genetic <[EMAIL PROTECTED]> wrote:
> > >>>> Hello,
> > >>>>
> > >>>> I'm importing large text files of data using csv. I would like to add
> > >>>> some more auto sensing abilities. I'm considering sampling the data
> > >>>> file and doing some fuzzy logic scoring on the attributes (cols in a
> > >>>> data base/ csv file, eg. height weight income etc.) to determine the
> > >>>> most efficient 'type' to convert the attribute col into for further
> > >>>> processing and efficient storage...
> > >>>>
> > >>>> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
> > >>>> there' '100,000,000,000'], [next row...] ....]
> > >>>>
> > >>>> Aside from a missing attribute designator, we can assume that the same
> > >>>> type of data continues through a col. For example, a string, int8,
> > >>>> int16, float etc.
> > >>>>
> > >>>> 1. What is the most efficient way in python to test whether a string
> > >>>> can be converted into a given numeric type, or left alone if it's
> > >>>> really a string like 'A' or 'hello'? Speed is key? Any thoughts?
> > >>>>
> > >>>> 2. Is there anything out there already which deals with this issue?
> > >>>>
> > >>>> Thanks,
> > >>>> Conor
> > >>>
> > >>> You might try investigating what can generate your data. With luck,
> > >>> it could turn out that the data generator is methodical and column
> > >>> data-types are consistent and easily determined by testing the
> > >>> first or second row. At worst, you will get to know how much you
> > >>> must check for human errors.
> > >>
> > >> Here you go, Paddy, the following has been generated very methodically;
> > >> what data type is the first column? What is the value in the first
> > >> column of the 6th row likely to be?
> > >>
> > >> "$39,082.00","$123,456.78"
> > >> "$39,113.00","$124,218.10"
> > >> "$39,141.00","$124,973.76"
> > >> "$39,172.00","$125,806.92"
> > >> "$39,202.00","$126,593.21"
> > >>
> > >> N.B. I've kindly given you five lines instead of one or two :-)
> > >>
> > >> Cheers,
> > >> John
> > >
> > > John,
> > > I've had cases where some investigation of the source of the data has
> > > completely removed any ambiguity. I've found that data was generated
> > > from one or two sources and been able to know what every field type is
> > > by just examining a field that I have determined will tell me the
> > > source program that generated the data.
> >
> > The source program that produced my sample dataset was Microsoft Excel
> > (or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
> > operation. Does that help you determine the true nature of the first column?
> >
> > > I have also found that the flow generating some data is subject to
> > > hand editing so have had to both put in extra checks in my reader, and
> > > on some occasions created specific editors to replace hand edits by
> > > checked assisted hand edits.
> > > I stand by my statement; "Know the source of your data", it's less
> > > likely to bite!
> >
> > My dataset has a known source, and furthermore meets your "lucky"
> > criteria (methodically generated, column type is consistent) -- I'm
> > waiting to hear from you about the "easily determined" part :-)
> >
> > Cheers,
> > John
>
> John,
> Open up your Excel spreadsheet and check what the format is for the
> column. It's not a contest. If you KNOW what generated the data then
> USE that knowledge. It would be counter-productive to do otherwise
> surely?
>
> (I know, don't call you Shirley :-)
>
... and I won't call you Patsy more than this once :-)

Patsy, re-read. The scenario is that I don't have the Excel spreadsheet;
I have a CSV file. The format is rather obviously "currency" but that is
not correct. The point is that it was methodically [mis-]produced by a
known source [your criteria] but the correct type of column 1 can't be
determined by inspection of a value or 2.

Yeah, it's not a contest, but I was kinda expecting that you might have
taken first differences of column 1 by now ...

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list
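
The hint about first differences of column 1 can be checked with a short sketch. It strips the currency formatting from the five sample values, takes consecutive differences, and then tries one possible reading of the result: that the numbers are spreadsheet date serials saved with a currency format. That reading is an assumption, not something stated in the thread. The helper names (`strip_currency`, `serial_to_date`) are hypothetical, and the epoch of 1899-12-30 assumes Excel's default 1900 date system.

```python
from datetime import date, timedelta

# Column 1 of the five sample rows, exactly as it appears in the CSV.
raw = ["$39,082.00", "$39,113.00", "$39,141.00", "$39,172.00", "$39,202.00"]


def strip_currency(text):
    """Drop the '$' sign and thousands separators; return a float."""
    return float(text.replace("$", "").replace(",", ""))


values = [strip_currency(s) for s in raw]

# First differences between consecutive rows.
diffs = [int(b - a) for a, b in zip(values, values[1:])]
print(diffs)  # [31, 28, 31, 30]

# Assumption: Excel's 1900 date system, where a serial n (n > 60)
# is n days after 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert a spreadsheet date serial to a datetime.date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


dates = [serial_to_date(v) for v in values]
for d in dates:
    print(d.isoformat())
```

The differences 31, 28, 31, 30 match the lengths of January to April in a non-leap year, and under the stated epoch the values come out as month-end dates from 2006-12-31 to 2007-04-30. None of this is visible from inspecting one or two values, which is the point being argued.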