On May 18, 7:07 pm, py_genetic <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm importing large text files of data using csv. I would like to add
> some more auto-sensing abilities. I'm considering sampling the data
> file and doing some fuzzy logic scoring on the attributes (columns in a
> database/csv file, e.g. height, weight, income, etc.) to determine the
> most efficient 'type' to convert the attribute column into for further
> processing and efficient storage...
>
> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
> there' '100,000,000,000'], [next row...] ....]
>
> Aside from a missing attribute designator, we can assume that the same
> type of data continues through a column. For example, a string, int8,
> int16, float etc.
>
> 1. What is the most efficient way in python to test whether a string
> can be converted into a given numeric type, or left alone if it's
> really a string like 'A' or 'hello'? Speed is key? Any thoughts?
>
> 2. Is there anything out there already which deals with this issue?

There are several replies to your immediate column type-guessing
problem, so I'm not going to address that. Once you decide the
converters for each column, you have to pass the dataset through them
(and optionally rearrange or omit some of them). That's easy to
hardcode for a few datasets with the same or similar structure but it
soon gets tiring.

I had a similar task recently so I wrote a general and efficient (at
least as far as pure python goes) row transformer that does the
repetitive work. Below are some examples from an IPython session; let
me know if this might be useful and I'll post it here or at the
Cookbook.

George

#======= RowTransformer examples ============================

In [1]: from transrow import RowTransformer

In [2]: rows = [row.split(',') for row in "1,3.34,4-3.2j,John", "4,4,4,4", "0,-1.1,3.4,None"]

In [3]: rows
Out[3]:
[['1', '3.34', '4-3.2j', 'John'],
 ['4', '4', '4', '4'],
 ['0', '-1.1', '3.4', 'None']]

# adapt the first three columns; the rest are omitted
In [4]: for row in RowTransformer([int,float,complex])(rows):
   ...:     print row
   ...:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4.0, (4+0j)]
[0, -1.1000000000000001, (3.3999999999999999+0j)]

# return the 2nd column as float, followed by the 4th column as is
In [5]: for row in RowTransformer({1:float, 3:None})(rows):
   ....:     print row
   ....:
[3.3399999999999999, 'John']
[4.0, '4']
[-1.1000000000000001, 'None']

# return the 3rd column as complex, followed by the 1st column as int
In [6]: for row in RowTransformer([(2,complex),(0,int)])(rows):
   ....:     print row
   ....:
[(4-3.2000000000000002j), 1]
[(4+0j), 4]
[(3.3999999999999999+0j), 0]

# return the first three columns, adapted by eval()
# XXX: use eval() only for trusted data
In [7]: for row in RowTransformer(include=range(3), default_adaptor=eval)(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]

# equivalent to the previous
In [8]: for row in RowTransformer(default_adaptor=eval, exclude=[3])(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]
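
The transrow module itself is not included in this post. For anyone who wants to try the idea in the meantime, here is a rough sketch of a transformer that accepts the same kinds of specs as the session above. It is not the actual RowTransformer code: the name SimpleRowTransformer, the ascending column order for dict specs, and the per-row handling of exclude are all assumptions made for illustration.

```python
class SimpleRowTransformer(object):
    """Illustrative stand-in; NOT the transrow.RowTransformer implementation."""

    def __init__(self, adaptors=None, include=None, exclude=(),
                 default_adaptor=None):
        # Normalize every supported spec into a list of (column, adaptor)
        # pairs. An adaptor of None means "pass the value through unchanged".
        if isinstance(adaptors, dict):
            # {column: adaptor}; emitted in ascending column order (assumption).
            self._pairs = sorted(adaptors.items(), key=lambda kv: kv[0])
        elif adaptors is not None:
            # Plain callables apply positionally; (column, adaptor) tuples
            # select and order columns explicitly.
            self._pairs = [item if isinstance(item, tuple) else (i, item)
                           for i, item in enumerate(adaptors)]
        elif include is not None:
            self._pairs = [(i, default_adaptor) for i in include]
        else:
            # The column set depends on the row width, so it is resolved
            # per row inside __call__.
            self._pairs = None
        self._exclude = frozenset(exclude)
        self._default = default_adaptor

    def __call__(self, rows):
        """Lazily yield one transformed list per input row."""
        for row in rows:
            pairs = self._pairs
            if pairs is None:
                pairs = [(i, self._default) for i in range(len(row))
                         if i not in self._exclude]
            yield [row[i] if adapt is None else adapt(row[i])
                   for i, adapt in pairs]


rows = [line.split(',') for line in
        ("1,3.34,4-3.2j,John", "4,4,4,4", "0,-1.1,3.4,None")]

first_three = list(SimpleRowTransformer([int, float, complex])(rows))
picked = list(SimpleRowTransformer({1: float, 3: None})(rows))
reordered = list(SimpleRowTransformer([(2, complex), (0, int)])(rows))
without_last = list(
    SimpleRowTransformer(exclude=[2, 3], default_adaptor=float)(rows))
```

The sketch normalizes every spec to (column, adaptor) pairs once in the constructor, so the per-row work is a single list comprehension. It does not try to combine an adaptor spec with include or exclude, which the real module may well support.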