On Apr 29, 5:21 pm, Laszlo Nagy <gand...@shopzeus.com> wrote:
> I have some ten thousand rows in a table. E.g.
>
> columns = ["color","size","weight","value"]
> rows = [
>     [ "Yellow", "Big",   2,  4 ],
>     [ "Blue",   "Big",   3, -4 ],
>     [ "Blue",   "Small", 10, 55 ],
>     ...
> ]
>
> Some columns are dimensions, others are measures. I want to convert this
> to an indexed version that looks like this:
>
> dimension_names = ["color","size"]   # List of dimension names
> dimension_cols = [0,1]               # Column indexes for dimension values
> dimension_values = {                 # Dimension value occurrences for each dimension
>     0: ["Yellow","Blue",...],
>     1: ["Big","Small",...],
> }
>
> measure_names = ["weight","value"]   # List of measure names
> measure_cols = [2,3]                 # List of measure columns
> facts = [   # Facts, containing tuples of (dimension_value_indices, measure_values)
>     ( (0,0), (2,4) ),
>     ( (1,0), (3,-4) ),
>     ( (1,1), (10,55) ),
>     ...
> ]
>
> This is how I try to convert to the indexed version:
>
> #1. Iterate over all rows, and collect possible dimension values.
>
> cnt = 0
> for row in iterator_factory():
>     cnt += 1
>     for dimension_idx, col_idx in enumerate(dimension_cols):
>         dimension_values[col_idx].append(row[col_idx])
>         if cnt % 10000 == 0:
>             dimension_values[col_idx] = list(set(dimension_values[col_idx]))
>
> #2. Index facts by dimension values
>
> facts = []
> for row in iterator_factory():
>     dv = []
>     for dimension_idx, col_idx in enumerate(dimension_cols):
>         dv.append( dimension_values[col_idx].index(row[col_idx]) )  # THIS IS THE PROBLEMATIC LINE!
>     mv = [ row[col_idx] for col_idx in measure_cols ]
>     facts.append( (tuple(dv), tuple(mv)) )
>
> (In reality, rows and facts are not stored in memory, because there can
> be 100 000+ facts. I did not want to bore you with the full source code.)
>
> And finally, here is my problem. If a dimension has many possible values,
> then the list.index() call above slows down everything. In my test case,
> the problematic dimension had about 36 000 different values, and the
> number of rows was about 60 000. Calling index() on a list of 36 000
> values, 60 000 times, is too slow.
>
> Test performance was 262 rows/sec. If I use dv.append(0) instead of
> "dv.append( dimension_values[col_idx].index(row[col_idx]) )" then it goes
> up to 11520 rows/sec. If I exclude the problematic dimension and only
> leave the others (with 1000 or fewer values), it goes up to 3000 rows/sec.
>
> Maybe this should be implemented in C. But I believe that the algorithm
> itself must be wrong (regardless of the language). I really think that
> I'm doing something wrong. It looks like my algorithm's processing time
> is not linear in the number of rows - not even n*log(n). There should be
> a more effective way to do this. But how?
>
> I had the idea of sorting the rows by a given dimension. Then it would be
> straightforward to speed up the indexing part - for that dimension.
> PROBABLY sorting all rows would be faster than calling list.index() for
> each row. But... it does not seem very efficient either. What if I have
> 1 million rows and 10 dimensions? Do I sort 1 million rows on disk 10
> times? Some of you might have run into the same problem before, and can
> tell me which is the most efficient way to do this.
>
> Thanks,
> Laszlo
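As an aside on step #2 itself: list.index() is a linear scan, so each lookup
costs O(n) in the number of distinct values. The usual replacement is a dict
that maps each dimension value to its position, which makes the lookup
constant-time. A minimal sketch, reusing the names from the quoted snippet
(value_to_index is just an illustrative name, not from your code):

    # Build a value -> position mapping once per dimension, after step #1
    # has collected and deduplicated dimension_values.
    value_to_index = {}
    for col_idx in dimension_cols:
        value_to_index[col_idx] = dict(
            (val, pos) for pos, val in enumerate(dimension_values[col_idx])
        )

    # Step #2, with the dict lookup in place of list.index()
    facts = []
    for row in iterator_factory():
        dv = [value_to_index[col_idx][row[col_idx]] for col_idx in dimension_cols]
        mv = [row[col_idx] for col_idx in measure_cols]
        facts.append((tuple(dv), tuple(mv)))

That keeps the rest of your approach unchanged and should put the per-row cost
back in the same ballpark as your dv.append(0) test.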
Have you considered using a SQL database? Personally, I would use MS-Access
and link it to Python via ODBC. That way, I could use the Access
drag-and-drop design tools and either
- copy the SQL of working query designs into Python, or
- use ODBC to link to those queries rather than directly to the raw tables.
Of course, Python has SQLite built in now; I just don't know SQL that well.
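For what it's worth, the sqlite3 route needs no external database server and
maps fairly directly onto your structure. A rough sketch, purely illustrative
(the table and column names are my own, and SELECT DISTINCT order is not
guaranteed, so treat the rowid-based indices as an example rather than a spec):

    import sqlite3

    # Sample data from the original post
    rows = [
        ["Yellow", "Big",   2,  4],
        ["Blue",   "Big",   3, -4],
        ["Blue",   "Small", 10, 55],
    ]

    con = sqlite3.connect(":memory:")   # use a file for 100 000+ rows
    con.execute("CREATE TABLE raw (color TEXT, size TEXT, weight REAL, value REAL)")
    con.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", rows)

    # One small lookup table per dimension; its rowid plays the role of the index.
    for dim in ("color", "size"):
        con.execute("CREATE TABLE dim_%s (val TEXT UNIQUE)" % dim)
        con.execute("INSERT INTO dim_%s (val) SELECT DISTINCT %s FROM raw" % (dim, dim))

    # Join the raw rows back to the lookup tables to get indexed facts.
    facts = con.execute("""
        SELECT dc.rowid - 1, ds.rowid - 1, r.weight, r.value
        FROM raw r
        JOIN dim_color dc ON dc.val = r.color
        JOIN dim_size  ds ON ds.val = r.size
    """).fetchall()
    print(facts)   # e.g. [(0, 0, 2.0, 4.0), (1, 0, 3.0, -4.0), (1, 1, 10.0, 55.0)]

The UNIQUE constraint gives each dimension table an index, so the joins stay
reasonably fast even with tens of thousands of distinct values.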