On Aug 7, 1:41 pm, Laszlo Nagy <[EMAIL PROTECTED]> wrote: > Hi, > > I'm working on a pivot table. I would like to write it in Python. I > know, I should be doing that in C, but I would like to create a cross > platform version which can deal with smaller databases (not more than a > million facts). > > The data is first imported from a csv file: the user selects which > columns contain dimension and measure data (and which columns to > ignore). In the next step I would like to build up a database that is > efficient enough to be used for making pivot tables. Here is my idea for > the database: > > Original CSV file with column header and values: > > "Color","Year","Make","Price","VMax" > Yellow,2000,Ferrari,100000,254 > Blue,2003,Volvo,50000,210 > > Using the GUI, it is converted to this: > > dimensions = [ > { 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green', > 'Yellow' ], }, > { 'name':'Year', colindex:1, 'values':[ > 1995,1999,2000,2001,2002,2003,2007 ], }, > { 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford', > 'Lamborgini' ], }, > ] > measures = [ > { 'name', 'Price', 'colindex':3 }, > { 'name', 'Vmax', 'colindex':4 }, > ] > facts = [ > ( (3,2,0),(100000.0,254.0) ), # ( dimension_value_indexes, > measure_values ) > ( (1,5,1),(50000.0,210.0) ), > .... # Some million rows or less > ] > > The core of the idea is that, when using a relatively small number of > possible values for each dimension, the facts table becomes > significantly smaller and easier to process. (Processing the facts would > be: iterate over facts, filter out some of them, create statistical > values of the measures, grouped by dimensions.) > > The facts table cannot be kept in memory because it is too big. I need > to store it on disk, be able to read incrementally, and make statistics. > In most cases, the "statistic" will be simple sum of the measures, and > counting the number of facts affected. To be effective, reading the > facts from disk should not involve complex conversions. For this reason, > storing in CSV or XML or any textual format would be bad. I'm thinking > about a binary format, but how can I interface that with Python? > > I already looked at: > > - xdrlib, which throws me DeprecationWarning when I store some integers > - struct which uses format string for each read operation, I'm concerned > about its speed > > What else can I use? > > Thanks, > > Laszlo
Take a look at the mmap module. You get direct memory access, backed by the file system. struct + mmap, if you keep your strings small? -- http://mail.python.org/mailman/listinfo/python-list