On Aug 7, 2:27 pm, "M.-A. Lemburg" <[EMAIL PROTECTED]> wrote:
> On 2008-08-07 20:41, Laszlo Nagy wrote:
> > Hi,
> >
> > I'm working on a pivot table. I would like to write it in Python. I
> > know I should be doing that in C, but I would like to create a
> > cross-platform version which can deal with smaller databases (not
> > more than a million facts).
> >
> > The data is first imported from a CSV file: the user selects which
> > columns contain dimension and measure data (and which columns to
> > ignore). In the next step I would like to build up a database that is
> > efficient enough to be used for making pivot tables. Here is my idea
> > for the database:
> >
> > Original CSV file with column header and values:
> >
> > "Color","Year","Make","Price","VMax"
> > Yellow,2000,Ferrari,100000,254
> > Blue,2003,Volvo,50000,210
> >
> > Using the GUI, it is converted to this:
> >
> > dimensions = [
> >     { 'name': 'Color', 'colindex': 0,
> >       'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
> >     { 'name': 'Year', 'colindex': 1,
> >       'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
> >     { 'name': 'Make', 'colindex': 2,
> >       'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
> > ]
> > measures = [
> >     { 'name': 'Price', 'colindex': 3 },
> >     { 'name': 'Vmax', 'colindex': 4 },
> > ]
> > facts = [
> >     ( (3,2,0), (100000.0, 254.0) ),  # ( dimension_value_indexes, measure_values )
> >     ( (1,5,1), (50000.0, 210.0) ),
> >     ....  # Some million rows or less
> > ]
> >
> > The core of the idea is that, when using a relatively small number of
> > possible values for each dimension, the facts table becomes
> > significantly smaller and easier to process. (Processing the facts
> > would be: iterate over facts, filter out some of them, create
> > statistical values of the measures, grouped by dimensions.)
> >
> > The facts table cannot be kept in memory because it is too big. I need
> > to store it on disk, be able to read it incrementally, and make
> > statistics. In most cases, the "statistic" will be a simple sum of the
> > measures and a count of the facts affected. To be efficient, reading
> > the facts from disk should not involve complex conversions. For this
> > reason, storing in CSV or XML or any textual format would be bad. I'm
> > thinking about a binary format, but how can I interface that with
> > Python?
> >
> > I already looked at:
> >
> > - xdrlib, which throws a DeprecationWarning when I store some integers
> > - struct, which uses a format string for each read operation; I'm
> >   concerned about its speed
> >
> > What else can I use?
>
> >>> import marshal
> >>> marshal.dump(1, open('test.db', 'wb'))
> >>> marshal.load(open('test.db', 'rb'))
> 1
>
> It is also very fast at dumping/loading lists, tuples, dictionaries,
> floats, etc.
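To make the marshal suggestion concrete, here is a rough, untested sketch of
streaming the facts to disk one record at a time and reading them back
incrementally, summing the measures grouped by one dimension. The file name
'facts.db' and the helper names are made up; the record layout is the
(dimension_value_indexes, measure_values) tuple shown above.

import marshal

def write_facts(path, facts):
    # Append one marshalled tuple per fact; marshal.load reads them
    # back one at a time, so the whole table never sits in memory.
    f = open(path, 'wb')
    for fact in facts:
        marshal.dump(fact, f)
    f.close()

def sum_by_dimension(path, dim_index):
    # Incrementally read the facts, accumulating the sum of each
    # measure and a row count, grouped by one dimension's value index.
    totals = {}
    f = open(path, 'rb')
    while True:
        try:
            dims, measures = marshal.load(f)
        except EOFError:
            break
        key = dims[dim_index]
        sums, count = totals.get(key, ((0.0,) * len(measures), 0))
        sums = tuple(s + m for s, m in zip(sums, measures))
        totals[key] = (sums, count + 1)
    f.close()
    return totals

write_facts('facts.db', [((3, 2, 0), (100000.0, 254.0)),
                         ((1, 5, 1), (50000.0, 210.0))])
print(sum_by_dimension('facts.db', 0))  # grouped by the Color dimension

There is no per-read format string with marshal, and any filtering can happen
in the same loop before the measures are added to the totals.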
Depending on how hard-core you want to be, store the int, float, string, and
long C structures directly to disk, at a given offset. Either use fixed-length
strings, or implement (or find) a memory manager. Does anyone have a good
alloc/realloc/free library, in C or Python?
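For the fixed-record variant, a rough sketch with the struct module (the field
layout -- three dimension indexes as unsigned shorts plus two doubles for the
measures -- and the file name 'facts.dat' are just assumptions based on the
example data):

import struct

# One fact = (dim0, dim1, dim2, price, vmax): 3*2 + 2*8 = 22 bytes.
FACT = struct.Struct('<HHHdd')

def append_fact(f, dims, measures):
    f.write(FACT.pack(dims[0], dims[1], dims[2],
                      measures[0], measures[1]))

def read_fact(f, index):
    # Fixed-length records turn random access into a single seek.
    f.seek(index * FACT.size)
    d0, d1, d2, price, vmax = FACT.unpack(f.read(FACT.size))
    return (d0, d1, d2), (price, vmax)

f = open('facts.dat', 'wb')
append_fact(f, (3, 2, 0), (100000.0, 254.0))
append_fact(f, (1, 5, 1), (50000.0, 210.0))
f.close()

f = open('facts.dat', 'rb')
print(read_fact(f, 1))   # ((1, 5, 1), (50000.0, 210.0))
f.close()

Precompiling the format with struct.Struct means the format string is parsed
once rather than on every read, which should help with the speed concern about
struct. As long as every record is the same size there is no need for a memory
manager; that only comes up if variable-length strings have to live in the
same file.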