On Aug 10, 4:58 am, castironpi <[EMAIL PROTECTED]> wrote:
> On Aug 7, 2:27 pm, "M.-A. Lemburg" <[EMAIL PROTECTED]> wrote:
> > On 2008-08-07 20:41, Laszlo Nagy wrote:
> > > Hi,
> > >
> > > I'm working on a pivot table. I would like to write it in Python. I
> > > know I should be doing that in C, but I would like to create a
> > > cross-platform version which can deal with smaller databases (not
> > > more than a million facts).
> > >
> > > The data is first imported from a CSV file: the user selects which
> > > columns contain dimension and measure data (and which columns to
> > > ignore). In the next step I would like to build up a database that
> > > is efficient enough to be used for making pivot tables. Here is my
> > > idea for the database:
> > >
> > > Original CSV file with column header and values:
> > >
> > > "Color","Year","Make","Price","VMax"
> > > Yellow,2000,Ferrari,100000,254
> > > Blue,2003,Volvo,50000,210
> > >
> > > Using the GUI, it is converted to this:
> > >
> > > dimensions = [
> > >     { 'name':'Color', 'colindex':0,
> > >       'values':[ 'Red', 'Blue', 'Green', 'Yellow' ] },
> > >     { 'name':'Year', 'colindex':1,
> > >       'values':[ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
> > >     { 'name':'Make', 'colindex':2,
> > >       'values':[ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
> > > ]
> > > measures = [
> > >     { 'name':'Price', 'colindex':3 },
> > >     { 'name':'Vmax', 'colindex':4 },
> > > ]
> > > facts = [
> > >     ( (3,2,0), (100000.0,254.0) ),  # (dimension_value_indexes, measure_values)
> > >     ( (1,5,1), (50000.0,210.0) ),
> > >     ...  # Some million rows or less
> > > ]
> > >
> > > The core of the idea is that, when using a relatively small number
> > > of possible values for each dimension, the facts table becomes
> > > significantly smaller and easier to process. (Processing the facts
> > > would be: iterate over the facts, filter some of them out, and
> > > compute statistics on the measures, grouped by dimension.)
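[A minimal sketch of the encoding step described above, not from the thread: each dimension column is replaced by the index of its value in a per-dimension value list, built on the fly, so every CSV row becomes a compact ((dimension_indexes), (measure_values)) tuple. Column positions and names are assumptions taken from the example data.]

```python
import csv
import io

# Example data from the post, inlined for a self-contained demo.
CSV_DATA = '''"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210
'''

dim_cols = [0, 1, 2]      # Color, Year, Make
measure_cols = [3, 4]     # Price, VMax

reader = csv.reader(io.StringIO(CSV_DATA))
header = next(reader)

# One value -> index mapping per dimension column, grown as values appear.
value_maps = [{} for _ in dim_cols]
facts = []
for row in reader:
    key = tuple(m.setdefault(row[c], len(m))
                for m, c in zip(value_maps, dim_cols))
    measures = tuple(float(row[c]) for c in measure_cols)
    facts.append((key, measures))

print(facts)
# First fact: ((0, 0, 0), (100000.0, 254.0)) -- all three values are
# the first ones seen in their respective dimensions.
```

Because the indexes are assigned in order of first appearance, they differ from the hand-written example above, but the structure is the same.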
> > > The facts table cannot be kept in memory because it is too big. I
> > > need to store it on disk, read it incrementally, and compute
> > > statistics. In most cases, the "statistic" will be a simple sum of
> > > the measures and a count of the facts affected. To be efficient,
> > > reading the facts from disk should not involve complex conversions.
> > > For this reason, storing in CSV or XML or any textual format would
> > > be bad. I'm thinking about a binary format, but how can I interface
> > > that with Python?
> > >
> > > I already looked at:
> > >
> > > - xdrlib, which throws a DeprecationWarning when I store some integers
> > > - struct, which uses a format string for each read operation; I'm
> > >   concerned about its speed
> > >
> > > What else can I use?
> >
> > >>> import marshal
> > >>> marshal.dump(1, open('test.db', 'wb'))
> > >>> marshal.load(open('test.db', 'rb'))
> > 1
> >
> > It is also very fast at dumping/loading lists, tuples, dictionaries,
> > floats, etc.
>
> Depending on how hard-core you want to be, store the int, float,
> string, and long C structures directly to disk, at a given offset.
> Either use fixed-length strings, or implement (or find) a memory
> manager. Anyone have a good alloc-realloc-free library, C or Python?
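[On the struct speed concern: a precompiled struct.Struct parses the format string once and reuses it for every record. Below is a hedged sketch of a fixed-record binary facts file under an assumed layout (three dimension indexes as unsigned shorts, two measures as doubles); the filename and record layout are illustrative, not from the thread.]

```python
import struct

# '<3H2d': little-endian, 3 unsigned shorts + 2 doubles = 22 bytes/record.
rec = struct.Struct('<3H2d')

facts = [
    ((3, 2, 0), (100000.0, 254.0)),
    ((1, 5, 1), (50000.0, 210.0)),
]

# Write all facts as fixed-size binary records.
with open('facts.bin', 'wb') as f:
    for dims, measures in facts:
        f.write(rec.pack(*dims, *measures))

# Read back incrementally, one record at a time, accumulating the kind
# of sum/count statistic the post describes.
total_price = 0.0
count = 0
with open('facts.bin', 'rb') as f:
    while True:
        chunk = f.read(rec.size)
        if not chunk:
            break
        *dims, price, vmax = rec.unpack(chunk)
        total_price += price
        count += 1

print(count, total_price)   # 2 150000.0
```

Fixed-size records also make random access trivial: fact number n lives at byte offset n * rec.size, so f.seek(n * rec.size) jumps straight to it without a memory manager.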
A long time ago, when I last needed to bother about such things (to
override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
the trick.

A memory allocator written in Python? That's a novel concept.
--
http://mail.python.org/mailman/listinfo/python-list