On 2008-08-07 20:41, Laszlo Nagy wrote:

 Hi,

I'm working on a pivot table. I would like to write it in Python. I know I should be doing it in C, but I would like to create a cross-platform version that can deal with smaller databases (no more than a million facts).

The data is first imported from a CSV file: the user selects which columns contain dimension and measure data (and which columns to ignore). In the next step I would like to build a database that is efficient enough to be used for making pivot tables. Here is my idea for the database:

Original CSV file with column header and values:

"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210

Using the GUI, it is converted to this:

dimensions = [
    { 'name': 'Color', 'colindex': 0,
      'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
    { 'name': 'Year', 'colindex': 1,
      'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
    { 'name': 'Make', 'colindex': 2,
      'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
]
measures = [
    { 'name': 'Price', 'colindex': 3 },
    { 'name': 'VMax', 'colindex': 4 },
]
facts = [
    ( (3, 2, 0), (100000.0, 254.0) ),  # ( dimension_value_indexes, measure_values )
    ( (1, 5, 1), (50000.0, 210.0) ),
    # ... some million rows or less
]
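
For illustration, a minimal sketch of how that encoding could be built during the CSV import (the file name 'cars.csv' is an assumption; the column layout is taken from the example above):

import csv

dim_cols = [0, 1, 2]        # Color, Year, Make
measure_cols = [3, 4]       # Price, VMax

value_index = [{} for _ in dim_cols]   # per dimension: value -> small integer
facts = []

with open('cars.csv') as f:
    reader = csv.reader(f)
    next(reader)                       # skip the header row
    for row in reader:
        # setdefault() hands out the next free index the first time a value is seen
        key = tuple(value_index[d].setdefault(row[c], len(value_index[d]))
                    for d, c in enumerate(dim_cols))
        measures = tuple(float(row[c]) for c in measure_cols)
        facts.append((key, measures))

# Recover the ordered 'values' lists shown above:
dim_values = [sorted(vi, key=vi.get) for vi in value_index]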


The core of the idea is that with a relatively small number of possible values for each dimension, the facts table becomes significantly smaller and easier to process. (Processing the facts means: iterate over them, filter some of them out, and compute statistics on the measures, grouped by dimensions.)
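
A sketch of that processing step, reusing the facts structure above (the helper name and the example filter are illustrative):

from collections import defaultdict

def aggregate(facts, group_dim, keep=lambda key: True):
    # per group: [price_sum, vmax_sum, fact_count]
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for key, (price, vmax) in facts:
        if keep(key):
            acc = totals[key[group_dim]]
            acc[0] += price
            acc[1] += vmax
            acc[2] += 1
    return totals

# e.g. totals per Color, restricted to facts whose Year index is 2:
per_color = aggregate(facts, 0, keep=lambda key: key[1] == 2)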

The facts table cannot be kept in memory because it is too big. I need to store it on disk, read it back incrementally, and compute the statistics. In most cases the "statistic" will be a simple sum of the measures and a count of the facts affected. To be efficient, reading the facts from disk should not involve complex conversions; for that reason, storing them in CSV, XML, or any other textual format would be bad. I'm thinking about a binary format, but how can I interface that with Python?

I already looked at:

- xdrlib, which raises a DeprecationWarning when I store some integers
- struct, which parses a format string for each read operation; I'm concerned about its speed (but see the sketch below)
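
For what it's worth, struct.Struct (available since Python 2.5) precompiles the format string, so the parsing cost is paid once per record layout rather than once per read; a minimal sketch, with the record layout assumed from the example above:

import struct

record = struct.Struct('<3H2d')   # 3 uint16 dimension indexes, 2 doubles

packed = record.pack(3, 2, 0, 100000.0, 254.0)
assert record.unpack(packed) == (3, 2, 0, 100000.0, 254.0)
print(record.size)                # fixed record size in bytes: 22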

What else can I use?

>>> import marshal
>>> marshal.dump(1, open('test.db', 'wb'))
>>> marshal.load(open('test.db', 'rb'))
1

It's also very fast at dumping/loading lists, tuples, dictionaries,
floats, etc.
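
A minimal sketch of how marshal could cover the incremental-read requirement: repeated dump() calls append records to the same file, and load() reads them back one at a time, raising EOFError at end of file (the file name and record layout are assumptions from the thread above):

import marshal

with open('facts.db', 'wb') as f:
    for fact in facts:
        marshal.dump(fact, f)      # one record per dump() call

total_price = 0.0
count = 0
with open('facts.db', 'rb') as f:
    while True:
        try:
            key, (price, vmax) = marshal.load(f)
        except EOFError:
            break                  # end of file reached
        total_price += price
        count += 1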

--
Marc-Andre Lemburg
eGenix.com
