Re: Fastest way to store ints and floats on disk

Emile van Sebille Thu, 07 Aug 2008 15:43:44 -0700

Laszlo Nagy wrote:

 Hi,
I'm working on a pivot table.

Hmm... I wrote an browser based analysis tool and used the working namepyvot...

I found Numeric to provide the best balance of memory footprint andspeed. I also segregated data prep into a separate process to avoidexcessive memory use at run time. Turns out python

For the site I'm at, I've got 10 years sales history recapped from4327846 detail records into 458197 item by customer by month records andtop shows a 240Mb memory footprint. I've got 21 cross indexed selectionfields, and can display up to six data types (qty, price, sqft, cost,gp%, avg). At another site I've got approx 8.5M records recapped into1M records with 15 indexes and 5 years monthly history living in a 540Mbmemory footprint.

It's reasonably quick: a query like 'select san mateo, foster city andsan carlos accounts, sort by customer and product category and displaythis year's sales by month' selects 260 records and renders in thebrowser in about 2 seconds. Or on the larger installation 'Show salesfor the past five years for product group 12 sorted by city withinroute' selects 160 records and renders in about 3 seconds.


My objective was to keep the info in memory for fast response times.

I played a lot of games getting this all to work well, including some cextensions, but Numeric's take, sum, tostring and fromstring ended upwith 'pivotal' roles. :)


Regards,

Emile

I would like to write it in Python. Iknow, I should be doing that in C, but I would like to create a crossplatform version which can deal with smaller databases (not more than amillion facts).
The data is first imported from a csv file: the user selects whichcolumns contain dimension and measure data (and which columns toignore). In the next step I would like to build up a database that isefficient enough to be used for making pivot tables. Here is my idea forthe database:
Original CSV file with column header and values:

"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210

Using the GUI, it is converted to this:

dimensions = [
{ 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green','Yellow' ], },{ 'name':'Year', colindex:1, 'values':[1995,1999,2000,2001,2002,2003,2007 ], },{ 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford','Lamborgini' ], },
]
measures = [
   { 'name', 'Price', 'colindex':3 },
   { 'name', 'Vmax', 'colindex':4 },
]
facts = [
( (3,2,0),(100000.0,254.0) ), # ( dimension_value_indexes,measure_values )
   ( (1,5,1),(50000.0,210.0) ),
  .... # Some million rows or less
]
The core of the idea is that, when using a relatively small number ofpossible values for each dimension, the facts table becomessignificantly smaller and easier to process. (Processing the facts wouldbe: iterate over facts, filter out some of them, create statisticalvalues of the measures, grouped by dimensions.)
The facts table cannot be kept in memory because it is too big. I needto store it on disk, be able to read incrementally, and make statistics.In most cases, the "statistic" will be simple sum of the measures, andcounting the number of facts affected. To be effective, reading thefacts from disk should not involve complex conversions. For this reason,storing in CSV or XML or any textual format would be bad. I'm thinkingabout a binary format, but how can I interface that with Python?
I already looked at:

- xdrlib, which throws me DeprecationWarning when I store some integers
- struct which uses format string for each read operation, I'm concernedabout its speed
What else can I use?

Thanks,

  Laszlo



--
http://mail.python.org/mailman/listinfo/python-list


--
http://mail.python.org/mailman/listinfo/python-list

Re: Fastest way to store ints and floats on disk

Reply via email to