Hi Alan,
Thank you very much for the immediate response. I didn't get
the point of loading the data using SQL. What does that mean?
Does it mean a bulk file load? If so, there will be an I/O
performance hit, right? I would have to write the parsed
tokens to a file in CSV format, and only after that could I
bulk load the file.
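
Just so I am sure I understand, something like this is what I
have in mind - a minimal sketch, assuming the psycopg2 driver
and a hypothetical sequences(name, data) table (the input
format and the parse step are made up for illustration):

import csv
import psycopg2  # PostgreSQL driver, assumed to be installed

def parse_line(line):
    # Hypothetical parser - the real bio-format logic goes here.
    name, data = line.rstrip("\n").split("\t", 1)
    return name, data

# Step 1: write the parsed tokens out to a CSV file.
with open("bio_data.txt") as src, \
     open("bio_data.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        writer.writerow(parse_line(line))

# Step 2: bulk load the whole file with a single COPY statement.
conn = psycopg2.connect("dbname=biodb")  # assumed connection string
cur = conn.cursor()
with open("bio_data.csv") as f:
    cur.copy_expert("COPY sequences (name, data) FROM STDIN WITH CSV", f)
conn.commit()
conn.close()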
Do you have any idea about the C API for PostgreSQL, and
where I can find documentation on using it?
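
So far I have found that psycopg2 itself is built on libpq
(PostgreSQL's C library, documented in the "libpq - C
Library" chapter of the PostgreSQL manual), and that its
copy_from() streams rows straight into COPY, which would
avoid writing the intermediate CSV file at all. A rough
sketch, with the same assumed table and made-up input
format as above:

import io
import psycopg2

# Stream parsed rows through an in-memory buffer instead of a
# temporary CSV file on disk. (For 100s of GB you would feed
# the data in chunks rather than one giant buffer.)
buf = io.StringIO()
with open("bio_data.txt") as src:
    for line in src:
        name, data = line.rstrip("\n").split("\t", 1)  # hypothetical format
        buf.write(name + "\t" + data + "\n")
buf.seek(0)

conn = psycopg2.connect("dbname=biodb")  # assumed connection string
cur = conn.cursor()
# copy_from() drives the COPY protocol through libpq underneath.
cur.copy_from(buf, "sequences", sep="\t", columns=("name", "data"))
conn.commit()
conn.close()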
Thanks and Regards,
Shadab.
--- Alan Gauld <[EMAIL PROTECTED]> wrote:

> "Shadab Sayani" <[EMAIL PROTECTED]> wrote 
> 
> > The data I need to deal with is in 100s of GB.
> > I am using a PostgreSQL backend and the SQLAlchemy ORM.
> 
> All ORMs will introduce a significant performance hit.
> If you really need high speed - and populating a 100 GB+
> database probably counts - then you should look at raw
> SQL. In fact, for really big data loads most databases
> provide a C API that goes under the SQL, because even SQL
> is relatively slow.
> 
> As an example, we use a large Oracle database at work.
> Loading about 3 terabytes of data through an ORM took
> over 3 months! Loading it through SQL took about 3 days.
> Loading it through the C API took less than a day.
> 
> Your mileage may vary since a lot depends on locks,
> indexes, etc. And of course the server spec!
> 
> But for loading large data volumes, get as close to the
> metal as possible. Once the data is loaded you can use
> the ORM to simplify the application code for extracting
> and modifying the data.
> 
> > I need to read the bio datafiles and parse them and
> > then store them in the database.
> 
> Parsing them and preparing the SQL statements can
> be done in Python. But the actual loading, I suggest,
> should be done in SQL if possible (the C API should
> be a last resort - it's fraught with danger!)
> 
> > Please suggest some viable solution to handle such
> > enormous data from Python.
> 
> A few hundred gigabytes is not too enormous these days,
> but you are never going to achieve times of less than
> hours. You do need to be realistic about that. And if
> you are using a standard PC-spec server instead of a
> large multi-CPU box with SCSI/RAID disc arrays etc. then
> you could be looking at days.
> 
> The other important factor is your data schema. The
> more tables, joins, indexes etc. the database has to
> maintain, the more work it takes and the slower it gets.
> The 3TB example I gave had over 2000 tables, so it was
> always going to be slow. If you have a single unindexed
> table then it will be much simpler. (But the queries
> later will be much harder!)
> 
> -- 
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
> 


_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
