Hi Alan,

Thank you very much for the immediate response. I didn't get the point of
loading the data using SQL. What does that mean? Does it mean a bulk file
load? If so, won't there be an I/O performance hit? I would be writing the
parsed tokens to a file in CSV format, and only after that could I bulk
load this file.

Do you have any idea about the C API for PostgreSQL, and some
documentation on how to use it?

Thanks and regards,
Shadab.

--- Alan Gauld <[EMAIL PROTECTED]> wrote:
> "Shadab Sayani" <[EMAIL PROTECTED]> wrote
>
> > The data I need to deal with is in 100s of GB.
> > I am using postgresql backend and SQLALCHEMY ORM.
>
> All ORMs will introduce a significant performance hit.
> If you really need high speed, and populating a 100GB+ database
> probably is included in that, then you should look at raw SQL.
> In fact for really big data loads most databases provide a
> C API that goes under the SQL, because even SQL is relatively slow.
>
> As an example, we use a large Oracle database at work.
> Loading about 3 terabytes of data through an ORM took
> over 3 months! Loading it through SQL took about 3 days.
> Loading it through the C API took less than a day.
>
> Your mileage may vary, since a lot depends on locks,
> indexes etc. And of course the server spec!
>
> But for loading large data volumes get as close to the metal
> as possible. Once the data is loaded you can use the
> ORM to simplify the application code for extracting and
> modifying the data.
>
> > I need to read the bio datafiles and parse them and
> > then store them in database.
>
> Parsing them and preparing the SQL statements can
> be done in Python. But the actual loading I suggest
> should be done in SQL if possible. (The C API should
> be a last resort - it's fraught with danger!)
>
> > Please suggest some viable solution to handle such
> > enormous data from python.
>
> A few hundred gigabytes is not too enormous these days,
> but you are never going to achieve times of less than hours.
> You do need to be realistic about that. And if you are using
> a standard PC-spec server instead of a large multi-CPU
> box with SCSI/RAID disc arrays etc. then you could be
> looking at days.
>
> The other important factor is your data schema. The more
> tables, joins, and indexes the database has to maintain, the
> more work it takes and the slower it gets.
> The 3TB example
> I gave had over 2000 tables, so it was always going to be
> slow. If you have a single unindexed table then it will be
> much simpler. (But the queries later will be much harder!)
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
>
> _______________________________________________
> Tutor maillist - [email protected]
> http://mail.python.org/mailman/listinfo/tutor
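[For the archive: a minimal sketch of the "parse in Python, then bulk load" approach discussed above, assuming psycopg2 as the driver, PostgreSQL's COPY command as the bulk path, and a hypothetical whitespace-separated record format with a table name of "sequences" - none of these specifics come from the thread itself.]

```python
import csv
import io

def parse_line(line):
    """Split one record into tokens (hypothetical whitespace-separated format)."""
    return line.split()

def write_csv(records, fileobj):
    """Write parsed tokens out as CSV rows, ready for a bulk load."""
    writer = csv.writer(fileobj)
    for line in records:
        writer.writerow(parse_line(line))

def bulk_load(conn, csv_path, table="sequences"):
    """Stream a CSV file into PostgreSQL with COPY.

    COPY moves whole files through a single statement, so it avoids the
    per-row overhead of ORM inserts or individual INSERT statements.
    Requires an open psycopg2 connection; table name is an assumption.
    """
    import psycopg2  # assumed driver; any libpq wrapper exposing COPY works

    with conn.cursor() as cur, open(csv_path) as f:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", f)
    conn.commit()

if __name__ == "__main__":
    # Demonstrate the parse-and-write half without needing a database.
    buf = io.StringIO()
    write_csv(["seq1 ACGT 4", "seq2 GATTACA 7"], buf)
    print(buf.getvalue())
```

The one file write does cost extra I/O, as Shadab suspects, but a single sequential write plus one COPY is generally far cheaper than hundreds of millions of individual INSERTs through an ORM.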
