On Tue, 28 Jan 2014 21:25:54 -0800, Ayushi Dalmia wrote:

> Hello,
>
> I am trying to implement IBM Model 1. In that I need to create a matrix
> of 50000*50000 with double values. Currently I am using dict of dict but
> it is unable to support such high dimensions and hence gives memory
> error. Any help in this regard will be useful. I understand that I
> cannot store the matrix in the RAM but what is the most efficient way to
> do this?
This looks to me like a case where a database table with columns:

word1 (varchar 20) | word2 (varchar 20) | connection (double)

might be your best solution, but it's going to be a huge table (50000 * 50000 = 2.5 billion rows).

The primary key is going to be the combination of all 3 columns (or possibly just the combination of word1 and word2), and you want indexes on word1 and word2. The indexes will slow down populating the table but speed up searching it, and I assume that searching is going to be a much more frequent operation than populating.

Creating a database has the additional advantage that the next time you want to use the program for a conversion between two languages that you've previously built the data for, the data already exists in the database, so you don't need to build it again. I imagine you would have either one table for each language pair, or one table for each conversion (treating a->b and b->a as two separate conversions).

I'm also guessing that varchar 20 is long enough to hold any of your 50,000 words in either language; that value might need adjusting otherwise.

-- 
Denis McMahon, denismfmcma...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list
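PS: a minimal sketch of that table using Python's built-in sqlite3 module (table and column names here are illustrative, not from your code; an on-disk file path would replace ":memory:" for real use):

```python
import sqlite3

# In-memory DB for the sketch; use a filename, e.g. "model1.db", to persist.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# word1/word2 are the two words; prob is the Model 1 translation value.
# The composite primary key already gives an index over (word1, word2);
# the extra index on word2 alone speeds up lookups from the other side.
cur.execute("""
    CREATE TABLE translation (
        word1 TEXT NOT NULL,
        word2 TEXT NOT NULL,
        prob  REAL NOT NULL,
        PRIMARY KEY (word1, word2)
    )
""")
cur.execute("CREATE INDEX idx_word2 ON translation (word2)")

# Bulk-populate with executemany; wrap big loads in one transaction.
pairs = [("house", "maison", 0.8), ("house", "domicile", 0.2)]
cur.executemany("INSERT INTO translation VALUES (?, ?, ?)", pairs)
conn.commit()

# Look up a single probability.
cur.execute("SELECT prob FROM translation WHERE word1 = ? AND word2 = ?",
            ("house", "maison"))
print(cur.fetchone()[0])  # -> 0.8
```

Since most entries of a 50000 x 50000 Model 1 matrix are (or round to) zero, you'd only store the nonzero pairs, which keeps the table far smaller than 2.5 billion rows in practice.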