On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe <scott.marl...@gmail.com> wrote:
>
> On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <st...@blighty.com> wrote:
> > A database isn't really the right way to do full text search for single
> > files that big. Even if they'd fit in the database it's way bigger than the
> > underlying index types tsquery uses are designed for.
> >
> > Are you sure that the documents are that big? A single document of that
> > size would be 400 times the size of the bible. That's a ridiculously large
> > amount of text, most of a small library.
> >
> > If the answer is "yes, it's really that big and it's really text" then look
> > at clucene or, better, hiring a specialist.
>
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text. But even those bear breaking
> down into some kind of simple normalization scheme don't they?
A single genome is ~1.3 GB stored as chars, and half that if you pack 4 bits per nucleotide (which should work for at least 90% of the use cases). The simplest design is to store a single reference sequence and then, for everything else, store only deltas from it. On average that should require about 3-5% of the reference sequence per comparative sample (not counting FKs and indexes). A rough sketch of the kind of layout I mean is below my sig.

As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
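For anyone curious, something along these lines is what I have in mind. Table and column names here are made up for illustration only, not our actual schema, and the delta representation is deliberately simplified:

    -- One packed reference sequence per organism/build.
    -- 4 bits per nucleotide: the IUPAC codes fit in a nibble.
    CREATE TABLE reference_genome (
        id        serial PRIMARY KEY,
        name      text   NOT NULL,
        sequence  bytea  NOT NULL
    );

    -- Each comparative sample points at the reference it was called against.
    CREATE TABLE sample (
        id            serial  PRIMARY KEY,
        reference_id  integer NOT NULL REFERENCES reference_genome(id),
        label         text    NOT NULL
    );

    -- A sample is stored only as its differences from the reference,
    -- which is why it averages a few percent of the full sequence.
    CREATE TABLE sample_delta (
        sample_id   integer NOT NULL REFERENCES sample(id),
        position    bigint  NOT NULL,  -- offset into the reference
        ref_allele  bytea   NOT NULL,  -- packed bases being replaced
        alt_allele  bytea   NOT NULL,  -- packed bases observed in the sample
        PRIMARY KEY (sample_id, position)
    );

Reconstructing a sample is then just the reference plus an ordered walk of its sample_delta rows; most of the storage and indexing cost lives in the delta table rather than in the sequences themselves.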