For documents that long I would seriously consider using large objects and refencing them with their OIDs. Text fields get put in a special location within the database. It's similar (possibly exactly) to using large objects. Also, you can potentially compress them to save space on write and read. 1gb of text is a lot of text.
See https://www.postgresql.org/docs/current/largeobjects.html Thanks, Ben On Wed, Apr 12, 2023, 1:20 PM Joe Carlson <jwcarl...@lbl.gov> wrote: > I’ve certainly thought about using a different representation. A factor of > 2x would be good, for a while anyway. For nucleotide sequence, we’d need to > consider a 10 character alphabet (A, C, G, T, N and the lower case forms > when representing ’soft masked’ sequence*). So it would be 2 bases/byte. > (Proteins are not nearly so long so a straight storage is simpler.) But > these would be bigger changes on the client side than storing in chunks so > I think this is the way to go. > > We’re working with plant genomes, which compared to human chromosomes, are > HUGE. One chromosome of fava bean is over a gig. And pine tree is another > monster. This, together with the fact that sequence data collection and > assembly have improved so much in the past couple years has forced us to > rethink a lot of our data storage assumptions. > > * for those curious, especially in plants, much of sequence consists of > repetitive element that are remnants of ancient viruses, simple repeats and > the like. For people who want to identify particular functional components > in a genome, they typically do not want to search against this sequence but > restrict searching to coding regions. But the repetitive sequence is still > important and we need to keep it. > > > On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dil...@enterprisedb.com> > wrote: > > > > > > > >> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarl...@lbl.gov> wrote: > >> > >> The use case is genomics. Extracting substrings is common. So going to > chunked storage makes sense. > > > > Are you storing nucleotide sequences as text strings? If using the > simple 4-character (A,C,G,T) alphabet, you can store four bases per byte. > If using a nucleotide code 16-character alphabet you can still get two > bases per byte. An amino acid 20-character alphabet can be stored 8 bases > per 5 bytes, and so forth. Such a representation might allow you to store > sequences two or four times longer than the limit you currently hit, but > then you are still at an impasse. Would a factor or 2x or 4x be enough for > your needs? > > > > — > > Mark Dilger > > EnterpriseDB: http://www.enterprisedb.com > > The Enterprise PostgreSQL Company > > > > > > > > > >