TEXT column > 1Gb
Hello,

I've recently encountered the issue of trying to insert more than 1 GB into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1 GB buffer size limitation.

We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn't; but it's not my design and, in fairness, this wasn't anticipated when the schema was designed). I'd like to implement something that is not a major disruption and try to keep the mods on the server side. My first idea is to have a chunked associated table (in pseudo code):

CREATE TABLE associated(
  key_id     integer references main_table(key_id),
  chunk      integer,
  text_start integer,
  text_end   integer,
  text_chunk TEXT
);

and to define functions for inserting and selecting that divide the text into 1 MB chunks:

CREATE FUNCTION insertText(INTEGER, TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk      INTEGER := 0;
  chunk_size INTEGER := 1000000;   -- 1 MB chunks
  key_id     ALIAS FOR $1;
  the_text   ALIAS FOR $2;
  text_chunk TEXT;
BEGIN
  LOOP
    -- substr() is 1-based, so each chunk starts at chunk * chunk_size + 1
    text_chunk := substr(the_text, chunk * chunk_size + 1, chunk_size);
    IF length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    -- text_start/text_end are 0-based offsets, end exclusive
    INSERT INTO associated(key_id, chunk, text_start, text_end, text_chunk)
      VALUES (key_id, chunk, chunk * chunk_size,
              chunk * chunk_size + length(text_chunk), text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;

This apparently runs into the same issue of buffer size: I get an 'invalid message length' in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1 GB is also a bad thing.

I'd like to continue to keep the modifications on the server side. And I'd like to believe someone else has had this problem before. Any suggestions other than having the client do the chunking? Can I use a different language binding and get around the argument length limitations?

Thanks
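For completeness, here is a sketch of the selecting side to go with insertText() above, assuming the 0-based, end-exclusive text_start/text_end offsets it records; the name getTextRange is made up and this is untested, so treat it as a starting point:

CREATE FUNCTION getTextRange(p_key_id INTEGER, p_start INTEGER, p_len INTEGER)
RETURNS TEXT AS $$
  -- stitch together only the chunks overlapping [p_start, p_start + p_len)
  SELECT string_agg(
           substr(a.text_chunk,
                  greatest(p_start - a.text_start, 0) + 1,
                  least(a.text_end, p_start + p_len) - greatest(a.text_start, p_start)),
           '' ORDER BY a.chunk)
  FROM associated a
  WHERE a.key_id = p_key_id
    AND a.text_start < p_start + p_len
    AND a.text_end   > p_start;
$$ LANGUAGE sql STABLE;

Asking for the entire multi-gigabyte text in one call would of course still hit the 1 GB limit; the point is to fetch bounded ranges.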
Re: TEXT column > 1Gb
I'm curious what you learned. I've been tripping over the buffer allocation issue when either splitting input text into chunks or aggregating chunks in selects. I've decided that I need to move this to the client side.

The use case is genomics. Extracting substrings is common, so going to chunked storage makes sense. And we have a menagerie of code to deal with: legacy Perl for loading, and clients in Perl, Java, node and Python. A whole zoo.

Thanks

> On Apr 11, 2023, at 10:51 AM, Rob Sargent wrote:
>
> I've hit this same limitation in Java (with write to db). What is your stack
> in this case? Not sure my solution applies.
Re: TEXT column > 1Gb
I've certainly thought about using a different representation. A factor of 2x would be good, for a while anyway. For nucleotide sequence, we'd need to consider a 10-character alphabet (A, C, G, T, N and the lower case forms when representing 'soft masked' sequence*), so it would be 2 bases/byte. (Proteins are not nearly so long, so straight storage is simpler.) But these would be bigger changes on the client side than storing in chunks, so I think chunking is the way to go.

We're working with plant genomes, which, compared to human chromosomes, are HUGE. One chromosome of fava bean is over a gig. And pine tree is another monster. This, together with the fact that sequence data collection and assembly have improved so much in the past couple of years, has forced us to rethink a lot of our data storage assumptions.

* For those curious: especially in plants, much of the sequence consists of repetitive elements that are remnants of ancient viruses, simple repeats and the like. People who want to identify particular functional components in a genome typically do not want to search against this sequence, but restrict searching to coding regions. But the repetitive sequence is still important and we need to keep it.

> On Apr 12, 2023, at 10:04 AM, Mark Dilger wrote:
>
>> On Apr 12, 2023, at 7:59 AM, Joe Carlson wrote:
>>
>> The use case is genomics. Extracting substrings is common. So going to
>> chunked storage makes sense.
>
> Are you storing nucleotide sequences as text strings? If using the simple
> 4-character (A,C,G,T) alphabet, you can store four bases per byte. If using
> a nucleotide code 16-character alphabet you can still get two bases per byte.
> An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and
> so forth. Such a representation might allow you to store sequences two or
> four times longer than the limit you currently hit, but then you are still at
> an impasse. Would a factor of 2x or 4x be enough for your needs?
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
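For what it's worth, the 2 bases/byte packing for that 10-letter soft-masked alphabet can be expressed entirely in SQL by mapping bases onto hex digits; pack_seq/unpack_seq are made-up names, the sketch assumes an even-length input (odd lengths would need a padding base), and the packed bytea is still capped at 1 GB, so it only buys the 2x factor discussed above:

CREATE FUNCTION pack_seq(seq TEXT) RETURNS BYTEA AS $$
  -- map each base to one hex digit, then decode two digits per byte
  SELECT decode(translate(seq, 'ACGTNacgtn', '0123456789'), 'hex');
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION unpack_seq(packed BYTEA) RETURNS TEXT AS $$
  -- hex-encode the bytes and translate the digits back to bases
  SELECT translate(encode(packed, 'hex'), '0123456789', 'ACGTNacgtn');
$$ LANGUAGE sql IMMUTABLE;

-- round trip:
SELECT unpack_seq(pack_seq('ACGTNacgtnAC'));   -- returns 'ACGTNacgtnAC'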
Re: TEXT column > 1Gb
> On Apr 12, 2023, at 12:21 PM, Rob Sargent wrote:
>
> On 4/12/23 13:02, Ron wrote:
>> Must the genome all be in one big file, or can you store them one line per
>> table row?

The assumption in the schema I'm using is 1 chromosome per record. Chromosomes are typically strings of continuous sequence (A, C, G, or T) separated by gaps (N) of approximately known, or completely unknown, size. In the past this has not been a problem, since sequenced chromosomes were maybe 100 megabases. But sequencing is better now with the technology improvements, and we are tackling more complex genomes. So gigabase chromosomes are common.

A typical use case might be someone interested in seeing if they can identify the regulatory elements (the on or off switches) of a gene. The protein coding part of a gene can be predicted pretty reliably, but the upstream untranslated region and regulatory elements are tougher. So they might come to our web site and want to extract the 5 kb bit of sequence before the start of the gene and look for some of the common motifs that signify a protein binding site. Being able to pull out a substring of the genome to drive a web app is something we want to do quickly. (A sketch of such a query against the chunked table appears at the end of this message.)

> Not sure what OP is doing with plant genomes (other than some genomics) but
> the tools all use files and pipelines of sub-tools. In and out of tuples
> would be expensive. Very, very little "editing" done in the usual "update
> table set val where id" sense.

Yeah, it's basically a warehouse. Stick data in, but then make all the connections between the functional elements, their products and the predictions on the products. It's definitely more than a document store and we require a relational database.

> Lines in a vcf file can have thousands of columns of nasty, cryptic garbage
> data that only really makes sense to the tools reading it. Highly
> denormalized of course. (Btw, I hate sequencing :) )

Imagine a discipline where some beleaguered grad student has to get something out the door by the end of the term. It gets published and the rest of the community says GREAT! We have a standard! Then the abuse of the standard happens. People who specialize in bioinformatics know just enough computer science, statistics and molecular biology to annoy experts in three different fields.
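As an illustration of that use case, a query along these lines could pull the 5 kb upstream of a forward-strand gene from the chunked table, using the getTextRange() sketch from earlier in the thread; the gene table and its columns (name, key_id, gene_start, with gene_start a 0-based offset) are hypothetical:

SELECT getTextRange(g.key_id,
                    greatest(g.gene_start - 5000, 0),
                    least(g.gene_start, 5000)) AS upstream_5kb
FROM gene g
WHERE g.name = 'some_gene';
-- a reverse-strand gene would need the 5 kb after its end coordinate,
-- reverse-complemented, instead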
custom background worker task for \copy?
Hi Folks,

I'm writing a little utility for dumping parts of tables into files which will later be slurped into another primary instance. The primary source is our referential data (big, and subject to heavy loads when adding data); the second is a smaller version used in driving our web app (smaller, portable, less prone to lags). Yes, a replication strategy can work, but since the web app version is so much smaller (10% of the size) I thought the partial snapshot would be easier to manage.

I have SQL that does it with \copy (select * from where…) … and that is fine. But it would be nice to be able to run the \copy commands in parallel. So I was thinking of writing a background worker. Never having done that before, I'm curious:

1) Is a background worker that I can execute in parallel appropriate for this job?
2) Are there non-trivial examples of background workers out there to copy and learn from?
3) Will doing multiple \copy's in parallel just be of no benefit? Since pg_dump and pg_restore have the option of running multiple parallel jobs, I thought the answer is that it should help.

Thanks

Joe
Re: custom background worker task for \copy?
Thanks. Yes, an option. Was considering that as well.

Joe

> On Jun 2, 2023, at 9:41 PM, Laurenz Albe wrote:
>
> On Fri, 2023-06-02 at 11:36 -0700, Joe Carlson wrote:
>> I have SQL that does it with \copy (select * from where…) … And that
>> is fine.
>> But it would be nice to be able to run the \copy commands in parallel.
>> So I was thinking of writing a background worker.
>
> Why don't you go the easy way of using several database connections to run the
> parallel COPY statements?
>
> Yours,
> Laurenz Albe
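One way to drive several connections from a single session is the dblink extension, sketched below; the connection string, table names, WHERE conditions and file paths are placeholders, and note that unlike psql's \copy, a server-side COPY ... TO <file> writes files on the database server and needs the pg_write_server_files role (or superuser):

CREATE EXTENSION IF NOT EXISTS dblink;

SELECT dblink_connect('c1', 'dbname=refdb');
SELECT dblink_connect('c2', 'dbname=refdb');

-- dblink_send_query() returns immediately, so both COPYs run concurrently
SELECT dblink_send_query('c1',
  $q$COPY (SELECT * FROM main_table WHERE key_id < 1000)
     TO '/tmp/main_part.csv' CSV$q$);
SELECT dblink_send_query('c2',
  $q$COPY (SELECT * FROM associated WHERE key_id < 1000)
     TO '/tmp/assoc_part.csv' CSV$q$);

-- collect the command status of each COPY; these calls block until done
SELECT * FROM dblink_get_result('c1') AS t1(status text);
SELECT * FROM dblink_get_result('c2') AS t2(status text);

SELECT dblink_disconnect('c1');
SELECT dblink_disconnect('c2');

If the files need to land on the client instead, the same idea applies with several plain client connections (psql sessions or whatever the loading code uses) each running its own \copy, which is what the suggestion above amounts to.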