TEXT column > 1Gb

2023-04-11 Thread Joe Carlson
Hello,

I’ve recently encountered the issue of trying to insert more than 1 Gb into a 
TEXT column. While the docs say TEXT is unlimited in length, I had been unaware 
of the 1 Gb buffer size limitation.

We can debate whether or not saving something this big in a single column is a 
good idea (spoiler: it isn’t, but it’s not my design and, in fairness, this 
wasn’t anticipated when the schema was designed). I’d like to implement 
something that is not a major disruption and keep the modifications on the 
server side. My first idea is a chunked associated table (in pseudo code)

CREATE TABLE associated(
  key_id integer REFERENCES main_table(key_id),
  chunk integer,
  text_start integer,
  text_end integer,
  text_chunk TEXT
);

And define functions for inserting and selecting that split the text into 1 MB 
chunks

CREATE FUNCTION insertText(p_key_id INTEGER, the_text TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk_size CONSTANT INTEGER := 1000000;  -- 1 MB chunks
  chunk INTEGER := 0;
  text_chunk TEXT;
BEGIN
  LOOP
    -- substr() is 1-based, so offset the start position by 1
    text_chunk := substr(the_text, chunk * chunk_size + 1, chunk_size);
    IF text_chunk IS NULL OR length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    INSERT INTO associated(key_id, chunk, text_start, text_end, text_chunk)
    VALUES (p_key_id, chunk, chunk * chunk_size,
            chunk * chunk_size + length(text_chunk), text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;
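
For the selection side, a sketch of a companion function (hypothetical name, 
same 0-based, end-exclusive offsets as above) that pulls back just a range 
without reassembling the whole column could look like this:

CREATE FUNCTION getTextRange(p_key_id INTEGER, p_start INTEGER, p_end INTEGER)
RETURNS TEXT AS $$
  -- Fetch only the chunks overlapping the requested range, concatenate them
  -- in chunk order, then trim to the exact substring.  Offsets are 0-based
  -- and end-exclusive, matching text_start/text_end in insertText().
  SELECT substr(string_agg(text_chunk, '' ORDER BY chunk),
                p_start - min(text_start) + 1,
                p_end - p_start)
  FROM associated
  WHERE key_id = p_key_id
    AND text_start < p_end
    AND text_end > p_start;
$$ LANGUAGE sql STABLE;

As long as callers ask for ranges rather than the whole column, the result also 
stays well under the 1 Gb limit.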

This apparently runs into the same buffer size issue: I get an ‘invalid 
message length’ error in the log file and the insert fails. I can see from 
adding notices in the code that I never enter the LOOP; I assume passing a 
function argument > 1 Gb is also a bad thing.

I’d like to keep the modifications on the server side, and I’d like to believe 
someone else has had this problem before. Any suggestions other than having the 
client do the chunking? Can I use a different language binding and get around 
the argument length limitation?
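
The fallback I can see, if the client has to send smaller pieces anyway, is an 
append-style function so that at least the chunk bookkeeping stays server side. 
A sketch (names are hypothetical, and chunks end up being whatever size the 
client sends):

CREATE FUNCTION appendText(p_key_id INTEGER, piece TEXT) RETURNS INTEGER AS $$
DECLARE
  next_chunk INTEGER;
  next_start INTEGER;
BEGIN
  -- continue the chunk numbering and offsets from what is already stored
  SELECT COALESCE(max(chunk) + 1, 0), COALESCE(max(text_end), 0)
    INTO next_chunk, next_start
    FROM associated
   WHERE key_id = p_key_id;

  INSERT INTO associated(key_id, chunk, text_start, text_end, text_chunk)
  VALUES (p_key_id, next_chunk, next_start, next_start + length(piece), piece);

  RETURN next_chunk;
END;
$$ LANGUAGE plpgsql;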

Thanks





Re: TEXT column > 1Gb

2023-04-12 Thread Joe Carlson
I’m curious what you learned. I’ve been tripping over the buffer allocation 
issue when either splitting input text into chunks or aggregating chunks in 
selects. I’ve decided that I need to move this to the client side.

The use case is genomics. Extracting substrings is common. So going to chunked 
storage makes sense.

And we have a menagerie of code to deal with. Legacy Perl for loading. Clients 
in Perl, Java, node, python. A whole zoo. 

Thanks

> On Apr 11, 2023, at 10:51 AM, Rob Sargent  wrote:
> 
> 
>> On 4/11/23 11:41, Joe Carlson wrote:
>> Hello,
>> [Joe’s original message, quoted in full above, snipped]
>> 
> I've hit this same limitation in Java (with write to db).  What is your stack 
> in this case?  Not sure my solution applies.
> 
> 


Re: TEXT column > 1Gb

2023-04-12 Thread Joe Carlson
I’ve certainly thought about using a different representation. A factor of 2x 
would be good, for a while anyway. For nucleotide sequence we’d need a 
10-character alphabet (A, C, G, T, N and the lower case forms for representing 
’soft masked’ sequence*), so it would be 2 bases/byte. (Proteins are not nearly 
so long, so straight storage is simpler.) But that would be a bigger change on 
the client side than storing in chunks, so I think chunking is the way to go.
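
For the record, even a plpgsql sketch shows what the 4-bit packing would look 
like (2 bases/byte for the 10-symbol alphabet; purely an illustration, not 
something to run in production, and it would need a matching unpack function):

CREATE FUNCTION pack_seq(seq TEXT) RETURNS BYTEA AS $$
DECLARE
  alphabet CONSTANT TEXT := 'ACGTNacgtn';  -- assumed 10-symbol alphabet
  packed BYTEA := decode(repeat('00', (length(seq) + 1) / 2), 'hex');
  code   INTEGER;
BEGIN
  FOR i IN 1 .. length(seq) LOOP
    code := position(substr(seq, i, 1) IN alphabet) - 1;  -- 0 .. 9
    IF code < 0 THEN
      RAISE EXCEPTION 'unexpected symbol % at position %', substr(seq, i, 1), i;
    END IF;
    -- two 4-bit codes per byte: odd positions fill the high nibble
    packed := set_byte(packed, (i - 1) / 2,
                       get_byte(packed, (i - 1) / 2) |
                       CASE WHEN i % 2 = 1 THEN code << 4 ELSE code END);
  END LOOP;
  RETURN packed;
END;
$$ LANGUAGE plpgsql IMMUTABLE;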

We’re working with plant genomes which, compared to human chromosomes, are 
HUGE. One chromosome of fava bean is over a gig, and pine tree is another 
monster. This, together with the fact that sequence data collection and 
assembly have improved so much in the past couple of years, has forced us to 
rethink a lot of our data storage assumptions. 

* for those curious: especially in plants, much of the sequence consists of 
repetitive elements that are remnants of ancient viruses, simple repeats and 
the like. People who want to identify particular functional components in a 
genome typically do not want to search against this sequence but restrict 
searching to coding regions. But the repetitive sequence is still important and 
we need to keep it.

> On Apr 12, 2023, at 10:04 AM, Mark Dilger  
> wrote:
> 
> 
> 
>> On Apr 12, 2023, at 7:59 AM, Joe Carlson  wrote:
>> 
>> The use case is genomics. Extracting substrings is common. So going to 
>> chunked storage makes sense.
> 
> Are you storing nucleotide sequences as text strings?  If using the simple 
> 4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using 
> a nucleotide code 16-character alphabet you can still get two bases per byte. 
>  An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and 
> so forth.  Such a representation might allow you to store sequences two or 
> four times longer than the limit you currently hit, but then you are still at 
> an impasse.  Would a factor of 2x or 4x be enough for your needs? 
> 
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> 





Re: TEXT column > 1Gb

2023-04-12 Thread Joe Carlson


> On Apr 12, 2023, at 12:21 PM, Rob Sargent  wrote:
> 
> On 4/12/23 13:02, Ron wrote:
>> Must the genome all be in one big file, or can you store them one line per 
>> table row?

The assumption in the schema I’m using is one chromosome per record. Chromosomes 
are typically strings of contiguous sequence (A, C, G, or T) separated by gaps 
(N) of approximately known, or completely unknown, size. In the past this has 
not been a problem, since sequenced chromosomes were maybe 100 megabases. But 
sequencing is better now, thanks to the technology improvements, and we are 
tackling more complex genomes, so gigabase chromosomes are common. 

A typical use case might be someone interested in seeing whether they can 
identify the regulatory elements (the on or off switches) of a gene. The 
protein-coding part of a gene can be predicted pretty reliably, but the 
upstream untranslated region and regulatory elements are tougher. So they might 
come to our web site and want to extract the 5 kb of sequence before the start 
of the gene and look for some of the common motifs that signify a protein 
binding site. Being able to quickly pull out a substring of the genome is what 
we need to drive the web app.
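
With a chunked-table substring function like the sketch earlier in the thread, 
that kind of request could be a single query (the table and column names here 
are made up for illustration):

-- 5 kb immediately upstream of a gene’s start coordinate
SELECT getTextRange(g.chrom_key, g.tss - 5000, g.tss) AS upstream_5kb
FROM genes g
WHERE g.gene_id = 'some_gene_of_interest';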
> 
> Not sure what OP is doing with plant genomes (other than some genomics) but 
> the tools all use files and pipelines of sub-tools.  In and out of tuples 
> would be expensive.  Very, very little "editing" is done in the usual "update 
> table set val where id" sense.

Yeah, it’s basically a warehouse. Stick data in, but then make all the 
connections between the functional elements, their products and the predictions 
on the products. It’s definitely more than a document store, and we require a 
relational database.
> 
> Lines in a vcf file can have thousands of columns of nasty, cryptic garbage 
> data that only really makes sense to tools, not a reader.  Highly denormalized 
> of course.  (Btw, I hate sequencing :) )

Imagine a discipline where some beleaguered grad student has to get something 
out the door by the end of the term. It gets published and the rest of the 
community says GREAT! We have a standard! Then the abuse of the standard 
happens. People who specialize in bioinformatics know just enough computer 
science, statistics and molecular biology to annoy experts in three different 
fields.



custom background worker task for \copy?

2023-06-02 Thread Joe Carlson
Hi Folks,

I’m writing a little utility for dumping parts of tables into files which will 
later be slurped into another primary instance. The source is our referential 
data (big, and subject to heavy loads when adding data); the destination is a 
smaller version used to drive our web app (portable, less prone to lags).

Yes, a replication strategy could work, but since the web app version is so 
much smaller (10% of the size) I thought a partial snapshot would be easier to 
manage.

I have SQL that does it with \copy (select * from  where…) … And that is 
fine. But it would be nice to be able to run the \copy commands in parallel. So 
I was thinking of writing a background worker. 

Never having done that before, I’m curious:
1) Is a background worker that I can execute in parallel appropriate for this 
job?
2) Are there non-trivial examples of background workers out there to copy and 
learn from?
3) Or will doing multiple \copy’s in parallel be of no benefit? Since pg_dump 
and pg_restore have the option of running multiple jobs in parallel, I thought 
the answer is that it should help.

Thanks

Joe

Re: custom background worker task for \copy?

2023-06-03 Thread Joe Carlson
Thanks. Yes, an option. Was considering that as well. 

Joe

> On Jun 2, 2023, at 9:41 PM, Laurenz Albe  wrote:
> 
> On Fri, 2023-06-02 at 11:36 -0700, Joe Carlson wrote:
>> I have SQL that does it with \copy (select * from  where…) … And that 
>> is fine.
>> But it would be nice to be able to run the \copy commands in parallel.
>> So I was thinking of writing a background worker.
> 
> Why don't you go the easy way of using several database connections to run the
> parallel COPY statements?
> 
> Yours,
> Laurenz Albe
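
For anyone finding this later, a minimal sketch of that approach: each \copy 
below would run in its own psql session, all at the same time, and the 
predicates and file names are just placeholders.

\copy (SELECT * FROM main_table WHERE key_id % 3 = 0) TO 'part0.copy'
\copy (SELECT * FROM main_table WHERE key_id % 3 = 1) TO 'part1.copy'
\copy (SELECT * FROM main_table WHERE key_id % 3 = 2) TO 'part2.copy'

Range predicates on an indexed key would avoid scanning the table once per 
partition, but the idea is the same: the parallelism comes from the connections, 
not from a background worker.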