> On Apr 12, 2023, at 12:21 PM, Rob Sargent <robjsarg...@gmail.com> wrote:
> 
> On 4/12/23 13:02, Ron wrote:
>> Must the genome all be in one big file, or can you store them one line per 
>> table row?

The assumption in the schema I’m using is 1 chromosome per record. Chromosomes 
are typically strings of continuous sequence (A, C, G, or T) separated by gaps 
(N) of approximately known or completely unknown size. In the past this has 
not been a problem, since sequenced chromosomes were maybe 100 megabases. But 
sequencing technology has improved and people are tackling more complex 
genomes, so gigabase chromosomes are now common. 
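
To make the layout concrete, here is a minimal sketch of how the contiguous 
stretches and the N gaps in one chromosome string could be located. The 
function names are hypothetical, not part of the schema described above, and 
the coordinates are assumed to be 0-based, half-open Python offsets.

```python
import re

def contig_spans(chrom_seq):
    """Return (start, end) offsets of each run of non-N characters."""
    return [m.span() for m in re.finditer(r"[^Nn]+", chrom_seq)]

def gap_spans(chrom_seq):
    """Return (start, end) offsets of each run of N (gap) characters."""
    return [m.span() for m in re.finditer(r"[Nn]+", chrom_seq)]

# Toy example: two 4-base contigs separated by a 4-base gap.
toy_chrom = "ACGTNNNNGGCC"
contigs = contig_spans(toy_chrom)
gaps = gap_spans(toy_chrom)
```

On a gigabase string this scans the whole record, so in practice the spans 
would probably be computed once at load time rather than per request.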

A typical use case might be from someone interested in seeing if they can 
identify the regulatory elements (the on or off switches) of a gene. The 
protein coding part of a gene can be predicted pretty reliably, but the 
upstream untranslated region and regulatory elements are tougher. So they might 
come to our web site and want to extract the 5 kb bit of sequence before the 
start of the gene and look for some of the common motifs that signify a protein 
binding site. Quickly pulling out a substring of the genome to drive a web app 
is something we want to be able to do. 
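
The use case above can be sketched as follows. This is an illustration under 
stated assumptions, not the site's actual code: gene coordinates are assumed 
to be 1-based and inclusive, the strand is assumed to be "+" or "-", and the 
function names and the example motif are hypothetical.

```python
import re

# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def upstream_region(chrom_seq, start, end, strand="+", length=5000):
    """Return up to `length` bases immediately upstream of a gene.

    `start` and `end` are assumed 1-based inclusive chromosome coordinates
    with start <= end. For a minus-strand gene, upstream lies past `end`
    and the result is reverse-complemented so it reads 5' to 3'.
    """
    if strand == "+":
        lo = max(0, start - 1 - length)  # clamp at the chromosome start
        return chrom_seq[lo:start - 1]
    if strand == "-":
        # Slicing past the end of the string is safe and simply truncates.
        return reverse_complement(chrom_seq[end:end + length])
    raise ValueError("strand must be '+' or '-'")

def find_motif(seq, motif):
    """Return 0-based offsets of every (possibly overlapping) exact match."""
    pattern = "(?=" + re.escape(motif) + ")"
    return [m.start() for m in re.finditer(pattern, seq)]

# Toy example: a 3-base "gene" (GGG) at positions 7-9 of a 12-base chromosome.
toy_chrom = "AAACCCGGGTTT"
plus_upstream = upstream_region(toy_chrom, 7, 9, "+", length=3)
minus_upstream = upstream_region(toy_chrom, 7, 9, "-", length=3)
```

Real binding-site motifs are usually degenerate, so an exact-match search like 
`find_motif` is only a starting point.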
> 
> Not sure what OP is doing with plant genomes (other than some genomics) but 
> the tools all use files and pipelines of sub-tools.  In and out of tuples 
> would be expensive.  Very, very little "editing" done in the usual "update 
> table set val where id" sense.

Yeah, it’s basically a warehouse. Stick data in, but then make all the 
connections between the functional elements, their products and the predictions 
on the products. It’s definitely more than a document store and we require a 
relational database.
> 
> Lines in a vcf file can have thousands of columns of nasty, cryptic garbage 
> data that only really makes sense to tools, reader.  Highly denormalized of 
> course.  (Btw, I hate sequencing :) )

Imagine a discipline where some beleaguered grad student has to get something 
out the door by the end of the term. It gets published and the rest of the 
community say GREAT! we have a standard! Then the abuse of the standard 
happens. People who specialize in bioinformatics know just enough computer 
science, statistics and molecular biology to annoy experts in three different 
fields.
