Re: genomic locus

Gene Selkov Fri, 22 Dec 2017 15:39:07 -0800

> On Dec 22, 2017, at 1:53 AM, Teodor Sigaev <[email protected]> wrote:
> 
> Hmm, would you try to implement separate type for querying? Similar to 
> tsquery, lquery (for ltree), jsquery etc.


That sounds like a good idea if I want to make an app that will only be 
accessed through a purpose-built front end. Now I’m messing with lots of data, 
making numerous small one-off experiments. Since all of that stuff consumes my 
brain power and keystrokes, I want to minimize all the ancillary stuff or at 
least make it invisible. But I’ll probably go ahead and add a separate 
query-friendly type  that if nothing else helps.

I think I can wrangle this type into GiST just by tweaking consistent(), 
union(), and picksplit(), if I manage to express my needs in C without breaking 
too many things. My first attempt segfaulted.

Here’s my plan of attack. I think that by setting a union of inconsistent loci 
(those on different contigs) to [0, MAX_INT], I will expose such union to a 
huge penalty, even without having to do anything with the current penalty(). No 
query will go there.

The required change to consistent() is obvious: contigs do not match — go away.

The situation with picksplit() may be more tricky; I can’t imagine all possible 
consequences until I’ve seen how it works. Currently, with simple intervals, it 
sorts them by center points and splits the sorted list in half. I have already 
changed the internal sort function, picksplit_item_cmp(), to make sure the data 
are sorted on the contig first, then by center point (or any other geometric 
feature). I am thinking about splitting the list at the first inconsistent 
contig, sending the consistent first part with its well-defined geometry to the 
left page, and filling the right page with whatever remains. If the right page 
has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should 
be again picked for splitting at the next iteration (if I understand the 
algorithm correctly).

If all goes to plan, I will end up with an index tree partitioned by contig at 
the top level and geometrically down from there. That will be as close as I can 
get to an array of config-specific indices, without having to store data in 
separate tables.

What do you think of that?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have a low-level technical question. Because I can’t anticipate the maximum 
length of contig names (and do not want to waste space), I have made the new 
locus type a varlena, like this:

#include "utils/varlena.h"

typedef struct LOCUS
{
  int32 l_len_; /* varlena header (do not touch directly!) */
  int32 start;
  int32 end;
  char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

That flexible array member messes with me every time I need to copy it while 
deriving a new locus object from an existing one (or from a pair). What I ended 
up doing is this:

  LOCUS  *l = PG_GETARG_LOCUS_P(0);
  LOCUS  *new_locus;
  char   *contig;
  int    size;
        
  new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
  contig = pstrdup(l->contig); // need this to determine the length of contig 
name at runtime
  size = LOCUS_SIZE(contig);
  SET_VARSIZE(new_locus, size);
  strcpy(new_locus->contig, contig);

Is there a more direct way to clone a varlena structure (possibly assigning an 
differently-sized contig to it)? One that is also memory-safe?


Thank you,

—Gene


> Gene Selkov wrote:
>>> On Dec 17, 2017, at 7:57 PM, Robert Haas <[email protected] 
>>> <mailto:[email protected]> <mailto:[email protected] 
>>> <mailto:[email protected]>>> wrote:
>>> 
>>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <[email protected] 
>>> <mailto:[email protected]> <mailto:[email protected] 
>>> <mailto:[email protected]>>> wrote:
>>>> I need a data type to represent genomic positions, which will consist of a
>>>> string and a pair of integers with interval logic and access methods. Sort
>>>> of like my seg type, but more straightforward.
>>> 
>>> Have you thought about just using a composite type?
>> Yes, I have. That is sort of what I have been doing; a composite type 
>> certainly gets the job done but I don’t feel it reduces query complexity, at 
>> least from the user’s point of view. Maybe I don’t know enough.
>> Here’s an example of how I imagine a composite genomic locus (conventionally 
>> represented as text ‘:’ integer ‘-‘ integer):
>> CREATE TYPE locus AS (contig text, coord int4range);
>> CREATE TABLE test_locus (
>>   pos locus,
>>   ref text,
>>   alt text,
>>   id text
>> );
>> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
>> \copy test_locus from test_locus.tab
>> Where test_locus.tab has stuff like:
>> (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
>> (chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2
>> (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
>> (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
>> (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
>> (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
>> (chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2
>> (chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2
>> (chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2
>> (chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2
>> When the table is loaded, I can pull the subset shown above with this query:
>> SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
>> '[178916937, 178916948]’;
>>               pos               |    ref    | alt |      id
>> --------------------------------+-----------+-----+---------------
>>  (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
>>  (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
>>  . . . .
>> So far so good. It gets the job done. However, it is only a small step 
>> towards a fully encapsulated, monolithic type I want it to be. The above 
>> query It is marginally better than its atomic-type equivalent:
>> SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
>> least(stop, 178916948);
>>  contig |   start   |   stop    |    ref    | alt |      id
>> --------+-----------+-----------+-----------+-----+---------------
>>  chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
>>  chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>>  . . . .
>> and it requires addition syntax transformations steps to go from 
>> conventional locus representation 'chr3:178916937-178916940' to composite 
>> '(chr3,"[178916937,178916940]”)’ and back.
>> Of course, the relative benefits of partial encapsulation I achieve by 
>> bundling text with int4range accumulate, compared to (text, int4, int4), as 
>> queries grow more complex. But because the elements of a composite type 
>> still require a separate query term for each of them (unless there is some 
>> magic I am not aware of), the complexity of a typical query I need to run 
>> exceeds my feeble sight-reading capacity. I want things that are 
>> conceptually simple to be expressed in simple terms, if possible.
>> Like so:
>> CREATE EXTENSION locus;
>> CREATE TABLE test_locus (
>>   pos locus,
>>   ref text,
>>   alt text,
>>   id text
>> );
>> \copy test_locus from data/oncomine.hotspot.tab
>> SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
>>            pos            |    ref    | alt |      id
>> --------------------------+-----------+-----+---------------
>>  chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
>>  chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>>  chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
>>  chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
>>  chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
>>  chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
>>  chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
>>  chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
>>  chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
>>  chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
>> (10 rows)
>> I have encountered some pesky geometry / indexing problems while building 
>> this extension (https://github.com/selkovjr/locus), but I hope I can solve 
>> them at least at the level afforded by the composite type, while keeping the 
>> clean interface of a monolithic type. I understand I could probably achieve 
>> the same cleanliness by defining functions and operators over the complex 
>> type, but by the time I’m done with that, will I have coded about the same 
>> amount of stuff as required to build an extended type?
>> Regards,
>> —Gene
> 
> -- 
> Teodor Sigaev                                   E-mail: [email protected] 
> <mailto:[email protected]>
>                                                   WWW: http://www.sigaev.ru/ 
> <http://www.sigaev.ru/>

Re: genomic locus

Reply via email to