> On Dec 22, 2017, at 1:53 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: > > Hmm, would you try to implement separate type for querying? Similar to > tsquery, lquery (for ltree), jsquery etc.
That sounds like a good idea if I want to make an app that will only be accessed through a purpose-built front end. Now I’m messing with lots of data, making numerous small one-off experiments. Since all of that stuff consumes my brain power and keystrokes, I want to minimize all the ancillary stuff or at least make it invisible. But I’ll probably go ahead and add a separate query-friendly type that if nothing else helps. I think I can wrangle this type into GiST just by tweaking consistent(), union(), and picksplit(), if I manage to express my needs in C without breaking too many things. My first attempt segfaulted. Here’s my plan of attack. I think that by setting a union of inconsistent loci (those on different contigs) to [0, MAX_INT], I will expose such union to a huge penalty, even without having to do anything with the current penalty(). No query will go there. The required change to consistent() is obvious: contigs do not match — go away. The situation with picksplit() may be more tricky; I can’t imagine all possible consequences until I’ve seen how it works. Currently, with simple intervals, it sorts them by center points and splits the sorted list in half. I have already changed the internal sort function, picksplit_item_cmp(), to make sure the data are sorted on the contig first, then by center point (or any other geometric feature). I am thinking about splitting the list at the first inconsistent contig, sending the consistent first part with its well-defined geometry to the left page, and filling the right page with whatever remains. If the right page has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should be again picked for splitting at the next iteration (if I understand the algorithm correctly). If all goes to plan, I will end up with an index tree partitioned by contig at the top level and geometrically down from there. That will be as close as I can get to an array of config-specific indices, without having to store data in separate tables. What do you think of that? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I have a low-level technical question. Because I can’t anticipate the maximum length of contig names (and do not want to waste space), I have made the new locus type a varlena, like this: #include "utils/varlena.h" typedef struct LOCUS { int32 l_len_; /* varlena header (do not touch directly!) */ int32 start; int32 end; char contig[FLEXIBLE_ARRAY_MEMBER]; } LOCUS; #define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str)) That flexible array member messes with me every time I need to copy it while deriving a new locus object from an existing one (or from a pair). What I ended up doing is this: LOCUS *l = PG_GETARG_LOCUS_P(0); LOCUS *new_locus; char *contig; int size; new_locus = (LOCUS *) palloc0(sizeof(*new_locus)); contig = pstrdup(l->contig); // need this to determine the length of contig name at runtime size = LOCUS_SIZE(contig); SET_VARSIZE(new_locus, size); strcpy(new_locus->contig, contig); Is there a more direct way to clone a varlena structure (possibly assigning an differently-sized contig to it)? One that is also memory-safe? Thank you, —Gene > Gene Selkov wrote: >>> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmh...@gmail.com >>> <mailto:robertmh...@gmail.com> <mailto:robertmh...@gmail.com >>> <mailto:robertmh...@gmail.com>>> wrote: >>> >>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selko...@gmail.com >>> <mailto:selko...@gmail.com> <mailto:selko...@gmail.com >>> <mailto:selko...@gmail.com>>> wrote: >>>> I need a data type to represent genomic positions, which will consist of a >>>> string and a pair of integers with interval logic and access methods. Sort >>>> of like my seg type, but more straightforward. >>> >>> Have you thought about just using a composite type? >> Yes, I have. That is sort of what I have been doing; a composite type >> certainly gets the job done but I don’t feel it reduces query complexity, at >> least from the user’s point of view. Maybe I don’t know enough. >> Here’s an example of how I imagine a composite genomic locus (conventionally >> represented as text ‘:’ integer ‘-‘ integer): >> CREATE TYPE locus AS (contig text, coord int4range); >> CREATE TABLE test_locus ( >> pos locus, >> ref text, >> alt text, >> id text >> ); >> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord)); >> \copy test_locus from test_locus.tab >> Where test_locus.tab has stuff like: >> (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2 >> (chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2 >> (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2 >> (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2 >> (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2 >> (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2 >> (chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2 >> (chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2 >> (chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2 >> (chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2 >> When the table is loaded, I can pull the subset shown above with this query: >> SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && >> '[178916937, 178916948]’; >> pos | ref | alt | id >> --------------------------------+-----------+-----+--------------- >> (chr3,"[178916937,178916941)") | GAA | | CHP2_PIK3CA_2 >> (chr3,"[178916939,178916949)") | AGAAAAGAT | | CHP2_PIK3CA_2 >> . . . . >> So far so good. It gets the job done. However, it is only a small step >> towards a fully encapsulated, monolithic type I want it to be. The above >> query It is marginally better than its atomic-type equivalent: >> SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= >> least(stop, 178916948); >> contig | start | stop | ref | alt | id >> --------+-----------+-----------+-----------+-----+--------------- >> chr3 | 178916937 | 178916940 | GAA | | CHP2_PIK3CA_2 >> chr3 | 178916939 | 178916948 | AGAAAAGAT | | CHP2_PIK3CA_2 >> . . . . >> and it requires addition syntax transformations steps to go from >> conventional locus representation 'chr3:178916937-178916940' to composite >> '(chr3,"[178916937,178916940]”)’ and back. >> Of course, the relative benefits of partial encapsulation I achieve by >> bundling text with int4range accumulate, compared to (text, int4, int4), as >> queries grow more complex. But because the elements of a composite type >> still require a separate query term for each of them (unless there is some >> magic I am not aware of), the complexity of a typical query I need to run >> exceeds my feeble sight-reading capacity. I want things that are >> conceptually simple to be expressed in simple terms, if possible. >> Like so: >> CREATE EXTENSION locus; >> CREATE TABLE test_locus ( >> pos locus, >> ref text, >> alt text, >> id text >> ); >> \copy test_locus from data/oncomine.hotspot.tab >> SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948'; >> pos | ref | alt | id >> --------------------------+-----------+-----+--------------- >> chr3:178916937-178916940 | GAA | | CHP2_PIK3CA_2 >> chr3:178916939-178916948 | AGAAAAGAT | | CHP2_PIK3CA_2 >> chr3:178916940-178916941 | G | A | CHP2_PIK3CA_2 >> chr3:178916943-178916944 | A | G | CHP2_PIK3CA_2 >> chr3:178916943-178916946 | AAG | | CHP2_PIK3CA_2 >> chr3:178916943-178916952 | AAGATCCTC | | CHP2_PIK3CA_2 >> chr3:178916944-178916945 | A | G | CHP2_PIK3CA_2 >> chr3:178916945-178916946 | G | C | CHP2_PIK3CA_2 >> chr3:178916945-178916946 | G | T | CHP2_PIK3CA_2 >> chr3:178916945-178916948 | GAT | | CHP2_PIK3CA_2 >> (10 rows) >> I have encountered some pesky geometry / indexing problems while building >> this extension (https://github.com/selkovjr/locus), but I hope I can solve >> them at least at the level afforded by the composite type, while keeping the >> clean interface of a monolithic type. I understand I could probably achieve >> the same cleanliness by defining functions and operators over the complex >> type, but by the time I’m done with that, will I have coded about the same >> amount of stuff as required to build an extended type? >> Regards, >> —Gene > > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > <mailto:teo...@sigaev.ru> > WWW: http://www.sigaev.ru/ > <http://www.sigaev.ru/>