Re: [HACKERS] Adding a suffix array index

Hannu Krosing Fri, 19 Nov 2004 04:40:11 -0800

On R, 2004-11-19 at 12:42, Troels Arvin wrote:
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions.


I cant answer your immediate questions, just rant on general issues ;)

> Part of my current code concerns packing DNA characters: As the alphabet 
> of DNA strings is very small (four characters), it seems like a 
> straigt-forward optimization to store each character in two bits. 

My advice would be to get it to work first, oprimize later.

Thus I guess you could start by storing DNA sequences as character
strings.

> A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...

Then even more so - get the novel and generally interesting part
(indexing huge arrays) right first, and optimise for space (compressing
4 chars into 1) later.

You could do this 4->1 compression when storing the string into tuple,
but I strongly recommend doing actual work (indexing/searching) at a
level that C directly supports (i.e. bytes/characters)

This enables you to get the basics right first without distraction from
all bit-shifting inside bytes. A good tuned algorithm will almost
certainly offset the 4 time space disadvantage.

...

> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".

If you store your sequences as strings, you may try to use trigrams (or
modify them to 4,5,6 or 7-grams ;) to get some feel how that works.

trigram module is in contrib/pg_trgm.

------------
Hannu


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Adding a suffix array index

Reply via email to