On R, 2004-11-19 at 12:42, Troels Arvin wrote: > The basic parts of the type are pretty much done. Those interested may > have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the > code organization needs some clean-up). The basic type implementation > should be improved by adding more string functions and by implementing a > set of specialized selectivity functions.
I cant answer your immediate questions, just rant on general issues ;) > Part of my current code concerns packing DNA characters: As the alphabet > of DNA strings is very small (four characters), it seems like a > straigt-forward optimization to store each character in two bits. My advice would be to get it to work first, oprimize later. Thus I guess you could start by storing DNA sequences as character strings. > A warning: This is my first C project, so please > don't laugh too much (publicly) if you find strange constructs in my code... Then even more so - get the novel and generally interesting part (indexing huge arrays) right first, and optimise for space (compressing 4 chars into 1) later. You could do this 4->1 compression when storing the string into tuple, but I strongly recommend doing actual work (indexing/searching) at a level that C directly supports (i.e. bytes/characters) This enables you to get the basics right first without distraction from all bit-shifting inside bytes. A good tuned algorithm will almost certainly offset the 4 time space disadvantage. ... > My first and most immediate goal is to support efficient answering of a > question like "which rows contain the sequence TTGACCACTTG in column foo?". If you store your sequences as strings, you may try to use trigrams (or modify them to 4,5,6 or 7-grams ;) to get some feel how that works. trigram module is in contrib/pg_trgm. ------------ Hannu ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings