Hi there!

I'm trying to implement a custom indexing system using the GiST index framework, and I have a few questions. Basically, I'm trying to implement a system that allows me to search across a set of 64 bit integers by hamming distance. This is intended to be used in searching for similar images, where the 64 bit integer is actually a phash <http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html> of an image, and similarity between two images is reflected in the hamming distance between two integers.

Anyways, The appropriate approach here is to use something called a BK tree <http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>, for which I've written some test code <https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27> and I think I have a decent grip of (my test code seems to work, in any event).

That said:

 * Is there a /simple/ piece of example-code for implementing a GiST
   index for a single index? I've been looking through the /contrib/
   directory, particularly the /contrib/btree_gist/ files, but it looks
   like 90% of the complexity there is related to supporting all the
   column types Postgres has, rather then actually tied to producing a
   functional index.
 * Once I have something compiling, how can I check to be sure that I'm
   actually using the indexing module I created? I can enable my
   compiled extension using `CREATE EXTENSION`, but is there an option
   for `CREATE INDEX test_index ON testing USING gist (val);` that lets
   me specify *which* GiST index is actually employed? Is this even a
   valid question?
   This is probably something that's available in one of the system
   tables somewhere, but I haven't had much luck with google finding
   out where.
 * Testing: What's the appropriate way to examine the generated tree
   structure of the index? I certainly went through a few bugs with my
   test tree system (and that was in python!). Is there any way to
   examine the index structure for debugging purposes?
 * Also, it looks like `ereport()` is the proper way to emit debugging
   information. Is this correct?
 * In that vein, is there any way to have information that is on the
   operations of an entire query? Looking at the number of tree nodes
   touched for a scan would be nice (and I would not be surprised if
   there is already a facility for it).

Project code is here <https://github.com/fake-name/pg-spgist_hamming> if anyone is interested, any help would be great. I have very little idea what I'm doing.

Thanks,
Connor

Reply via email to