On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote:
> Hi! > > On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf < > conn...@imaginaryindustries.com> wrote: > >> Ok, I've managed to get my custom index working. >> > > Good! > > It's all on github here: https://github.com/fake-name/pg-spgist_hamming, >> if anyone else needs a fuzzy-image searching system >> that can integrate into postgresql.. >> >> It should be a pretty good basis for anyone else to use if they want to >> implement a SP-GiST index too. >> > > I took a look at the code, and I feel myself a bit confused :) > It appears that you're indexing int8 values. That seems like unrealistic > short representation for image signature. > It is a int8, and nope, it's a surprisingly robust and functional signature. There's a document describing the hashing mechanism here: http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html Functionally, the procedure is relatively simple: - Convert to greyscale - Resize to intermediate resolution (32x32 is common) - Perform DCT on 32x32 image. - Crop 32x32 image to 8x8 by throwing away the high-frequency components - Threshold the 8x8 image by it's average - Serialize the 64 binary values into a int8 In my case, the actual implementation is here: https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102 > Also, name of repository make me think that hamming distance would be used > to compare signatures. But after look at the code, I see that plain > absolute value of difference is used for that purpose. > > static double > getDistance(Datum v1, Datum v2) > { > int64_t a1 = DatumGetInt64(v1); > int64_t a2 = DatumGetInt64(v2); > int64_t diff = Abs(a1 - a2); > fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff); > return diff; > } > > For such notion of distance, you don't need a VP-tree or another complex > indexing. B-tree is quite enough in this case. Alternatively, distance > function is not what it meant to be. > > You're looking in the wrong place. https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the code you sent me, with some simplification to make it only work on single integers. Basically, before I started on my own stuff, I wanted to make sure I could at least implement a functional index using a much more basic structure. https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the actual BK-tree index, and it does indeed use hamming distance for the search metric: static int64_t f_hamming(int64_t a_int, int64_t b_int) { /* Compute number of bits that are not common between `a` and `b`. return value is a plain integer */ uint64_t x = (a_int ^ b_int); uint64_t ret = __builtin_popcountll (x); return ret; } (see https://github.com/fake-name/pg-spgist_hamming/blob/master/bktree/bktree.c#L38-L58 ). > It would be useful if you provide complete usage example of this > extension: from image to signature conversion to search queries. > Actual usage is done with this project: https://github.com/fake-name/IntraArchiveDeduplicator, which also has the older in-memory BK tree <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp> I've implemented, and it's actually used here <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173> . I also have unit tests that sit on top of this here <https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see all the files that are named "Test_db_BKTree...". Connor