On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi!
>
> On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <
> conn...@imaginaryindustries.com> wrote:
>
>> Ok, I've managed to get my custom index working.
>>
>
> Good!
>
> It's all on github here: https://github.com/fake-name/pg-spgist_hamming,
>> if anyone else needs a fuzzy-image searching system
>> that can integrate into postgresql..
>>
>> It should be a pretty good basis for anyone else to use if they want to
>> implement a SP-GiST index too.
>>
>
> I took a look at the code, and I feel myself a bit confused :)
> It appears that you're indexing int8 values.  That seems like unrealistic
> short representation for image signature.
>

It is a int8, and nope, it's a surprisingly robust and functional
signature. There's a document describing the hashing mechanism here:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Functionally, the procedure is relatively simple:

   - Convert to greyscale
   - Resize to intermediate resolution (32x32 is common)
   - Perform DCT on 32x32 image.
   - Crop 32x32 image to 8x8 by throwing away the high-frequency components
   - Threshold the 8x8 image by it's average
   - Serialize the 64 binary values into a int8

In my case, the actual implementation is here:
https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102


> Also, name of repository make me think that hamming distance would be used
> to compare signatures.  But after look at the code, I see that plain
> absolute value of difference is used for that purpose.
>
> static double
> getDistance(Datum v1, Datum v2)
> {
>     int64_t a1 = DatumGetInt64(v1);
>     int64_t a2 = DatumGetInt64(v2);
>     int64_t diff = Abs(a1 - a2);
>     fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
>     return diff;
> }
>
> For such notion of distance, you don't need a VP-tree or another complex
> indexing.  B-tree is quite enough in this case.  Alternatively, distance
> function is not what it meant to be.
>
>
You're looking in the wrong place.

https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the
code you sent me, with some simplification to make it only work on single
integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least
implement a functional index using a much more basic structure.

https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the
actual BK-tree index, and it does indeed use hamming distance for the
search metric:

static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);

return ret;
}
(see
https://github.com/fake-name/pg-spgist_hamming/blob/master/bktree/bktree.c#L38-L58
).


> It would be useful if you provide complete usage example of this
> extension: from image to signature conversion to search queries.
>



Actual usage is done with this project:
https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>
I've
implemented, and it's actually used here
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173>
.
I also have unit tests that sit on top of this here
<https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see
all the files that are named "Test_db_BKTree...".

Connor

Reply via email to