https://bugs.kde.org/show_bug.cgi?id=496238

            Bug ID: 496238
           Summary: Similarity search engine will not find effectively
                    identical images if minor variance exists (i.e.
                    contrast)
    Classification: Applications
           Product: digikam
           Version: 8.5.0
          Platform: Microsoft Windows
                OS: Microsoft Windows
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: Searches-Engine
          Assignee: digikam-bugs-n...@kde.org
          Reporter: carbonwer...@hotmail.com
  Target Milestone: ---


SUMMARY
In many cases I have multiple instances of a photo in an album, for example after
merging collections where one set was preprocessed (converted to another format,
auto-contrast adjusted, etc.). The similarity detection engine misses these
duplicates even with the similarity range set to an absurdly low value (50%).

STEPS TO REPRODUCE
1. Load several variants of an image into an album, e.g. copies that differ by a
few pixels of crop or by a contrast change (I am not sure which kinds of change
the engine is most sensitive to).
2. Select one of them, right-click, and choose 'Find Similar'.


OBSERVED RESULT
The search may return a subset of the 'duplicate' variants, all of them, or none,
apparently depending on which kind of variation the engine is most sensitive to.
Photos that are detected very easily by tools like Image Dedup are missed.

EXPECTED RESULT
While I don't expect the similarity engine to detect duplicates that are mirrored
or rotated, I would hope it could detect duplicates that are, for example,
geometrically identical but saved with a different compression type or color
treatment. I would certainly expect it to catch much more minor variations, such
as a small percentage change in contrast or brightness. Even where it catches
some, there are cases where I have 3-4 variants of the same image that look near
identical to a human, which the engine misses but which conventional dedup
software like AllDup finds rapidly. AllDup does not even have the advantage of
starting from a known source image: it performs an all-to-all comparison, which
is CPU intensive enough that checking via multiple hash routines might not be
feasible there, whereas it should be for a simple 'find similar' run on a single
photo.
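
As a rough illustration of why I would expect a mild contrast change to still
match (a sketch using the third-party Python Pillow and ImageHash modules purely
for demonstration; this is not digiKam's fingerprint code, and the file name is a
placeholder), a perceptual hash of an image and of a slightly contrast-adjusted
copy should differ by only a few bits:

    # Sketch only: third-party Pillow + ImageHash packages, not digiKam code.
    from PIL import Image, ImageEnhance
    import imagehash

    original = Image.open("photo.jpg")                        # placeholder file
    tweaked = ImageEnhance.Contrast(original).enhance(1.10)   # +10% contrast

    # ImageHash overloads '-' to return the Hamming distance between hashes;
    # for a small contrast change this is typically only a few bits out of 64.
    distance = imagehash.phash(original) - imagehash.phash(tweaked)
    print(distance)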

SOFTWARE/OS VERSIONS
Windows: Microsoft Windows (digiKam 8.5.0, as reported in the header above)
macOS: 
Linux/KDE Plasma: 
KDE Plasma Version: 
KDE Frameworks Version: 
Qt Version: 

ADDITIONAL INFORMATION
I understand that the devs' position here may be that this feature is meant for
true duplicate removal, where the photos are identical but have different names,
or where one is raw and another is a lossless compressed conversion. But in
reality, any merge of legacy content can bring in material that is fundamentally
identical to the eye. If the point of the similarity range is to permit variance
in color, crop, contrast, etc., it seems to be nowhere near as effective as it
should be.

I wonder if it might make sense to use several hash types for the initial
fingerprinting; open source modules for pHash, dHash, aHash, etc. are readily
available. Perhaps the search area (full image vs. center, so that cropping has
less impact) could also be made user selectable. Since this is a one-to-many test
rather than a many-to-many test, it would remain quick for the user, yet it would
be much more capable (read: it would miss far fewer of what most of us would
consider duplicate images).
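
To make that concrete, here is a minimal sketch (again using the third-party
Python Pillow and ImageHash modules as stand-ins; the threshold, function name
and file paths are illustrative assumptions, not digiKam code) of a one-to-many
'find similar' pass that combines several hash types and accepts a candidate if
any of them is close enough:

    # Sketch only: combines aHash, dHash and pHash from the third-party
    # ImageHash package; threshold and names are illustrative assumptions.
    from PIL import Image
    import imagehash

    HASHERS = {
        "ahash": imagehash.average_hash,
        "dhash": imagehash.dhash,
        "phash": imagehash.phash,
    }
    MAX_DISTANCE = 10  # max Hamming distance (out of 64 bits) to call a match

    def find_similar(reference_path, candidate_paths):
        ref = Image.open(reference_path)
        ref_hashes = {name: fn(ref) for name, fn in HASHERS.items()}
        matches = []
        for path in candidate_paths:
            img = Image.open(path)
            # One-to-many: len(candidates) * len(HASHERS) comparisons in total,
            # so stacking several hash types stays cheap for a single photo.
            if any(ref_hashes[name] - fn(img) <= MAX_DISTANCE
                   for name, fn in HASHERS.items()):
                matches.append(path)
        return matches

Something along those lines would stay quick because the reference hashes are
computed once and each candidate only needs a handful of 64-bit comparisons.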
