https://bugs.kde.org/show_bug.cgi?id=496238
Bug ID: 496238
Summary: Similarity search engine will not find effectively identical images if minor variance exists (e.g. contrast)
Classification: Applications
Product: digikam
Version: 8.5.0
Platform: Microsoft Windows
OS: Microsoft Windows
Status: REPORTED
Severity: normal
Priority: NOR
Component: Searches-Engine
Assignee: digikam-bugs-n...@kde.org
Reporter: carbonwer...@hotmail.com
Target Milestone: ---

SUMMARY
In many cases I have multiple instances of a photo in an album, for example where content was merged and one set was preprocessed (converted to another format, auto-contrast adjusted, etc.). The similarity detection engine misses these, even with the similarity threshold set to an absurdly low value (50%).

STEPS TO REPRODUCE
1. Load several variants of an image into an album: a crop differing by a few pixels, a contrast change, and so on. (I am not sure which kinds of variation the engine is most sensitive to.)
2. Select one, right-click, and choose 'Find similar'.

OBSERVED RESULT
The system may return a subset of the 'duplicate' variants, or all of them, or none, depending on whichever variable the engine cares about most. Photos that are easily detected via image-dedup tools are missed.

EXPECTED RESULT
I don't expect the similarity engine to detect duplicate images that are mirrored or rotated, but I would hope it could detect duplicates where, for example, one is the original and the other is geometrically identical but saved with a different compression type. I would certainly expect it to catch much more minor variations, such as a small percentage change in contrast or brightness. And while it does catch some, there are instances where I have 3-4 variants of the same image that appear near identical to a human and the system misses them, while conventional dedup software such as AllDup finds them rapidly. AllDup doesn't even have the advantage of starting from a known source image; it does an all-to-all comparison, which is so CPU-intensive that checking via multiple hash routines would not be feasible there, whereas it would be feasible for a simple 'find similar' search on a single photo.

SOFTWARE/OS VERSIONS
Windows:
macOS:
Linux/KDE Plasma:
KDE Plasma Version:
KDE Frameworks Version:
Qt Version:

ADDITIONAL INFORMATION
I understand that the position of the devs here may be that this feature is for true duplicate removal, where the photos are identical but have different names, or where one is raw and the other is a lossless compressed conversion. But the reality is that any merge of legacy content may bring in material that is fundamentally identical to the eye. If the point of a similarity range is to permit variance in color, crop, contrast, etc., it is not nearly as effective as it should be. It might make sense to use several hash types for the initial fingerprinting (open-source modules for pHash, dHash, aHash, etc. are available), and to let the user select the search area as full image vs. center, so that cropping would be less impactful. A sketch of the idea follows.
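As an illustration only (this is not digiKam's actual engine), here is a minimal sketch of the multi-hash fingerprinting idea using the open-source Python `imagehash` library; the file names and the distance threshold are hypothetical:

```python
# Minimal sketch of multi-hash near-duplicate detection with the
# open-source 'imagehash' library. A pair counts as a match if ANY
# hash type falls within the Hamming-distance threshold.
from PIL import Image
import imagehash

HASHERS = {
    "aHash": imagehash.average_hash,  # compares pixels to the image mean
    "dHash": imagehash.dhash,         # compares adjacent pixels (gradients)
    "pHash": imagehash.phash,         # low-frequency DCT coefficients
}

def is_near_duplicate(path_a, path_b, max_distance=8):
    """Compare two images with several perceptual hashes; any one
    within max_distance Hamming bits counts as a match."""
    img_a, img_b = Image.open(path_a), Image.open(path_b)
    for name, hasher in HASHERS.items():
        distance = hasher(img_a) - hasher(img_b)  # Hamming distance
        if distance <= max_distance:
            return True, name, distance
    return False, None, None

# Hypothetical example: an original and an auto-contrast-adjusted copy.
print(is_near_duplicate("original.jpg", "contrast_adjusted.jpg"))
```

With the default 64-bit hashes, a small contrast or brightness edit typically flips only a few bits, so a threshold somewhere around 5-10 bits tends to catch such variants while keeping false positives low.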
Since this is a one-to-many test rather than a many-to-many test, it would remain quick for the user, but it would be much more capable; read: it would miss far fewer of what most of us would consider duplicate images.
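To illustrate the one-to-many point with a hedged sketch (again using the Python `imagehash` library; all names here are hypothetical): if fingerprints are computed once when images are scanned, a single 'find similar' query costs one hash computation plus n Hamming comparisons, i.e. O(n), rather than the O(n^2) of an all-pairs dedup pass.

```python
# Sketch of a one-to-many similarity lookup against a precomputed index.
# build_index runs once at fingerprinting time; each query is then cheap.
from PIL import Image
import imagehash

def build_index(paths):
    """Precompute a pHash for every image in the album (done once)."""
    return {p: imagehash.phash(Image.open(p)) for p in paths}

def find_similar(query_path, index, max_distance=8):
    """Return (distance, path) for all indexed images within
    max_distance Hamming bits of the query image."""
    query_hash = imagehash.phash(Image.open(query_path))
    return sorted(
        (query_hash - h, p)
        for p, h in index.items()
        if query_hash - h <= max_distance
    )
```

Because the per-query cost stays linear in the album size, running several hash types per query (as suggested above) would only multiply that cost by a small constant.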