Part 2 of a 4- (or maybe 5-) part series on developing simple apps for photo management and viewing.

[ previously ... Part 1 described the justification and development of a very simple photo viewing app ]

The next issue to deal with is the runaway number of photos, and the amount of disk space they take up. I strongly suspect that is at least partly due to my casual (some would say "disorganized") approach to managing the photos, and the multiple computers they originated from and are kept on (my desktop, laptop, daughter's laptop, back-up disks, safe copies on other external drives, USB drives previously used to store / transfer folders of photos, etc.).

So the next step is to find and eliminate (or at least reduce) duplicated photos. Of course, I could simply Google "remove duplicate photos mac" and follow some of the 382,000 resulting links - but where's the fun in that :-)

At least some of those apps do, or claim to do, amazing things - find different-resolution or different-quality versions of the same photo, etc. - but I don't feel a need to look for those; I just need, initially at least, to find the simple, exact duplicates. To give some context, I have been using a sample subset of 16,000 of my approx 55,000 photos; these are mostly low/medium resolution (i.e. iPhone or old digital camera JPEGs, between 200 KB and 1.5 MB each). However, my new camera is rather more resource-hungry (JPEGs are 24 MB or so - hence the urgency to actually implement some of these ideas that I have been kicking around for a long time :-)

I have a variety of schemes in mind to speed up the process, though each of them needs to be verified for effectiveness, or indeed necessity.

The basic outline *was* (a rough sketch of steps 1 and 2 follows the list):

1. walk through to collect all folder names (i.e. the complete tree(s) within the folder(s) specified by the user)

2. visit each folder in turn to collect details (name, size) of all (relevant) files

2a. optimize by skipping folders/files that haven't changed since their info was previously collected

3. partition the files by size, and then reduce the list to just the potential duplicates (i.e. sizes that occur more than once)

4. further reduce by file signature (i.e. a small sample of say 12 bytes from pre-specified locations)

5. get the MD5 hash of the remaining files, and look for duplicates

6. present the data to the user (!?)
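
For the record, steps 1 and 2 are just the standard recursive folder walk plus "the detailed files". Something like the following untested sketch (handler names are mine, for illustration; error handling, hidden files, etc. left out):

  -- step 1: collect the full folder tree below pRoot
  function folderTree pRoot
     local tList, tSubs
     put pRoot into tList
     set the defaultFolder to pRoot
     put the folders into tSubs
     repeat for each line tSub in tSubs
        if tSub is ".." then next repeat
        put return & folderTree(pRoot & "/" & tSub) after tList
     end repeat
     return tList
  end folderTree

  -- step 2: collect size & full path for every file in each folder
  -- (returns lines of: size,fullPath)
  function fileSizes pFolderList
     local tInfo
     repeat for each line tFolder in pFolderList
        set the defaultFolder to tFolder
        repeat for each line tFile in the detailed files
           -- item 1 is the URL-encoded name, item 2 the size in bytes
           put item 2 of tFile & comma & tFolder & "/" & \
                 urlDecode(item 1 of tFile) & return after tInfo
        end repeat
     end repeat
     return tInfo
  end fileSizes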

However, some simple benchmarking suggested that this was unnecessarily complicated - i.e. I can again remove features, even before they have been specified or implemented. The task of detecting and avoiding redundant work in step 2a is not terribly complicated - but it's definitely the most brain-taxing part of the whole problem - and in any case, it won't apply the first time the app is used. So that part can be delayed at least until I find out how slow the process is - i.e. hopefully forever.

The need for MD5 hashes, rather than simply comparing the files in full, is also questionable. It turns out that calculating the MD5 hash of a file takes roughly 10x as long as comparing that file to another, identical one (i.e. the worst case for comparison - comparing to a differing file would complete more quickly). So step 5 can also be delayed (or avoided) until we determine how often we are likely to be matching larger sets of files.
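
For what it's worth, the direct comparison is nearly a one-liner in LiveCode. This sketch loads both files fully into memory, which is fine at photo sizes (less so for huge video files):

  -- true if the two files are byte-for-byte identical
  function sameFile pPathA, pPathB
     set the caseSensitive to true -- binary data, compare exactly
     return URL ("binfile:" & pPathA) is URL ("binfile:" & pPathB)
  end sameFile

  -- the MD5 route for contrast - md5Digest() is built in,
  -- but works out roughly 10x slower per file in my tests
  function fileDigest pPath
     return md5Digest(URL ("binfile:" & pPath))
  end fileDigest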

Similarly, step 4 can be delayed (or avoided) until we see how well the file size works as a partition - and it turns out to do a good job.

Of the 16,073 files, there are 14,652 different sizes; of these, 1,400 sizes have two matching files, 10 sizes have three files, and the remainder have only a single file.

And it turns out that all 1,410 of those size-groups are genuine duplicates - i.e. there are no cases of files having the same size without actually being identical - so size is a very effective discriminator for photo files. Even better, running this simplified algorithm over the 16,000-file sample takes about 20 seconds on my aging MacBook Pro. So I can indeed eliminate all those extra features in steps 2a, 4 and 5.
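
So the whole simplified pipeline is just: walk, collect sizes, bucket by size, confirm with a direct compare. Roughly this (again an untested sketch, assuming the fileSizes() and sameFile() helpers from the earlier sketches):

  -- group paths by file size, then confirm the real duplicates
  -- pSizeList: lines of size,fullPath ; returns lines of matching pairs
  function findDuplicates pSizeList
     local tBySize, tDups, tFirst
     repeat for each line tLine in pSizeList
        put (item 2 to -1 of tLine) & return after tBySize[item 1 of tLine]
     end repeat
     repeat for each key tSize in tBySize
        if the number of lines in tBySize[tSize] < 2 then next repeat
        -- same size, so check the candidates really are identical
        put line 1 of tBySize[tSize] into tFirst
        repeat for each line tOther in line 2 to -1 of tBySize[tSize]
           if sameFile(tFirst, tOther) then
              put tFirst & comma & tOther & return after tDups
           end if
        end repeat
     end repeat
     return tDups
  end findDuplicates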

Part 3 of this series will describe what I did for step 6 above - i.e. how to present this data to the user, how to make it easy to eliminate any duplicates found, and how to avoid making it easy to inadvertently delete files you shouldn't.

Part 4 will (probably) describe an app for removing uninteresting photos.

And Part 5 will (perhaps) describe whether, or how, I found it necessary to improve the image viewer app described in Part 1. The increase in average file size from 0.5 MB to 24 MB means the time to transition from one photo to the next has gone from "feels instant" to "hmmm, feels fairly quick". I'll decide from using the app regularly over the next week or two whether "fairly quick" is good enough, or whether it's worth implementing pre-caching of the adjacent photo(s) to get back the "instant" feel.
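
If I do go the pre-caching route, the sketch is simple enough - read (and hold) the neighbouring photo's data as soon as the current one is shown (handler and object names below are made up for illustration):

  local sNextData -- script-local: the pre-read photo data

  -- call right after displaying the current photo
  on preloadNext pNextPath
     put URL ("binfile:" & pNextPath) into sNextData
  end preloadNext

  -- then advancing to the next photo avoids the disk read
  on showNext
     set the text of image "photo" to sNextData
  end showNext

Whether that actually recovers the "instant" feel depends on whether the disk read or the JPEG decode is the slow part - something else to measure rather than guess.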

-- Alex.
P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any day now. RSN. Promise :-)
