Part 2 of a 4- (or maybe 5-) part series on developing simple apps for photo management and viewing.

[ previously ... Part 1 described the justification and development of a very simple photo viewing app ]

The next issue to deal with is the runaway number of photos, and the amount of disk space they take up. I strongly suspect that is at least partly due to my casual (some would say "disorganized") approach to managing the photos, and the multiple computers they originated from and are kept on (my desktop, laptop, daughter's laptop, back-up disks, safe copies on other external drives, USB drives previously used to store / transfer folders of photos, etc.).

So the next step is to find and eliminate (or at least reduce) duplicated photos. Of course, I could simply Google "remove duplicate photos mac" and follow some of the 382,000 resulting links - but where's the fun in that :-)

At least some of those apps do, or claim to do, amazing things - find different-resolution or different-quality versions of the same photo, etc. - but I don't feel a need to look for those; I just need, initially at least, to find the simple, exact duplicates. To give some context, I have been using a sample subset of 16,000 of my approx 55,000 photos; these are mostly low/medium resolution (i.e. iPhone or old digital camera JPEGs, between 200 KB and 1.5 MB each). However, my new camera is rather more resource-hungry (JPEGs are 24 MB or so - hence the urgency to actually implement some of these ideas that I have been kicking around for a long time :-)

I have a variety of schemes in mind to speed up the process, though each of them needs to be verified for effectiveness, or indeed necessity.

The basic outline *was* (a rough sketch of steps 1 and 2 follows the list):

1. walk through to collect all folder names (i.e. the complete tree(s) within the folder(s) specified by the user)

2. visit each folder in turn to collect details (name, size) of all (relevant) files

2a. optimize by skipping folders/files that haven't changed since their info was previously collected

3. partition the files by size, and then reduce the list to just the potential duplicates (i.e. sizes that occur more than once)

4. further reduce by file signature (i.e. a small sample of say 12 bytes from pre-specified locations)

5. get the MD5 hash of the remaining files, and look for duplicates

6. present the data to the user (!?)
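
For the record, steps 1 and 2 are just the standard recursive folder walk plus "the detailed files". Something like the following untested sketch (handler names are mine, for illustration; error handling, hidden files, etc. left out):

  -- step 1: collect the full folder tree below pRoot
  function folderTree pRoot
     local tList, tSubs
     put pRoot into tList
     set the defaultFolder to pRoot
     put the folders into tSubs
     repeat for each line tSub in tSubs
        if tSub is ".." then next repeat
        put return & folderTree(pRoot & "/" & tSub) after tList
     end repeat
     return tList
  end folderTree

  -- step 2: collect size & full path for every file in each folder
  -- (returns lines of: size,fullPath)
  function fileSizes pFolderList
     local tInfo
     repeat for each line tFolder in pFolderList
        set the defaultFolder to tFolder
        repeat for each line tFile in the detailed files
           -- item 1 is the URL-encoded name, item 2 the size in bytes
           put item 2 of tFile & comma & tFolder & "/" & \
                 urlDecode(item 1 of tFile) & return after tInfo
        end repeat
     end repeat
     return tInfo
  end fileSizes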

However, some simple benchmarking suggested that this was unnecessarily complicated - i.e. I can again remove features, even before they have been specified or implemented. The task of detecting and avoiding redundant work in step 2a is not terribly complicated - but it's definitely the most brain-taxing part of the whole problem - and in any case, it won't apply the first time the app is used. So that part can be delayed at least until I find out how slow the process is - i.e. hopefully forever.

The need for MD5 hashes, rather than simply comparing the files in full, is also questionable. It turns out that calculating the MD5 hash of a file takes roughly 10x as long as comparing that file to another, identical one (i.e. the worst case for comparison - comparing to a differing file would complete more quickly). So step 5 can also be delayed (or avoided) until we determine how often we are likely to be matching larger sets of files.
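
For what it's worth, the direct comparison is nearly a one-liner in LiveCode. This sketch loads both files fully into memory, which is fine at photo sizes (less so for huge video files):

  -- true if the two files are byte-for-byte identical
  function sameFile pPathA, pPathB
     set the caseSensitive to true -- binary data, compare exactly
     return URL ("binfile:" & pPathA) is URL ("binfile:" & pPathB)
  end sameFile

  -- the MD5 route for contrast - md5Digest() is built in,
  -- but works out roughly 10x slower per file in my tests
  function fileDigest pPath
     return md5Digest(URL ("binfile:" & pPath))
  end fileDigest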

Similarly, step 4 can be delayed (or avoided) until we see how well the file size works as a partition - and it turns out to do a good job.

Of the 16,073 files, there are 14,652 different sizes; of these, 1,400 sizes have two matching files, 10 sizes have three files, and the remainder have only a single file.

And it turns out that all 1,410 of those size-groups are genuine duplicates - i.e. there are no cases of files having the same size without actually being identical - so size is a very effective discriminator for photo files. Even better, running this simplified algorithm over the 16,000-file sample takes about 20 seconds on my aging MacBook Pro. So I can indeed eliminate all those extra features in steps 2a, 4 and 5.
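
So the whole simplified pipeline is just: walk, collect sizes, bucket by size, confirm with a direct compare. Roughly this (again an untested sketch, assuming the fileSizes() and sameFile() helpers from the earlier sketches):

  -- group paths by file size, then confirm the real duplicates
  -- pSizeList: lines of size,fullPath ; returns lines of matching pairs
  function findDuplicates pSizeList
     local tBySize, tDups, tFirst
     repeat for each line tLine in pSizeList
        put (item 2 to -1 of tLine) & return after tBySize[item 1 of tLine]
     end repeat
     repeat for each key tSize in tBySize
        if the number of lines in tBySize[tSize] < 2 then next repeat
        -- same size, so check the candidates really are identical
        put line 1 of tBySize[tSize] into tFirst
        repeat for each line tOther in line 2 to -1 of tBySize[tSize]
           if sameFile(tFirst, tOther) then
              put tFirst & comma & tOther & return after tDups
           end if
        end repeat
     end repeat
     return tDups
  end findDuplicates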

Part 3 of this series will describe what I did for step 6 above - i.e. how to present this data to the user, how to make it easy to eliminate any duplicates found, and how to avoid making it easy to inadvertently delete files you shouldn't.

Part 4 will (probably) describe an app for removing uninteresting photos.

And Part 5 will (perhaps) describe whether, or how, I found it necessary to improve the image viewer app described in Part 1. The increase in average file size from 0.5 MB to 24 MB means the time to transition from one photo to the next has gone from "feels instant" to "hmmm, feels fairly quick". I'll decide from using the app regularly over the next week or two whether "fairly quick" is good enough, or whether it's worth implementing pre-caching of the adjacent photo(s) to get back the "instant" feel.
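
If I do go the pre-caching route, the sketch is simple enough - read (and hold) the neighbouring photo's data as soon as the current one is shown (handler and object names below are made up for illustration):

  local sNextData -- script-local: the pre-read photo data

  -- call right after displaying the current photo
  on preloadNext pNextPath
     put URL ("binfile:" & pNextPath) into sNextData
  end preloadNext

  -- then advancing to the next photo avoids the disk read
  on showNext
     set the text of image "photo" to sNextData
  end showNext

Whether that actually recovers the "instant" feel depends on whether the disk read or the JPEG decode is the slow part - something else to measure rather than guess.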

-- Alex.
P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any day now. RSN. Promise :-)
