Update to Part 2.

Part 2 of this series described a scheme for collecting the information about duplicate photos. It included some justification for not bothering to collect or store md5Hash info for each (relevant) file - and it turned out that my investigative / benchmark code was misleading. There were two problems: one was a simple bug of mine, which led to much shorter apparent times for some operations; the other was a set of misleading timings caused by a significant performance improvement in LC8 compared to earlier versions.

After correcting for both of these, the justification became much weaker - but the conclusion remained unchanged.
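(For anyone who skipped Part 2: "hashing each file" in LiveCode amounts to something like the sketch below. This is a minimal illustration written for this post - not the actual benchmark code - and folderHashes is just a name made up here.)

function folderHashes pFolder
   -- return one line per file: <name> tab <hex md5 of the file's contents>
   local tHashes, tData, tHex
   set the defaultFolder to pFolder
   repeat for each line tFile in the files
      if char 1 of tFile is "." then next repeat -- skip hidden files
      put URL ("binfile:" & pFolder & slash & tFile) into tData
      get binaryDecode("H*", md5Digest(tData), tHex) -- hex-encode the digest
      put tFile & tab & tHex & return after tHashes
   end repeat
   return tHashes
end folderHashes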


Part 3 - how to present this data to the user.

There are many possibilities, from the simplest (a long list of duplicate file path names - simple but unusable !!) to the complex (coloured context diffs of the directory structures).

If LC had a built-in "accordion" view, or a tree view with expanding/contracting disclosure triangles, I'd probably use that.

However, it doesn't (yet), so I decided to keep it really simple. But before describing it, let's talk a little about how and why all these duplicate photos arise in the first place.

1. You import a bunch of photos from a camera, choose NOT to delete them from the camera, and then later re-import the same photos (possibly along with some additional ones that have been taken in the meantime).

1a. If you do this into the same folder (e.g. same import program, same settings), you probably get a message asking whether you want to "replace, skip or keep both" (or something like that). Of course, you choose "keep both" :-) - and now you have a set of photos called e.g. IMGP0021 and IMGP0021-1, etc. Note that you can get the same effect by importing the photos twice, on different laptops, and subsequently copying / merging the folders between them.

1b. If you do this into different folders (e.g. once with, say, Picasa and once with Lightroom - with their differing default naming schemes), you might finish up with two folders (say "2015-06-01" and "01 June 2015"). Again these could be exact duplicates, or the latter could contain additional photos taken later. Or one or the other folder could have had some 'sidecar' files added (e.g. Thumbs.db for some viewing apps, etc.).

2. You want to copy some of the photos from a folder to another machine. First you copy the whole folder, then delete some files you know you don't need - then you can put this tidied folder onto a USB stick, or use an FTP app, or .... and in the end, you forget to remove this temp copy.

3. Almost any other thing you can imagine ...


So I decided to keep the output very simple, and only try to distinguish two cases (1a and all the rest). See below for the gory details, if you want :-)

This gave me a rather long, but usable set of output descriptions. It then took me less than an hour of slightly tedious work (one IDE window, one terminal/shell window and two Finder windows ...) to eliminate roughly 20 fully duplicated directories, plus an additional 20-30 partially duplicated ones and a lot of file-name duplicated ones (i.e. case 1a above) - removing just over 10% of the total number of files and disk space in use.


Design choices.
If I had been developing an app for market (either for sale, or as a freebie for people to just *use*), it would have been an easy choice to spend an extra day programming and provide some built-in display and deletion options - reducing the "tedious one hour" to a "mildly annoying 10 minutes" of removing the files.

On the other hand, if this had been just a tool for myself, to find and eliminate these 4000-5000 files, I'd have stopped programming a day or two earlier, spent two tedious hours doing "rm *-2.jpg" etc., and achieved the same end result.

But because I wanted to make the stack available for others, and I wanted to think about and write about the choices I made along the way - while not making it anything like a polished app - this middle path feels right. The point is that there is no single *right* answer to the question of how much development time is justified until you are clear on your target and purpose for doing the development in the first place.

Now on to the interesting challenge of finding and dealing with "non-interesting" photos ...
 -- Alex.

P.S. The gory details of the output format I chose in the end.

The first case is handled by reporting

Folder <foldername> has N files, with M self-matches

followed by a paired list of the matching files (i.e. M cases of files duplicated directly within the same folder, out of N total files).

e.g.

Folder /Users/alextweedly/Dropbox (Personal)/Pictures/2014/2014-06-18
has 38 files, and 18 self matches
IMGP0128-2.JPG IMGP0128.JPG
IMGP0118-2.JPG IMGP0118.JPG
IMGP0129-2.JPG IMGP0129.JPG
 .....
IMGP0120-2.JPG IMGP0120.JPG
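
(Assembling that report is about as simple as it looks. Here's a sketch - reportSelfMatches and its parameters are illustrative names of mine, not the stack's actual handlers - which assumes pPairs holds one matching pair per line:)

on reportSelfMatches pFolder, pFileCount, pPairs
   local tReport
   put "Folder" && pFolder & return into tReport
   put "has" && pFileCount && "files, and" && \
         the number of lines of pPairs && "self matches" & return after tReport
   put pPairs after tReport
   put tReport -- to the Message Box; a real stack might use a scrolling field
end reportSelfMatches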


All other cases report as

Folders <foldername1> and <foldername2> have N1 and N2 files, with NN matches

followed by a list of the matching pairs.

Generally this will result in something like

Folders /Users/alextweedly/Dropbox (Personal)/Pictures/2011/2011-09-13 and /Users/alextweedly/Dropbox (Personal)/Pictures/2011/9 Oct 2011
have 9 and 54 files, and 9 matches
IMG_0236.JPG IMG_0236.JPG
IMG_0237.JPG IMG_0237.JPG
....
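
Finding those cross-folder pairs is essentially a hash join on the md5 values. A minimal sketch, reusing the made-up folderHashes() from earlier (again, illustrative only, not the stack's actual code):

function matchFolders pFolder1, pFolder2
   local tByHash, tMatches
   set the itemDelimiter to tab
   -- index folder2's files by content hash
   repeat for each line tLine in folderHashes(pFolder2)
      put item 1 of tLine into tByHash[item 2 of tLine]
   end repeat
   -- pair each folder1 file with a folder2 file of identical content
   repeat for each line tLine in folderHashes(pFolder1)
      if tByHash[item 2 of tLine] is not empty then
         put item 1 of tLine && tByHash[item 2 of tLine] & return after tMatches
      end if
   end repeat
   return tMatches
end matchFolders

(Note the simplification: only one folder2 name is kept per hash, so several identical copies within folder2 would be under-reported - a fuller version would have to cope with that.)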


In this particular example the names all match, and there are 9 matches for the 9 files in "folder1" - so, as a user, you then need to decide whether

(a) you should just delete folder1 and all its contents, or

(b) there are other duplicated folders/files (e.g. 2011-09-12, 2011-09-10, etc.) which between them contain all the other files from the 54 in "9 Oct 2011" - in which case that's the folder you should delete completely.

In fact, that's what I did. I'm sticking with Lightroom rather than the other photo management apps - so the 2011-09-13 style of folder naming is the one I'd prefer to use.



