On 30 Nov 2013, at 11:51 pm, Mike Abdullah <mabdul...@karelia.com> wrote:
> Anything short of a cryptographic hash is unsuitable for you here. You’re > living dangerously, asking for trouble when a collision does happen. So, I’ve written a test which scans a chosen folder, hashing each file (all files, not just images) and storing the hashes. If a hash collides, I store it in a list and log it. Scanning my entire hard drive (excluding hidden files), which took several hours, sure I had plenty of collisions - but absolutely no false ones - they all turned out to be genuine duplicates of existing files. This is using the FNV-1a 64-bit hash + length approach. I’m thinking this is good enough, really. The odds of a particular user having two different image files that collide, and happening to add those exact images at once to our app must be astronomically low. Talk me out of it :) Here’s the code, if anyone is interested enough to want to reproduce these results, or spot a flaw in my test: The actual hash, which is part of a category on NSData: - (u_int64_t) fnvHash64 { unsigned char* p = (unsigned char*)[self bytes]; u_int64_t h = 14695981039346656037U; NSUInteger i = 0, len = [self length]; while( i < len ) h = p[i++] ^ (h * 1099511628211U); return h; } The hash class is much as Kyle suggested: @interface GCDataHash : NSObject <NSCopying> { @private u_int64_t mHash; NSUInteger mLength; } @property (nonatomic, readonly) NSUInteger length; - (instancetype) initWithData:(NSData*) data; - (BOOL) isEqual:(id) object; - (NSUInteger) hash; @end @implementation GCDataHash @synthesize length = mLength; - (instancetype) initWithData:(NSData*) data { self = [super init]; if( self ) { mHash = [data fnvHash64]; mLength = [data length]; } return self; } - (NSUInteger) hash { return mHash; } - (BOOL) isEqual:(id) object { if( object == self ) return YES; if(![object isKindOfClass:[GCDataHash class]]) return NO; GCDataHash* other = (GCDataHash*)object; return other->mLength == mLength && other->mHash == mHash; } - (NSString*) description { return [NSString stringWithFormat:@"%@ length=%lu, hash=%lu", [super description], mLength, [self hash]]; } - (id) copyWithZone:(NSZone *)zone { return [self retain]; } This is the test, which expects a directory URL. ivars ‘collisions' is a mutable array, ‘hashes' a mutable dictionary. With as large a data set as I can reasonably muster (which is my entire hard disk), it returns 0 collisions. It's set to go into packages so it would scan all the internals of my iPhoto library, among others. - (void) runTestWithURL:(NSURL*) url { [collisions removeAllObjects]; [hashes removeAllObjects]; NSDirectoryEnumerator* dirEnum = [[NSFileManager defaultManager] enumeratorAtURL:url includingPropertiesForKeys:nil options:NSDirectoryEnumerationSkipsHiddenFiles errorHandler:^BOOL(NSURL *url, NSError *error) { return YES; }]; NSUInteger count = 0; for( NSURL* file in dirEnum ) { @autoreleasepool { NSError* error = nil; NSData* data = [NSData dataWithContentsOfURL:file options:NSDataReadingMappedIfSafe | NSDataReadingUncached error:&error]; if( data ) { GCDataHash* hash = [[GCDataHash alloc] initWithData:data]; // check for hash collision NSURL* existing = [hashes objectForKey:hash]; if( existing ) { // the hash collided. See if the data is actually different, or whether it's just a copy // of an existing file. NSData* otherData = [NSData dataWithContentsOfURL:existing]; if(![otherData isEqualToData:data]) { // we have a bona fide collision - hash matches but data doesn't. Save that. NSDictionary* crunch = @{ hash : file }; [collisions addObject:crunch]; NSLog(@"%lu: collision for path '%@', collided with '%@', hash = %@", collisions.count + 1, file, existing, hash ); } } else [hashes setObject:file forKey:hash]; [hash release]; ++count; } } } NSLog(@"hashes tested: %lu, collided: %lu", count, collisions.count ); } _______________________________________________ Cocoa-dev mailing list (Cocoa-dev@lists.apple.com) Please do not post admin requests or moderator comments to the list. Contact the moderators at cocoa-dev-admins(at)lists.apple.com Help/Unsubscribe/Update your Subscription: https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com This email sent to arch...@mail-archive.com