On 30 Nov 2013, at 10:37, Graham Cox <graham....@bigpond.com> wrote:

> 
> On 30 Nov 2013, at 12:41 am, Kyle Sluder <k...@ksluder.com> wrote:
> 
>> That’s not a good idea. If the file on disk changes out from under you, the 
>> cache no longer reflects the data.
> 
> That can’t happen - the image data is embedded in the file stream at the 
> moment, not held as a separate file. While I intend to change to that at some 
> point, I don’t do it now. When I do, I’ll also (have to) change the way that 
> the image references in the main stream are written. So the hash approach is 
> only really a stop-gap - it gets me better performance without changing the 
> basic architecture.
> 
> @Scott Ribe:
> 
>> I recently (last week, no kidding) investigated this question, and went with 
>> murmur hash.
> 
> I went with a 64-bit FNV-1a, which looks on a par with murmur, according to 
> the page linked by Eric: 
> http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed.
>  Which admittedly I am taking at face value, having only profiled my own code 
> for speed, not randomness, etc. The main reason being that an implementation 
> of FNV-1a was right there on the page :) 


Hi Graham,

Anything short of a cryptographic hash is unsuitable for you here. You’re 
living dangerously, asking for trouble when a collision does happen.
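
For what it's worth, a collision-resistant digest isn't much code either. A rough
sketch, streaming a file through SHA-256 (Swift and the modern CryptoKit framework
purely for illustration; CommonCrypto offers the same digests on older systems):

    import CryptoKit
    import Foundation

    // Illustrative sketch only: a cryptographic digest of a file's contents, read in chunks.
    func sha256Digest(of url: URL) throws -> String {
        let handle = try FileHandle(forReadingFrom: url)
        defer { handle.closeFile() }

        var hasher = SHA256()
        while true {
            let chunk = handle.readData(ofLength: 64 * 1024)
            if chunk.isEmpty { break }
            hasher.update(data: chunk)
        }
        return hasher.finalize().map { String(format: "%02x", $0) }.joined()
    }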

As it happens, we have almost exactly the same scenario as you in Sandvox. We 
store documents as a file package, where each piece of media is kept on disk, 
and we want to avoid unnecessary duplication of that media. We experimented 
with hashing to help with this in the past, but in practice it wasn’t worth it.

Instead, each time the document is saved, for any new media we scan through the 
existing media files looking for any that are the same length. Any that are get 
properly compared for equality, and we act on the result.
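
To make that concrete, here is a rough sketch of the sort of thing, in Swift purely 
for brevity. The names (existingDuplicate, mediaDirectory and so on) are made up, 
and NSFileManager's -contentsEqualAtPath:andPath: stands in for the final comparison:

    import Foundation

    // Illustrative sketch only: look for an existing media file that duplicates the
    // new one, using file length as a cheap pre-filter before any byte comparison.
    func existingDuplicate(of newMediaURL: URL, in mediaDirectory: URL) throws -> URL? {
        let fileManager = FileManager.default
        let newLength = try newMediaURL.resourceValues(forKeys: [.fileSizeKey]).fileSize ?? 0

        let existing = try fileManager.contentsOfDirectory(at: mediaDirectory,
                                                           includingPropertiesForKeys: [.fileSizeKey])
        for candidate in existing {
            let length = try candidate.resourceValues(forKeys: [.fileSizeKey]).fileSize ?? 0
            guard length == newLength else { continue }    // different lengths: cannot be equal

            // Only files of identical length get the (potentially expensive) full comparison.
            if fileManager.contentsEqual(atPath: newMediaURL.path, andPath: candidate.path) {
                return candidate
            }
        }
        return nil
    }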

This sounds like a pretty poor algorithm, but in practice it works well. The 
chances of two files having exactly the same length are actually pretty slim. If 
you think of the length as a simple hash, it has the nice property that as 
equality testing gets more expensive (i.e. as the files in question get larger), 
collisions become ever less likely.

Similarly, the actual equality test is performed in chunks as we read off disk, 
rather than by processing the whole file at once (NSFileManager has a nice method 
for doing this with two files, and we have our own variants to handle one or both 
files being in memory), so we can bail out of the check as soon as a mismatch 
occurs. Again, this has the nice property that the further through the file you 
get without a mismatch, the more likely the files are to be equal.
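
For the variants where one side is already in memory, the chunked check with an 
early bail-out is easy enough to roll yourself. A minimal sketch along these lines 
(again Swift for brevity; the 64 KB chunk size is arbitrary):

    import Foundation

    // Illustrative sketch only: compare on-disk contents against in-memory data in
    // chunks, bailing out at the first mismatch instead of reading the whole file.
    func fileContents(at url: URL, equal data: Data, chunkSize: Int = 64 * 1024) throws -> Bool {
        let handle = try FileHandle(forReadingFrom: url)
        defer { handle.closeFile() }

        var offset = 0
        while true {
            let chunk = handle.readData(ofLength: chunkSize)
            if chunk.isEmpty {
                return offset == data.count    // end of file: equal only if all of `data` was consumed
            }
            let end = offset + chunk.count
            guard end <= data.count, data.subdata(in: offset..<end) == chunk else {
                return false                   // mismatch (or the file is longer): stop reading
            }
            offset = end
        }
    }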

If two files turn out to be equal, then we've done the work of reading both of 
them off disk (that's the worst case, where neither is in memory), which seems 
wasteful. But it isn't, when you consider that with no testing at all the new file 
would still have to be read once in its entirety and written to the new location; 
both approaches work out to roughly the same amount of work. And given the points 
above, in the event of a mismatch the odds are heavily in our favour that it will 
be discovered without having done very much work.

All in all, it's working pretty well for us. We've since adopted background 
document saving, which makes the process even smoother, but even on 10.6 it 
works well.

Mike.