"Dr.Ruud" <rvtol+use...@isolution.nl> writes:

> On 12/06/2013 11:33, lee wrote:
>> Jim Gibson <jimsgib...@gmail.com> writes:
>>> On Jun 11, 2013, at 9:44 PM, lee wrote:
>
>>>> I've been googling for examples of how to create a sha-2 sum of a
>>>> file in perl without success.  What I'm looking for is something
>>>> like:
>>>>
>>>>    $hash = create_sha2_sum( $filename);
>>>>
>>>> Do you know of any examples I could look at?  Or is there a better
>>>> way to figure out if a file has been modified?
>>>
>>> The first thing to do would be to check the file size. If the file
>>> size has changed, then the file has been modified. So you will want to
>>> save the file size.
>>
>> The file might be modified without changing its size ...
>>
>>> If the files sizes are the same, then you can compare some sort of
>>> digest e.g. SHA. I haven't used any, so I cannot advise.
>>
>> ... so I'm better off by just using a hash which I'd need anyway.
>
> No. If the file is real big, then calculating the hash (of the new
> file) can take a long time. Which would be superfluous if the file
> size also has changed.
>
> I store: file size, fingerprint of first 256 bytes, fingerprint of
> total file. So only if both the size and the light fingerprint are the
> same, I need to check the full fingerprint.

Oh, now I see your point: you are trying to avoid having to compute
hashes for large files when it isn't needed, so you get much better
efficiency by checking other information first.
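
For reference, the hash itself is easy to get with Digest::SHA, which
ships with perl; a minimal sketch of the create_sha2_sum() I was
originally looking for (the function name is just my own placeholder)
might be:

    use strict;
    use warnings;
    use Digest::SHA;

    # Return the SHA-256 hex digest of the contents of a file.
    sub create_sha2_sum {
        my ($filename) = @_;
        return Digest::SHA->new(256)->addfile($filename)->hexdigest;
    }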


In my application, my estimate is that there will be a set of around
100--150 files.  Once a file is closed and reported one last time, it
doesn't need to be considered anymore, so the number of relevant files
is limited.  Each file is only about 2kB in size.  Reports will be
generated only monthly.

Considering this, it seems doubtful that the additional programming and
procedural effort needed to handle the cases in which not /both/ mtime
and size have changed, compared to simply going by hashes alone, is
worth the performance benefit: the effective difference in this case is
probably the difference between "(almost) instantly" and "about 3
seconds".


OTOH, it would be nicer to make it so that file size doesn't have a
major impact on performance, because the solution would be more
versatile.  Unfortunately, hashing only (random) parts of the files
won't suffice, because a different part of the file might have changed
than the one sampled.  I don't want the handling of exceptions to
require manual intervention, either.  This means that I can't get
around saving hashes of whole files.  I can only save computing hashes
for those files that still have the same size /and/ the same mtime they
had a month ago.

That said, I do like this idea.  Hashes would need to be computed
during report generation only when size and mtime indicate that a file
might have changed, instead of computing them all every time just to
see whether a file did change.  I think I'll probably go for that.
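
Something along these lines is what I have in mind; $stored stands for
whatever size/mtime/hash record was saved at the last report run, so
the names are placeholders rather than anything existing:

    use strict;
    use warnings;
    use Digest::SHA;

    # $stored: hashref with the size, mtime and hash recorded for
    # this file during the previous report run.
    sub file_changed {
        my ($filename, $stored) = @_;
        my ($size, $mtime) = (stat $filename)[7, 9];

        # Same size and same mtime as a month ago: treat the file as
        # unchanged and skip computing the hash.
        return 0 if $size == $stored->{size} && $mtime == $stored->{mtime};

        # Size or mtime differs: compare the full hash to be certain.
        my $hash = Digest::SHA->new(256)->addfile($filename)->hexdigest;
        return $hash ne $stored->{hash};
    }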


-- 
"Object-oriented programming languages aren't completely convinced that
you should be allowed to do anything with functions."
http://www.joelonsoftware.com/items/2006/08/01.html

