On 11/04/2025 15:49, Olivier Dion wrote:
My goal here is that I have GOOPS object. The object is used to produce
a pure result which I can store in a cache on the disk, given the hash
of the object. I can then re-fetch the result on disk in another Guile
process if the hashes match. As you can see in the above code, GOOPS
instances get hashed by folding over their slots, which can include
procedures.
The usual solution to this kind of thing, is to:
(1) hash procedures by pointer (not consistent across processes, not
applicable for your case) (2) don't hash procedures (3) Allow classes to
override how hashing is performed. Instead of (define (hash value)
[...]), you could have (define-method (hash value) [...]). (Also be sure
to define equality in a way such that 'a = b -> hash(a)=hash(b)' -
doesn't have to be '=' or 'equal?', but it does need to be the equality
procedure used for hash table things.) (Also consider (hash value n)
instead, so implementers have a clue for reasonable output size.) (3)(a)
if a field is a thunk merely as a way to do laziness (and forcing the
lazy isn't expected to be computationally expensive), force the lazy and
hash the result (3)(b) if the procedure field is irrelevant to the
considered problem, don't hash it (and adjust your expectations to make
it not an element of surprise) (3)(c) if there is a limited set of
procedures it could be, do something like hashing by name (specifics
depend on specific situation) (3)(d) if the procedure field is relevant,
and there is no apparent way to hash something else instead, change this
situation (4) if nothing appears applicable for the problem at hand,
change the problem.
If you do go for bytecode hashing, you could consider looking at the
closure and hashing things in there as well, to reduce hash collisions.
I can then re-fetch the result on disk in another Guile process if the hashes
match.
You can't (at least not with the mentioned hash), because of likely hash
collisions. If the 'hash' function doesn't have collisions, then either
the input space is small (finite) and hence the 'hash' isn't
general-purpose (not always an issue, but in your case it seems like it
would be), or the output space is infinite, in which case it has lost
its function as a hash.
Typical hashing of the non-cryptographic kind aren't designed to
virtually eliminate hash collisions, rather they are designed to be
'good enough', and hash collisions are not expected to be eliminated,
instead they are accounted for in some way. In your case, the disk acts
as a hash table, so you could make a bucket list (so you would need to
also save the unhashed _keys_ instead of only their hash - hashes then
aren't to identify things on their own, but rather to speed things up a
lot).
Best regards, Maxime Devos