Re: hash values and comparisons of strings

Nicholas Clark Fri, 24 May 2002 15:16:12 -0700

On Fri, May 24, 2002 at 12:43:00AM +0200, Peter Gibbs wrote:

Reformatted slightly as "X-Mailer: Microsoft Outlook Express 5.50.4133.2400"
seems to like re-wrapping your hardwrapped lines.


> Reading this made me wonder if we should consider cached string
> transcodings, if we don't end up storing strings in a single form
> internally. The worst case is probably string constants, which could be
> transcoded over and over again into the same alternate encoding. As an
> extension of this, the hashing algorithm could be deemed to be another
> encoding i.e. the hashed value of a string could also be cached. Trying to
> decide when a cache entry was no longer needed could be a little bit tricky,
> but it might be worth giving some thought to.

I suspect that string constants could be quite important.
Being able to cache each transcoding as it is needed could speed things up.
It would be good to have all references to a constant still point to one
place, so that the transcoding only has to be done once and everyone benefits.
I don't know if threading screws this idea up (because you'd have to lock
the shared constant every time anything reads it, to stop anyone else making
a new transcoding appear just as you read it).

Nick I-S recoded parts of perl5 so that constants like the bar in $foo{bar}
would be stored as scalars which both point to entries in the shared string
table and contain their pre-computed hash value.

He did see a speedup in his heavy OO program (in Tk, I suspect) of a few
percent, but the infamous perlbench can't be goaded into showing any sort of
speed up or slowdown on this or various of my hash key experiments.

I believe having the scalars use a pointer to a shared hash key (rather than
a private malloc()ed buffer) was the bigger win, as the comparison of two
keys of equal length becomes
  left == right || memcmp(left, right, length) == 0
and hopefully the pointer comparison hits often.
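
As a minimal sketch of that comparison (keys_equal is a hypothetical name
for illustration, not the actual perl5 routine, and it assumes both keys are
already known to have the same length):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not perl5's real code.
 * Compares two keys of identical length. If both scalars point at the
 * same entry in a shared string table, the pointer test succeeds and
 * memcmp is never called. */
static int keys_equal(const char *left, const char *right, size_t length)
{
    assert(left != NULL && right != NULL);
    return left == right || memcmp(left, right, length) == 0;
}
```

When the two pointers differ the cost is one extra comparison before the
memcmp, so the shortcut only pays off if keys really are shared often.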

However, pre-computed hash values may have helped a bit in perl5.

There are probably several ideas to take out of my ramble:

1: If it can be arranged for parrot to have constants in some shared pool,
   and better still things are copy-on-write from it, then C<eq> can be
   accelerated by comparing pointers to things (if one isn't a substring)

2: If scalars are able to cache their hash value, then C<eq> for 2 scalars
   with cached hash values can be accelerated by first comparing the
   hash values, and rejecting if different.
   This is only going to work on good old fashioned binary comparisons,
   or if hash values are calculated by transcoding and normalising to some
   form that considers things equivalent in the same way that C<eq> should.

3: If it can be arranged for hash values to be cached in scalars (even if
   the transcoding of the string into whatever encoding the hash keys are
   stored in is no longer cached) then it provides a quick reject mechanism
   when looking to see if that string is in a hash - if the cached hash value
   doesn't match the hash value of any key in the target hash, then you
   know it's not in there, and you don't need to transcode the string.
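
The three ideas above could be sketched roughly as follows. Everything here
is an assumption made for illustration: str_t, str_hash, str_eq and find_key
are hypothetical names, not parrot or perl5 APIs; the hash is FNV-1a chosen
only as a placeholder; and the comparison is a plain binary one, with no
transcoding or normalisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical string header; not parrot's or perl5's real layout. */
typedef struct {
    const char *buf;        /* may point into a shared constant pool */
    size_t      len;
    uint32_t    hash;       /* meaningful only when hash_valid != 0 */
    int         hash_valid;
} str_t;

/* Compute the hash on first use and cache it (placeholder: 32-bit FNV-1a). */
static uint32_t str_hash(str_t *s)
{
    if (!s->hash_valid) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < s->len; i++) {
            h ^= (unsigned char)s->buf[i];
            h *= 16777619u;
        }
        s->hash = h;
        s->hash_valid = 1;
    }
    return s->hash;
}

/* Binary eq with the cheap tests tried first. */
static int str_eq(const str_t *a, const str_t *b)
{
    if (a->len != b->len)
        return 0;
    if (a->buf == b->buf)                       /* idea 1: same shared buffer */
        return 1;
    if (a->hash_valid && b->hash_valid && a->hash != b->hash)
        return 0;                               /* idea 2: quick reject */
    return memcmp(a->buf, b->buf, a->len) == 0;
}

/* Idea 3: scan candidate keys, rejecting on hash before looking at bytes. */
static const str_t *find_key(const str_t *keys, size_t nkeys, str_t *needle)
{
    uint32_t h = str_hash(needle);
    assert(keys != NULL || nkeys == 0);
    for (size_t i = 0; i < nkeys; i++) {
        if (keys[i].hash_valid && keys[i].hash != h)
            continue;
        if (str_eq(&keys[i], needle))
            return &keys[i];
    }
    return NULL;
}
```

Equal hash values prove nothing, so the memcmp is still needed to confirm a
match; the cached values can only rule candidates out.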


I confess I don't have an understanding of how and when the innards of the
parrot string system do transcoding, or how C<eq> and hash keys are going
to deal with Unicode normal forms, so there may be some flaws in the above
which are obvious to anyone who does understand these things.
I could also have stupid mistakes in my reasoning.

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/
