> -----Original Message----- > From: Andone, Bogdan [mailto:bogdan.and...@intel.com] > Sent: Wednesday, March 09, 2016 9:33 AM > To: 'Nikita Popov' > Cc: internals@lists.php.net > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups > > > > > -----Original Message----- > > From: Andone, Bogdan [mailto:bogdan.and...@intel.com] > > Sent: Tuesday, March 08, 2016 4:19 PM > > To: 'Nikita Popov' > > Cc: internals@lists.php.net > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups > > > > > From: Nikita Popov [mailto:nikita....@gmail.com] > > > Sent: Tuesday, March 08, 2016 3:43 PM > > > To: Andone, Bogdan > > > Cc: internals@lists.php.net > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups > > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita....@gmail.com> > > wrote: > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan > > > <bogdan.and...@intel.com> wrote: > > > >> Hi Guys, > > > >> > > > >> I would like to propose a small change into the DJBX33A hash > > function > > > algorithm which will make easier the key matching validations in > hash > > > lookup functions. > > > >> > > > >> The change addresses the modulo 8 tailing bytes of the key. For > > these > > > bytes we can use an 8 bit shift instead of a 5 bit shift; we also > need > > > to replace the ADD by XOR, in order to avoid byte level overflows. > > This > > > change ensures the uniqueness of the hash function transformation > for > > > the tailing bytes: supposing two strings have same partial hash > value > > > for the first Nx8 bytes, different combinations of tailing > characters > > > (with the same tail size) will always generate different keys. > > > >> We have the following consequences: > > > >> If two strings have: > > > >> - same hash value, > > > >> - same length, > > > >> - same bytes for the first Nx8 positions, > > > >> then they are equal, and the tailing bytes can be skipped during > > > comparison. > > > >> > > > >> There is a visible performance gain if we apply this approach as > we > > > can use a lightweight memcmp() implementation based on longs > > comparison > > > and completely free of the complexity incurred by tailing bytes. For > > > Mediawiki I have a 1.7% performance gain while Wordpress reports > 1.2% > > > speedup on Haswell-EP. > > > >> > > > >> Let’s take a small example: > > > >> Suppose we have a key=”this_is_a_key_value”. > > > >> The hash function for the first N x 8 byes are computed in the > > > original way; suppose “this_is_a_key_va” (16bytes) will return a > > partial > > > hash value h1; the final hash value will be computed by the > following > > > sequence: > > > >> h = ((h1<<8) ^ h1) ^ ‘l’; > > > >> h = ((h<<8) ^ h) ^ ‘u’; > > > >> h = ((h<<8) ^ h) ^ ‘e’; > > > >> or, in only one operation: > > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ > ((‘l’^‘u’)<<8) > > ^ > > > (‘l’^’u’^‘e’) > > > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ > > (‘l’^’u’^‘e’) cannot > > > be obtained by any other 3 characters long tail. The statement is > not > > > true if we use ADD instead of XOR, as extended ASCII characters > might > > > generate overflows affecting the LSB of the higher byte in the hash > > > value. > > > >> > > > >> I pushed a pull request here: https://github.com/php/php- > > > src/pull/1793. Unfortunately it does not pass the travis tests > because > > > “htmlspecialchars etc use a generated table that assumes the current > > > hash function” as noticed by Nikita. > > > >> > > > >> Let me know your thoughts on this idea. > > > > > > > > Hey Bogdan, > > > > This looks like an interesting idea! I'm somewhat apprehensive > about > > > coupling this to a change of the hash function, for two reasons: > > > > a) This will make it more problematic if we want to change the > hash > > > function in the future, e.g. if we want to switch to SipHash. > > > > b) The quality of the new hash distribution is not immediately > > clear, > > > but likely non-trivially weaker. > > > > So I'm wondering if we can keep the concept of using a zend_ulong > > > aligned memcmp while leaving the hash function alone: The > zend_string > > > allocation policy already allocates the string data aligned and > padded > > > to zend_ulong boundaries. If we were to additionally explicitly zero > > out > > > the last byte (to avoid valgrind warnings) we should be able to > > compare > > > the character data of two zend_strings using a zend_ulong memcmp. > This > > > would have the additional benefit that it works for normal string > > > comparisons (unrelated to hashtables) as well. On the other hand, > this > > > is only possible for zend_string to zend_string comparisons, not for > > > comparisons with static strings. > > > > Regards, > > > s/zero out the last byte/zero out the last zend_ulong > > > I'd like to add another issue with relying on the hash for this > which > > I > > > just remembered: We currently always set the top bit of the hash for > > > strings (see > > http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), > > > in order to ensure that hashes are never zero. This makes the hash > > non- > > > unique. > > > Nikita > > Yeah... I missed the top bit set, but I think it could be solved > > somehow. > Actually setting the top bit is not an issue. The transformation unicity > shall be ensured for the tailing bytes only which are shifted to the > left maximum 7(on 64bits)/3(on 32bits) positions; so they will never > reach the top byte and setting the top bit will not harm in terms of > tailing bytes. > > > Imho the strongest reason against the patch is the dependency > > between the hash function and the key validation method but I have the > > impression that this is not be the most dirty hack deployed ever in > PHP; > > and I don't know very well what is the degree of compromise acceptable > > for 1% speedup :-) > Otherwise the concerns about the quality of the hash function and the > intention to change it with SipHash or something else, I am afraid it > will not bring much in terms of collisions reductions; the issue here is > the fact that hash bucket selection uses only the last few bits of the > hash value. Not sure how much could help here SipHash... > Just for defending the technical feasibility of this POC I think that > injective transformation for last tailing bytes could be deployed over > theoretically any hash function. > > Thanks, > Bogdan > > > On the other hand, your idea to zero the last zend_ulong in the > > zend_string looks nice. There could be an additional potential gain > also > > from lightweight implementation of memcmp(). I will give it a try to > see > > if it gets better results than the current proposal. > > I assume all zend_string allocation functions are located in > > zend_string.h, correct? > > > > Thanks, > > Bogdan I worked a little bit of adapting the zend string allocation functions for ensuring long size alignment for both Zend and standard allocation, and for zeroing out the end of the strings. The sad surprise was to discover that there are other >100 situations, in extensions mainly, where the zend strings are truncated by directly inserting '0' terminators in the middle of strings; all these situations needs to be revised for zeroing out the entire memory word enclosing the end of the string. It will probably take a while for revising most of these places before having a correct real life run for being able to validate the POC. Optimistically speaking, supposing POC will show motivating results, we will probably need to enforce the Zend interface with respect to string truncation - at least a getter/setter pair for ZSTR_LEN where the setter should ensure also synchronizing the string termination. I suppose it will be an important effort for the community and I am not sure how much willingness will be in it. On the other side, the originally proposed solution does not have all this drawbacks and it offers 1% speedup by a localized change which could be easily reverted in the case we will get better results by a better string alignment in future.
Thanks, Bogdan