RE: [PHP-DEV] Lazy keys comparison during hash lookups

Andone, Bogdan Fri, 29 Apr 2016 02:55:27 -0700


> -----Original Message-----
> From: Matt Wilmas [mailto:php_li...@realplain.com]
> Sent: Wednesday, April 27, 2016 5:44 PM
> To: Andone, Bogdan <bogdan.and...@intel.com>; 'Nikita Popov'
> <nikita....@gmail.com>
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> 
> Hi Bogdan, all,
> 
> ----- Original Message -----
> From: "Andone, Bogdan"
> Sent: Monday, April 25, 2016
> 
> >> -----Original Message-----
> >> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> Sent: Friday, March 18, 2016 11:58 AM
> >> To: 'Nikita Popov' <nikita....@gmail.com>
> >> Cc: internals@lists.php.net
> >> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >>
> >> > -----Original Message-----
> >> > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> > Sent: Wednesday, March 09, 2016 9:33 AM
> >> > To: 'Nikita Popov'
> >> > Cc: internals@lists.php.net
> >> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> >
> >> >
> >> >
> >> > > -----Original Message-----
> >> > > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> > > Sent: Tuesday, March 08, 2016 4:19 PM
> >> > > To: 'Nikita Popov'
> >> > > Cc: internals@lists.php.net
> >> > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > >
> >> > > > From: Nikita Popov [mailto:nikita....@gmail.com]
> >> > > > Sent: Tuesday, March 08, 2016 3:43 PM
> >> > > > To: Andone, Bogdan
> >> > > > Cc: internals@lists.php.net
> >> > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov
> >> > > > ><nikita....@gmail.com>
> >> > > wrote:
> >> > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> >> > > > <bogdan.and...@intel.com> wrote:
> >> > > > >> Hi Guys,
> >> > > > >>
> >> > > > >> I would like to propose a small change into the DJBX33A hash
> >> > > function
> >> > > > algorithm which will make easier the key matching validations
> >> > > > in
> >> > hash
> >> > > > lookup functions.
> >> > > > >>
> >> > > > >> The change addresses the modulo 8 tailing bytes of the key.
> >> > > > >> For
> >> > > these
> >> > > > bytes we can use an 8 bit shift instead of a 5 bit shift; we
> >> > > > also
> >> > need
> >> > > > to replace the ADD by XOR, in order to avoid byte level overflows.
> >> > > This
> >> > > > change ensures the uniqueness of the hash function
> >> > > > transformation
> >> > for
> >> > > > the tailing bytes: supposing two strings have same partial hash
> >> > value
> >> > > > for the first Nx8 bytes, different combinations of tailing
> >> > characters
> >> > > > (with the same tail size) will always generate different keys.
> >> > > > >> We have the following consequences:
> >> > > > >> If two strings have:
> >> > > > >> - same hash value,
> >> > > > >> - same length,
> >> > > > >> - same bytes for the first Nx8 positions, then they are
> >> > > > >> equal, and the tailing bytes can be skipped during
> >> > > > comparison.
> >> > > > >>
> >> > > > >> There is a visible performance gain if we apply this
> >> > > > >> approach as
> >> > we
> >> > > > can use a lightweight memcmp() implementation based on longs
> >> > > comparison
> >> > > > and completely free of the complexity incurred by tailing bytes.
> >> > > > For Mediawiki I have a 1.7% performance gain while Wordpress
> >> > > > reports
> >> > 1.2%
> >> > > > speedup on Haswell-EP.
> >> > > > >>
> >> > > > >> Let's take a small example:
> >> > > > >> Suppose we have a key="this_is_a_key_value".
> >> > > > >> The hash function for the first N x 8 byes are computed in
> >> > > > >> the
> >> > > > original way; suppose "this_is_a_key_va" (16bytes) will return
> >> > > > a
> >> > > partial
> >> > > > hash value h1; the final hash value will be computed by the
> >> > following
> >> > > > sequence:
> >> > > > >> h = ((h1<<8) ^ h1) ^ 'l';
> >> > > > >> h = ((h<<8) ^ h) ^ 'u';
> >> > > > >> h = ((h<<8) ^ h) ^ 'e';
> >> > > > >> or, in only one operation:
> >> > > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ ('l'<<16) ^
> >> > (('l'^'u')<<8)
> >> > > ^
> >> > > > ('l'^'u'^'e')
> >> > > > >> We can see that ht=('l'<<16) ^ (('l'^'u')<<8) ^
> >> > > ('l'^'u'^'e') cannot
> >> > > > be obtained by any other 3 characters long tail. The statement
> >> > > > is
> >> > not
> >> > > > true if we use ADD instead of XOR, as extended ASCII characters
> >> > might
> >> > > > generate overflows affecting the LSB of the higher byte in the
> >> > > > hash value.
> >> > > > >>
> >> > > > >> I pushed a pull request here: https://github.com/php/php-
> >> > > > src/pull/1793. Unfortunately it does not pass the travis tests
> >> > because
> >> > > > "htmlspecialchars etc use a generated table that assumes the
> >> > > > current hash function" as noticed by Nikita.
> >> > > > >>
> >> > > > >> Let me know your thoughts on this idea.
> >> > > > >
> >> > > > > Hey Bogdan,
> >> > > > > This looks like an interesting idea! I'm somewhat
> >> > > > > apprehensive
> >> > about
> >> > > > coupling this to a change of the hash function, for two reasons:
> >> > > > > a) This will make it more problematic if we want to change
> >> > > > > the
> >> > hash
> >> > > > function in the future, e.g. if we want to switch to SipHash.
> >> > > > > b) The quality of the new hash distribution is not
> >> > > > > immediately
> >> > > clear,
> >> > > > but likely non-trivially weaker.
> >> > > > > So I'm wondering if we can keep the concept of using a
> >> > > > > zend_ulong
> >> > > > aligned memcmp while leaving the hash function alone: The
> >> > zend_string
> >> > > > allocation policy already allocates the string data aligned and
> >> > padded
> >> > > > to zend_ulong boundaries. If we were to additionally explicitly
> >> > > > zero
> >> > > out
> >> > > > the last byte (to avoid valgrind warnings) we should be able to
> >> > > compare
> >> > > > the character data of two zend_strings using a zend_ulong memcmp.
> >> > This
> >> > > > would have the additional benefit that it works for normal
> >> > > > string comparisons (unrelated to hashtables) as well. On the
> >> > > > other hand,
> >> > this
> >> > > > is only possible for zend_string to zend_string comparisons,
> >> > > > not for comparisons with static strings.
> >> > > > > Regards,
> >> > > > s/zero out the last byte/zero out the last zend_ulong I'd like
> >> > > > to add another issue with relying on the hash for this
> >> > which
> >> > > I
> >> > > > just remembered: We currently always set the top bit of the
> >> > > > hash for strings (see
> >> > > http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351),
> >> > > > in order to ensure that hashes are never zero. This makes the
> >> > > > hash
> >> > > non-
> >> > > > unique.
> >> > > > Nikita
> >> > > Yeah... I missed the top bit set, but I think it could be solved
> >> > > somehow.
> >> > Actually setting the top bit is not an issue. The transformation
> >> > unicity shall be ensured for the tailing bytes only which are
> >> > shifted to the left maximum 7(on 64bits)/3(on 32bits) positions; so
> >> > they will never reach the top byte and setting the top bit will not
> >> > harm in terms of tailing bytes.
> >> >
> >> > > Imho the strongest reason against the patch is the dependency
> >> > > between the hash function and the key validation method but I
> >> > > have the impression that this is not be the most dirty hack
> >> > > deployed ever in
> >> > PHP;
> >> > > and I don't know very well what is the degree of compromise
> >> > > acceptable for 1% speedup :-)
> >> > Otherwise the concerns about the quality of the hash function and
> >> > the intention to change it with SipHash or something else, I am
> >> > afraid it will not bring much in terms of collisions reductions;
> >> > the issue here is the fact that hash bucket selection uses only the
> >> > last few bits of the hash value. Not sure how much could help here
> SipHash...
> >> > Just for defending the technical feasibility of this POC I think
> >> > that injective transformation for last tailing bytes could be
> >> > deployed over theoretically any hash function.
> >> >
> >> > Thanks,
> >> > Bogdan
> >> >
> >> > > On the other hand, your idea to zero the last zend_ulong in the
> >> > > zend_string looks nice. There could be an additional potential
> >> > > gain
> >> > also
> >> > > from lightweight implementation of memcmp(). I will give it a try
> >> > > to
> >> > see
> >> > > if it gets better results than the current proposal.
> >> > > I assume all zend_string allocation functions are located in
> >> > > zend_string.h, correct?
> >> > >
> >> > > Thanks,
> >> > > Bogdan
> >> I worked a little bit of adapting the zend string allocation
> >> functions for ensuring long size alignment for both Zend and standard
> >> allocation, and for zeroing out the end of the strings.
> >> The sad surprise was to discover that there are other >100
> >> situations, in extensions mainly, where the zend strings are
> >> truncated by directly inserting '0'
> >> terminators in the middle of strings; all these situations needs to
> >> be revised for zeroing out the entire memory word enclosing the end
> >> of the string.
> >> It will probably take a while for revising most of these places
> >> before having a correct real life run for being able to validate the
> >> POC.
> >> Optimistically speaking, supposing POC will show motivating results,
> >> we will probably need to enforce the Zend interface with respect to
> >> string truncation - at least a getter/setter pair for ZSTR_LEN where
> >> the setter should ensure also synchronizing the string termination.
> >> I suppose it will be an important effort for the community and I am
> >> not sure how much willingness will be in it. On the other side, the
> >> originally proposed solution does not have all this drawbacks and it
> >> offers 1% speedup by a localized change which could be easily
> >> reverted in the case we will get better results by a better string
> >> alignment in future.
> >>
> >> Thanks,
> >> Bogdan
> > I finalized the POC for ensuring strings granularity to zend_long
> > size; you can find it here:
> > https://gist.github.com/bogdanandone/cfde764f64137b57ce2a6f4a9b52e757
> > Unfortunately, the effort to zero-out the end of strings generates an
> > overhead higher than the gain incurred by simplified
> > zend_memcpy_aligned
> > memcpy() & memcmp() implementations - I defined a simpler
> > zend_memcpy_aligned() function used in zend_string.h and
> > zend_ulong_memeq() used in zend_hash.c inherited from my original place.
> > Overall I see ~1% loss on Worpress on my platform. I didn't spend time
> > in fine tuning each place needing changes but I don't expect a real
> > difference.
> > Probably this is not the way to go and I finalize with a last buzz for
> > the original patch proposal :)
> 
> I thought, "Oh crap" when I saw your message, after you went to all that
> trouble!  I think there's a simple way to check the trailing bytes, but I 
> thought
> Valgrind would complain about it.  However, just about 18 hours before your
> mail, I found out that's not the case, so it seems fine. :-O I'll come back 
> to this
> last...
> 
> First, I wanted to thank you for the original idea -- because that led me to 
> look
> into PHP 7 hash stuff because I thought I wasn't familiar with enough of it.
> Turned out I knew more than I realized, but I hope now my knowledge is much
> more thorough...  Just because of your e-mail. :^)
> 
> At first, I thought of, "Hey, SSE compares!" (starting with length, into the 
> string,
> at once).  Ignoring even more alignment/trailing issues, I tested standalone
> (using 16-1xx chars, Haswell).  But _mm_empeq_epi8 +
> _mm_movemask_epi8 were exactly the same as 64-bit scalar (don't think I tried
> ints/-m32).  The pmovmskb latency I guess kills any gain from checking 
> 16-at-a-
> time.
> 
> I guess for longer strings, could do 32 (64?) byte chunks, with 2 (4?) 
> cmpeq's +
> por + movemask to get closer to what memcmp() would do...?
> 
> Yeah, checking memcmp() standalone, the call overhead is high for common,
> short strings.  I'm guessing that's where most of the 1% came from?
Your feeling is right. My solution targets scenarios with predominant short 
keys. 
What I observed on real life workloads like Wordpress is that hash table keys 
are usually short strings: I did a profiling on WP one year ago and if I 
remember correctly, ~%80 of keys have sizes < 32bytes; few isolated cases of 
keys with sizes >200bytes.
For short strings the time spent in memcmp() in dealing with trailing bytes is 
much higher than the two-three loops for copying 8bytes chunks. This has also 
consequences at branch prediction level as trailing bytes involves more 
conditional branches which are more miss-predicted that loop related branches.
My previous impression was that these facts above explain the 1% gain but, 
after the painful experiment of trying to round up the string sizes, I should 
say that part of the gain could be explained also by the fact that lazy strings 
involve less data accesses with some reduction of data misses.
My initial thought was also to enable SSE based implementation but, as you also 
observed, it does not worth on short strings.


> 
> 
> Then, after learning about some other hash stuff, I came up with some ideas
> for a *new* HashTable implementation in PHP! :^)  I hope to start and have a
> working version shortly (see how that works, before further improvements).
> I think I'll send a new message about it first (I don't think anyone else is
> planning any major changes, though).
> 
> 
> Now, about your original "lazy" idea, specifically changing the hash
> function to xor the trailing bytes with the hash value.  I think there's a
> major problem: Consider a "specially" (easily!) crafted string where the
> trailing bytes can clear/set almost all hash bits.  So almost the entire
> hash value can be set by those trailing bytes. :-O  This may not be a
> problem if using a seeded hash function; not sure.
Not sure I understood your point. 
For any string there is always a combination of trailing bytes for 
clearing/setting all hash bytes; this combination will be unique for any 
string. But I don't see any problem if this is not a common situation that will 
generate collisions. Your statement applies also to the current hash function, 
I think.
An example would probably help...

> 
> 
> Finally, the [branchless] way to check trailing bytes instead?  A few weeks
> ago, I thought of using xor to compare, then shift out the unwanted bytes
> (left/right for little/big endian, I guess).  But Valgrind will complain
> about about the uninitialized parts, I thought.  Until I came across some
> info. [1] Sat. night suggesting that Valgrind tracks values on a *bit*
> level.  So I had to try it, outside of PHP, and it worked fine!
> 
> [1]
> https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech
> /general/full_papers/seward/seward_html/usenix2005.html
> 
> Here's what I thought, which could even be run uncondtionally, without
> checking for trailing bytes, since there's always another long-sized chunk:
> 
> mov (str1), %r
> xor (str2), %r
> and $7, %ecx    # % 8
> shl $3, %ecx    # * 8
> xor %any, %any  # clear flags, for free, in case %cl is 0
> shl %cl, %r
> jne ...
> 
The idea is nice but I think it needs additional tuning.
I assume %ecx keeps the string size, right? If this is true, supposing the 
trailing bytes count is 3, the code will shift out 3 bytes and it will return 
the comparison result for 5 bytes which is not correct; I think more 
instructions will be involved for doing the right shift.
Otherwise, it seems that your proposal works only for the trailing bytes. You 
cannot apply the same steps for the full chunks; in this case also only the 
number of trailing bytes will be compared in each chunk. Not talking about the 
fact that the code should still preserve the chuck index from an iteration to 
another, decrement it and so on. For being short you still have to deal with 
two parts of the implementation: one for entire chunks and one for trailing 
bytes and a conditional logic for jumping from one to another.
 
> Should only cost the latency of the variable shift (slower than AMD :-P).
> 
> But the compilers aren't smart enough (?) to use the shift result for the
> jump, and put an extra "test %r, %r."  Well it will probably fuse with the
> jump, so no extra micro-ops anyway. :-)
> 
> And my "ideal" instructions cause false errors with Valgrind anyway, when
> the shift moves uninitialized bits to the carry flag.  (%eflags is tracked
> as a whole.)  Adding a free "clear carry" doesn't help...
> 
> So I guess a check for trailing bytes is needed in C.  But same idea, with
> the full background. ;-)
Why bothering about carry flag? I assume Valgrind cries because the code 
accesses bytes that are not allocated and that some sanity allocation overhead 
is needed as you say, probably not at the complexity I've tried to deploy. By 
the way, the statement that there's always another long-sized chunk is not true 
for the cases where strings are not allocated with the zend allocator but with 
the system malloc().

Thanks,
Bogdan

> 
> Thoughts?
> 
> > Thanks,
> > Bogdan
> 
> - Matt


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

RE: [PHP-DEV] Lazy keys comparison during hash lookups

Reply via email to