On Thu, Apr 23, 2015 at 3:26 PM, Ravi Kerur <rkerur at gmail.com> wrote:
> > > On Thu, Apr 23, 2015 at 7:00 AM, Bruce Richardson < > bruce.richardson at intel.com> wrote: > >> On Thu, Apr 23, 2015 at 06:53:44AM -0700, Ravi Kerur wrote: >> > On Thu, Apr 23, 2015 at 2:23 AM, Ananyev, Konstantin < >> > konstantin.ananyev at intel.com> wrote: >> > >> > > >> > > >> > > > -----Original Message----- >> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce >> Richardson >> > > > Sent: Thursday, April 23, 2015 9:12 AM >> > > > To: Wodkowski, PawelX >> > > > Cc: dev at dpdk.org >> > > > Subject: Re: [dpdk-dev] [PATCH] Implement memcmp using AVX/SSE >> instructio >> > > > >> > > > On Thu, Apr 23, 2015 at 09:24:52AM +0200, Pawel Wodkowski wrote: >> > > > > On 2015-04-22 17:33, Ravi Kerur wrote: >> > > > > >+/** >> > > > > >+ * Compare bytes between two locations. The locations must not >> > > overlap. >> > > > > >+ * >> > > > > >+ * @note This is implemented as a macro, so it's address should >> not >> > > be taken >> > > > > >+ * and care is needed as parameter expressions may be evaluated >> > > multiple times. >> > > > > >+ * >> > > > > >+ * @param src_1 >> > > > > >+ * Pointer to the first source of the data. >> > > > > >+ * @param src_2 >> > > > > >+ * Pointer to the second source of the data. >> > > > > >+ * @param n >> > > > > >+ * Number of bytes to compare. >> > > > > >+ * @return >> > > > > >+ * true if equal otherwise false. >> > > > > >+ */ >> > > > > >+static inline bool >> > > > > >+rte_memcmp(const void *src_1, const void *src, >> > > > > >+ size_t n) __attribute__((always_inline)); >> > > > > You are exposing this as public API, so I think you should follow >> > > > > description bellow or not call this _memcmp_ >> > > > > >> > > > > int memcmp(const void *s1, const void *s2, size_t n); >> > > > > >> > > > > The memcmp() function returns an integer less than, equal to, or >> > > greater >> > > > > than >> > > > > zero if the first n bytes of s1 is found, >> respectively, >> > > to be >> > > > > less than, to >> > > > > match, or be greater than the first n bytes of s2. >> > > > > >> > > > >> > > > +1 to this point. >> > > > >> > > > Also, if I read your quoted performance numbers in your earlier mail >> > > correctly, >> > > > we are only looking at a 1-4% performance increase. Is the >> additional >> > > code to >> > > > maintain worth the benefit? >> > > >> > > Yep, same thought here, is it really worth it? >> > > Konstantin >> > > >> > > > >> > > > /Bruce >> > > > >> > > > > -- >> > > > > Pawel >> > > >> > >> > I think I haven't exploited every thing x86 has to offer to improve >> > performance. I am looking for inputs. Until we have exhausted all >> avenues I >> > don't want to drop it. One thing I have noticed is that bigger key size >> > gets better performance numbers. I plan to re-run perf tests with 64 and >> > 128 bytes key size and will report back. Any other avenues to try out >> > please let me know I will give it a shot. >> > >> > Thanks, >> > Ravi >> >> Hi Ravi, >> >> are 128 byte comparisons realistic? An IPv6 5-tuple with double vlan tags >> is still >> only 41 bytes, or 48 with some padding added? >> While for a memcpy function, you can see cases where you are going to >> copy a whole >> packet, meaning that sizes of 128B+ (up to multiple k) are realistic, >> it's harder >> to see that for a compare function. >> >> In any case, we await the results of your further optimization work to >> see how >> that goes. >> >> > Actually I was looking at wrong numbers. Wrote couple of sample programs and found that memory comparison with AVX/SSE takes almost 1/3rd less cpu ticks when compared with regular memcmp. For 16bytes, regular memcmp Time: 276 ticks (3623188 memcmp/tick) Time: 276 ticks (3623188 memcmp/tick) memcmp with AVX/SSE Time: 86 ticks (11627906 memcmp/tick) Time: 87 ticks (11494252 memcmp/tick) For 32bytes, regular memcmp Time: 301 ticks (3322259 memcmp/tick) Time: 302 ticks (3311258 memcmp/tick) memcmp with AVX/SSE Time: 87 ticks (11494252 memcmp/tick) Time: 88 ticks (11363636 memcmp/tick) For 64bytes, regular memcmp Time: 376 ticks (2855696 memcmp/tick) 0 Time: 377 ticks (2848121 memcmp/tick) 0 memcmp with AVX/SSE Time: 110 ticks (9761289 memcmp/tick) 0 Time: 110 ticks (9761289 memcmp/tick) 0 With some modifications to original patch, and looking through test_hash_perf which has statistics for every test (Add on empty, Add update, Lookup) it performs, in almost all categories (16, 32, 48 and 64 bytes) AVX/SSE beats regular memcmp. Please note that the time measured in test_hash_perf is for hash functions (jhash and hash_crc) and memcmp is just a small part of the hash functionality. I will send modified patch later on. Thanks, Ravi > Hi Bruce, > > Couple of things I am planning to try > > 1. Use _xor_ and _testz_ instructions for comparison instead of _cmpeq_ > and _mask_. > 2. I am using unaligned loads, not sure about the penalty, I plan to try > with aligned loads if address is aligned and compare results. > > Agreed that with just L3 or even if we go with L2 + L3 + L4 tuples it will > not exceed 64 bytes, 128 bytes is just a stretch for some weird MPLSoGRE > header formats. > > My focus is currently on improving performance for < 64 bytes and < 128 > bytes key lengths only. > > Thanks, > Ravi > > Regards, >> /Bruce >> > >