Re: [perf-discuss] Followup on microoptimizing ip_addr_match()

Bart Smaalders Mon, 18 Jun 2007 13:10:02 -0700

Dan McDonald wrote:

(This time, using e-mail instead of the web form...)


Hello again!

After what suggestions I saw (all on networking-discuss...), I put together a
multiple-choice question.

Consider the _LITTLE_ENDIAN section in this code fragment, which is known to
be an improvement on SPARC:

===================== (Cut up to and including here.) =====================

#ifdef _LITTLE_ENDIAN
/* For little-endian, we really need to think about this. */

#if 1
/* Clever math - thanks Nico! */
#define PREFIX_LOW32(pfxlen) \
        ((((uint8_t)((0xFF00 >> ((pfxlen) & 0x7)))) << ((pfxlen) & ~0x7)) | \
            (0xFFFFFF >> ((31 - (pfxlen)) & ~0x7)))
#endif

#if 0
/* ntohl() the big-endian solution */
#define PREFIX_LOW32(pfxlen) ntohl(0xFFFFFFFF << (32 - (pfxlen)))
#endif

#if 0
/* or use a table lookup */
static uint32_t masks[] = {
        0x00000000, 0x00000080, 0x000000C0, 0x000000E0,
        0x000000F0, 0x000000F8, 0x000000FC, 0x000000FE,
        0x000000FF, 0x000080FF, 0x0000C0FF, 0x0000E0FF,
        0x0000F0FF, 0x0000F8FF, 0x0000FCFF, 0x0000FEFF,
        0x0000FFFF, 0x0080FFFF, 0x00C0FFFF, 0x00E0FFFF,
        0x00F0FFFF, 0x00F8FFFF, 0x00FCFFFF, 0x00FEFFFF,
        0x00FFFFFF, 0x80FFFFFF, 0xC0FFFFFF, 0xE0FFFFFF,
        0xF0FFFFFF, 0xF8FFFFFF, 0xFCFFFFFF, 0xFEFFFFFF
};
#define PREFIX_LOW32(pfxlen) (masks[pfxlen])
#endif

/*
 * sleazy prefix-length-based compare.
 * another inlining candidate..
 */
boolean_t
ip_addr_match(uint32_t *addr1, int pfxlen, uint32_t *addr2)
{
        while (pfxlen >= 32) {
                if (*addr1 != *addr2)
                        return (B_FALSE);
                addr1++;
                addr2++;
                pfxlen -= 32;
        }
        return (pfxlen == 0 ||
            ((*addr1 ^ *addr2) & PREFIX_LOW32(pfxlen)) == 0);
}

===================== (Cut up to and including here.) =====================
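
One way to sanity-check the candidates above is to confirm that the arithmetic
variant and the swap-based variant produce the same mask for every partial
prefix length.  The following is only a sketch: the names swap32(), mask_swap(),
mask_math() and mask_mismatches() are hypothetical and not part of the fragment,
swap32() is a portable stand-in for ntohl() on a little-endian host, and the
arithmetic uses unsigned constants, a small departure from the macro as posted.

```c
#include <assert.h>
#include <stdint.h>

/* Portable byte swap, standing in for ntohl() on a little-endian host. */
static uint32_t
swap32(uint32_t x)
{
        return ((x >> 24) | ((x >> 8) & 0x0000FF00u) |
            ((x << 8) & 0x00FF0000u) | (x << 24));
}

/* Swap-based variant: build the mask big-endian, then swap the bytes. */
static uint32_t
mask_swap(unsigned int pfxlen)
{
        /* A shift by 32 is undefined, so pfxlen == 0 is excluded. */
        assert(pfxlen >= 1 && pfxlen <= 31);
        return (swap32(0xFFFFFFFFu << (32 - pfxlen)));
}

/* Arithmetic variant, written with unsigned constants throughout. */
static uint32_t
mask_math(unsigned int pfxlen)
{
        uint32_t partial;

        assert(pfxlen <= 31);
        /* High bits of the one partially covered byte. */
        partial = (uint8_t)(0xFF00u >> (pfxlen & 0x7u));
        /* Move it into place and OR in the fully covered low bytes. */
        return ((partial << (pfxlen & ~0x7u)) |
            (0xFFFFFFu >> ((31 - pfxlen) & ~0x7u)));
}

/* Count prefix lengths in 1..31 where the two variants disagree. */
static int
mask_mismatches(void)
{
        unsigned int pfxlen;
        int bad = 0;

        for (pfxlen = 1; pfxlen <= 31; pfxlen++) {
                if (mask_math(pfxlen) != mask_swap(pfxlen))
                        bad++;
        }
        return (bad);
}
```

For a 9-bit prefix both functions give 0x000080FF, which is also entry 9 of the
lookup table above.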

And here's the original code:

===================== (Cut up to and including here.) =====================

/*
 * sleazy prefix-length-based compare.
 * another inlining candidate..
 */
boolean_t
ip_addr_match(uint8_t *addr1, int pfxlen, in6_addr_t *addr2p)
{
        int offset = pfxlen>>3;
        int bitsleft = pfxlen & 7;
        uint8_t *addr2 = (uint8_t *)addr2p;

        /*
         * and there was much evil..
         * XXX should inline-expand the bcmp here and do this 32 bits
         * or 64 bits at a time..
         */
        return ((bcmp(addr1, addr2, offset) == 0) &&
            ((bitsleft == 0) ||
                (((addr1[offset] ^ addr2[offset]) &
                    (0xff<<(8-bitsleft))) == 0)));
}

===================== (Cut up to and including here.) =====================


Experiments run using an IPsec and IKE test suite and using DTrace's FBT
indicate that the original function outperforms both the table lookup and the
"ntohl() the big-endian solution".  I have my suspicions about our in-lining
performance of ntohl(), but that's another topic.

I haven't run Nico's "clever math" solution yet, but would like to know what
the peanut gallery thinks.

BTW, here are the two big buckets on an Opteron box with bcmp():

Number of calls == 410221.

             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           308602
             256 |@@@@@@@@@@                               99140

and with table-lookup:

Number of calls == 412346.

             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             292893
             256 |@@@@@@@@@@@                              117017


The ntohl() ones were worse than table-lookup.  Those two buckets account for
the vast majority (>99%) of the sampled calls to ip_addr_match().

The SPARC gains are bigger than the Opteron lossage.  Here's bcmp():

Number of calls == 398975.

             128 |@@@@@@@@@@@@@@@@@@                       177120
             256 |@@@@@@@@@@@@@@@@                         159558
             512 |@@@@@@                                   58487

and using the simple math-only solution:

Number of calls == 397934.

             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         319443
             256 |@@@@@@                                   60991
             512 |@                                        5972

So I'm not sure what to do.

Any clues are, as always, welcome!

Thanks,
Dan
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org



On x86, why not inline a bswap instruction?

- Bart
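
For readers who want to try the bswap suggestion, here is a rough sketch.  It
assumes a GCC- or Clang-compatible compiler, whose __builtin_bswap32() normally
compiles to a single bswap instruction on x86; whether the compiler used for
the measurements above offers an equivalent is not something this thread
establishes.  The names prefix_low32_bswap() and addr_match_sketch() are
hypothetical, and the final comparison is written with an explicit == 0.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a partial word, built big-endian and then byte-swapped. */
static inline uint32_t
prefix_low32_bswap(int pfxlen)
{
        /* The caller handles whole words and the pfxlen == 0 case. */
        assert(pfxlen > 0 && pfxlen < 32);
        return (__builtin_bswap32(0xFFFFFFFFu << (32 - pfxlen)));
}

/*
 * Word-at-a-time prefix compare for a little-endian host.  Both
 * addresses are assumed to be in network byte order in memory.
 * Returns 1 on match, 0 otherwise.
 */
static int
addr_match_sketch(const uint32_t *addr1, int pfxlen, const uint32_t *addr2)
{
        while (pfxlen >= 32) {
                if (*addr1 != *addr2)
                        return (0);
                addr1++;
                addr2++;
                pfxlen -= 32;
        }
        return (pfxlen == 0 ||
            ((*addr1 ^ *addr2) & prefix_low32_bswap(pfxlen)) == 0);
}
```

Whether this beats the table lookup on Opteron would still have to be measured
the same way as the other variants.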


--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts