Re: speed up verifying UTF-8

Heikki Linnakangas Thu, 03 Jun 2021 12:09:16 -0700

On 03/06/2021 17:33, Greg Stark wrote:

3. It's probably cheaper to perform the HAS_ZERO check just once on (half1 | half2). We have to compute (half1 | half2) anyway.


Wouldn't you have to check (half1 & half2)?

Ah, you're right of course. But & is not quite right either: it will give false positives. That's ok from a correctness point of view here, because we then fall back to checking byte by byte, but I don't think it's a good tradeoff.
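To make the two failure modes concrete, here is a minimal sketch. It assumes HAS_ZERO is the classic subtract-and-mask zero-byte test (the definition below is an assumption, not copied from the patch), and the helper names are hypothetical. The OR variant misses a zero byte that is masked by the other half, while the AND variant reports a zero byte whenever two nonzero bytes happen to share no bits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: nonzero iff at least one byte of v is 0x00. */
#define HAS_ZERO(v) \
    ((((v)) - UINT64_C(0x0101010101010101)) & ~(v) & \
     UINT64_C(0x8080808080808080))

/* Reference answer: test each half separately. */
static inline int
zero_in_either(uint64_t half1, uint64_t half2)
{
    return HAS_ZERO(half1) != 0 || HAS_ZERO(half2) != 0;
}

/* Single check on the OR of the halves (can miss zero bytes). */
static inline int
zero_via_or(uint64_t half1, uint64_t half2)
{
    return HAS_ZERO(half1 | half2) != 0;
}

/* Single check on the AND of the halves (can report zeros that are not there). */
static inline int
zero_via_and(uint64_t half1, uint64_t half2)
{
    return HAS_ZERO(half1 & half2) != 0;
}

static void
check_examples(void)
{
    /* half1 contains one zero byte, half2 is all 'B' (0x42). */
    uint64_t with_zero = UINT64_C(0x4141414141410041);
    uint64_t all_b = UINT64_C(0x4242424242424242);

    /* No zero bytes anywhere, but 'A' (0x41) & ' ' (0x20) == 0x00. */
    uint64_t all_a = UINT64_C(0x4141414141414141);
    uint64_t all_sp = UINT64_C(0x2020202020202020);

    assert(zero_in_either(with_zero, all_b) == 1);
    assert(zero_via_or(with_zero, all_b) == 0);    /* false negative */

    assert(zero_in_either(all_a, all_sp) == 0);
    assert(zero_via_and(all_a, all_sp) == 1);      /* false positive */
}
```

Calling check_examples() passes all four assertions. A false positive like the last one only costs a trip through the byte-by-byte path, but text mixing letters and spaces would probably trigger it often.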


I think this works, however:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
        uint64          half1,
                                half2,
                                highbits_set;
        uint64          x1,
                                x2;
        uint64          x;

        if (len >= 2 * sizeof(uint64))
        {
                memcpy(&half1, s, sizeof(uint64));
                memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

                /* Check if any bytes in this chunk have the high bit set. */
                highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
                if (highbits_set)
                        return 0;

                /*
                 * Check if there are any zero bytes in this chunk.
                 *
                 * First, add 0x7f to each byte. This sets the high bit in each
                 * byte, unless it was a zero. We already checked that none of
                 * the bytes had the high bit set previously, so the max value
                 * each byte can have after the addition is 0x7f + 0x7f = 0xfe,
                 * and we don't need to worry about carrying over to the next
                 * byte.
                 */
                x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
                x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

                /*
                 * Then check that the high bit is set in each byte of both
                 * halves. A zero byte in either half clears its bit in the AND.
                 */
                x = (x1 & x2);
                x &= UINT64CONST(0x8080808080808080);
                if (x != UINT64CONST(0x8080808080808080))
                        return 0;

                return 2 * sizeof(uint64);
        }
        else
                return 0;
}
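
For illustration, a caller might use such a check roughly as sketched below. This is only a sketch under assumptions: chunk_is_nonzero_ascii and ascii_prefix_len are hypothetical names, uint64_t stands in for uint64, and the two sums are combined with & so that a zero byte in either half is caught. The fast path consumes 16 bytes at a time, and anything it rejects, plus any short tail, is handled byte by byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)
#define LOW_7F    UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Hypothetical helper: 1 if all 16 bytes at s are in the range 0x01..0x7f. */
static inline int
chunk_is_nonzero_ascii(const unsigned char *s)
{
    uint64_t half1;
    uint64_t half2;

    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + sizeof(half1), sizeof(half2));

    /* Any high bit set means a non-ASCII byte. */
    if ((half1 | half2) & HIGH_BITS)
        return 0;

    /*
     * Every byte is now <= 0x7f, so adding 0x7f cannot carry. The high bit
     * of each sum byte is set iff the input byte was nonzero, and the AND
     * keeps it only if that holds in both halves.
     */
    return (((half1 + LOW_7F) & (half2 + LOW_7F)) & HIGH_BITS) == HIGH_BITS;
}

/* Hypothetical caller: length of the leading run of nonzero ASCII bytes. */
static int
ascii_prefix_len(const unsigned char *s, int len)
{
    int done = 0;

    /* Fast path: skip whole 16-byte chunks while they pass the check. */
    while (len - done >= 16 && chunk_is_nonzero_ascii(s + done))
        done += 16;

    /* Slow path: walk to the offending byte, or finish the short tail. */
    while (done < len && s[done] != 0 && s[done] < 0x80)
        done++;

    return done;
}
```

With a 40-byte buffer of 'a', ascii_prefix_len returns 40. Setting byte 20 to zero or to 0xe2 makes it return 20.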

- Heikki
