Hi pgsql-hackers,

Following the patch to implement strpos with Boyer-Moore-Horspool, it was proposed that we bring BMH to LIKE as well.
Original thread: https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us

I'm a first-time hacker and I found this task on the TODO list. It seemed interesting and isolated enough, but after looking at the code in like_match.c it's clearly a non-trivial task. How desirable is this feature?

To begin answering that question, I did some initial benchmarking with an English text corpus to find out how much faster BMH (Boyer-Moore-Horspool) would be compared to the current LIKE implementation. The result: BMH can be up to 20% faster than LIKE, but for short substrings it is roughly comparable or slower.

Here are the results visualized:
http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png

Data attached, and a description of the benchmark below.

I'd love to hear your thoughts:
- Is the benchmark valid? Am I missing anything?
- What conclusions can we draw from the results?
- What action should we take from here?
- Is this the right mailing list for this discussion?

Thank you!

- Karan, pg community newbie

--- Benchmark Details ---

The easiest way to approximate the potential speedup from BMH is to compare the performance of the following queries:

1. SELECT count(*) FROM test_table WHERE text LIKE '%substring%';
2. SELECT count(*) FROM test_table WHERE strpos(text, 'substring') > 0;

We expect the strpos query to outperform the LIKE query, since strpos is implemented with BMH. I loaded the database with chunks of English text from the Bible, a commonly used corpus for simple text analysis; the exact setup procedure is described at the end of this email. I ran a few queries using short substrings and long substrings, the choice of which is discussed below.

## Choice of test data

BMH is known to be particularly fast on English text, so I loaded the test table with 5,000-character chunks of text from the Bible.

BMH is expected to be slower for small substrings, where the overhead of building the skip table may not be justified. For longer substrings, BMH may outperform LIKE thanks to the skip table: in the best case, when a text character does not occur in the substring at all, the search window can jump ahead by the full length of the substring, skipping what would otherwise be a lot of comparisons. (A rough sketch of this idea follows after the hardware details below.)

I chose the short (< 15 character) substrings to be the most popular bigrams and trigrams in the text. I chose the long (10-250 character) substrings at random, and took prefixes and suffixes of varying length to see how that affects performance. Prefixes alone are not sufficient because BMH compares from the right end of the substring, unlike the LIKE algorithm. The full strings are included in the attached Excel file.

## Database setup

Download the corpus (englishTexts) from http://www.dmi.unict.it/~faro/smart/corpus.php

Create the table:

CREATE TABLE test_table (text text);

Generate the rows for insertion (chunks of 5,000 characters):

wget http://ctrl-c.club/~ksikka/pg/gen_rows.py
python gen_rows.py path/to/englishTexts/bible.txt > path/to/output/file

Load the table:

COPY test_table (text) FROM '/absolute/path/to/previous/output/file' WITH ENCODING 'UTF-8';

I ran the COPY command 21 times to amplify the performance differences. Queries were timed with psql's \timing feature; each query was run five times and the mean reported.

## Hardware

Apple MacBook Air (Mid 2012)
CPU: 1.8 GHz Intel Core i5
RAM: 4 GB 1600 MHz DDR3
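To make the skip-table reasoning above concrete, here is a rough, illustrative Python sketch of Horspool's search loop. This is only a toy version written for this email (the function name and structure are mine, not the C code behind strpos), but it shows why long substrings benefit so much more than short ones:

def horspool_find(text, pattern):
    """Illustrative Horspool search: index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Skip table: for each character of the pattern except the last one,
    # record how far the window may shift when that character sits in the
    # text under the pattern's last position. Characters that do not occur
    # in the pattern at all allow a shift of the full pattern length m.
    skip = {}
    for idx, c in enumerate(pattern[:-1]):
        skip[c] = m - 1 - idx
    i = 0
    while i <= n - m:
        # Compare from the right end of the pattern; this is why suffixes,
        # and not just prefixes, matter when picking test substrings.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i
        i += skip.get(text[i + m - 1], m)
    return -1

# e.g. horspool_find('hello world', 'world') == 6
#      horspool_find('hello', 'xyz') == -1

With a 250-character pattern, a single mismatch on a character that never occurs in the pattern lets the window jump 250 characters at once, which is where the large wins on long substrings should come from; with a 2-3 character bigram or trigram the maximum jump is tiny, so building the table is mostly overhead.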
import io
import re
import sys
import random

## Configuration ##
CHUNK_LEN = 5000

# *fix, i.e. prefix, suffix: lengths of the substrings to generate.
starfix_lengths = [10, 50, 90, 130, 170, 210, 250]


def copyEscape(ustring):
    """
    From the COPY docs:

        Backslash characters (\\) can be used in the COPY data to quote data
        characters that might otherwise be taken as row or column delimiters.
        In particular, the following characters must be preceded by a
        backslash if they appear as part of a column value: backslash itself,
        newline, carriage return, and the current delimiter character.

    :LINK: https://www.postgresql.org/docs/9.1/static/sql-copy.html
    """
    return re.sub(r'([\\\r\n\t])', r'\\\1', ustring, flags=re.UNICODE)


if __name__ == '__main__':
    input_filename = sys.argv[1]
    input_stream = io.open(input_filename, mode='r', encoding='Latin-1')

    # An optional second argument switches to prefix/suffix generation
    # instead of emitting COPY-ready rows.
    prefix_mode = len(sys.argv) > 2 and sys.argv[2] == '--prefix'
    suffix_mode = len(sys.argv) > 2 and sys.argv[2] == '--suffix'

    # Read the corpus into CHUNK_LEN-character chunks.
    chunks = []
    chunk = None
    while chunk is None or len(chunk) == CHUNK_LEN:
        chunk = input_stream.read(CHUNK_LEN)
        if chunk == '':
            break
        chunks.append(chunk)

    if prefix_mode or suffix_mode:
        # Pick a random chunk and a random seed position, then print one
        # substring per configured length anchored at that position.
        rand_chunk = random.choice(chunks)
        rand_start = 0
        rand_end = len(rand_chunk) - 1
        if prefix_mode:
            rand_end -= starfix_lengths[-1]
        if suffix_mode:
            rand_start += starfix_lengths[-1]
        seed_i = random.randint(rand_start, rand_end)
        for i in starfix_lengths:
            if prefix_mode:
                print repr(rand_chunk[seed_i : seed_i + i + 1])
            if suffix_mode:
                print repr(rand_chunk[seed_i - i : seed_i + 1])
    else:
        # Default mode: one escaped chunk per line, ready for COPY ... FROM.
        print u'\n'.join([copyEscape(c) for c in chunks]).encode('utf-8')
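(Usage note on the script above: with no second argument it prints COPY-ready rows, as used in the database setup. Passing --prefix or --suffix as a second argument instead prints prefix/suffix substrings of the lengths in starfix_lengths (10-250 characters) taken from a random chunk, e.g.

python gen_rows.py path/to/englishTexts/bible.txt --prefix

which matches the varied-length prefixes and suffixes described under "Choice of test data".)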
Attachment: like-strpos-comparison.xlsx (MS-Excel 2007 spreadsheet)