On Fri, 2021-07-23 at 17:42 +0200, Pavel Stehule wrote:
> čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchamp...@vmware.com> napsal:
> > 
> > Pavel, I'd be interested to see what your benchmarks find with this
> > code. Does this fix the original issue for you?
> 
> I can confirm that the original issue is fixed. 

Great!

> I tested performance
> 
> I had three data sets
> 
> 1. typical data - mix ascii and utf characters typical for czech
> language - 25K lines - there is very small slowdown 2ms from 24 to
> 26ms (stored file of this result has 3MB)
> 
> 2. the worst case - this reports has only emoji 1000 chars * 10K rows
> - there is more significant slowdown - from 160 ms to 220 ms (stored
> file has 39MB)

I assume the stored file size has grown with this patch, since we're
now printing the correct number of spacing/border characters?

> 3. a little bit of obscure datasets generated by \x and select * from
> pg_proc - it has 99K lines - and there are a lot of unicode
> decorations (borders). The line has 17K chars. (the stored file has
> 1.7GB)
> In this dataset I see a slowdown from 4300 to 4700 ms.

These results are lining up fairly well with my profiling. To isolate
the effects of the new algorithm (as opposed to printing time) I
redirected to /dev/null:

    psql postgres -c "select repeat(unistr('\u115D'), 10000000);" > /dev/null

This is what I expect to be the worst case for the new patch: a huge
string consisting of nothing but \u115D, which is in the first interval
and therefore takes the maximum number of iterations during the binary
search.

For that command, the wall time slowdown with this patch was about
240ms (from 1.128s to 1.366s, or 21% slower). Callgrind shows an
increase of 18% in the number of instructions executed with the
interval table patch, all of it coming from PQdsplen (no surprise).
PQdsplen itself has a 36% increase in instruction count for that run.

I also did a microbenchmark of PQdsplen (code attached, requires Google
Benchmark [1]). The three cases I tested were standard ASCII
characters, a smiley-face emoji, and the worst-case \u115F character.

Without the patch:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
...
BM_Ascii_mean           4.97 ns         4.97 ns            5
BM_Ascii_median         4.97 ns         4.97 ns            5
BM_Ascii_stddev        0.035 ns        0.035 ns            5
...
BM_Emoji_mean           6.30 ns         6.30 ns            5
BM_Emoji_median         6.30 ns         6.30 ns            5
BM_Emoji_stddev        0.045 ns        0.045 ns            5
...
BM_Worst_mean           12.4 ns         12.4 ns            5
BM_Worst_median         12.4 ns         12.4 ns            5
BM_Worst_stddev        0.038 ns        0.038 ns            5

With the patch:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
...
BM_Ascii_mean           4.59 ns         4.59 ns            5
BM_Ascii_median         4.60 ns         4.60 ns            5
BM_Ascii_stddev        0.069 ns        0.068 ns            5
...
BM_Emoji_mean           11.8 ns         11.8 ns            5
BM_Emoji_median         11.8 ns         11.8 ns            5
BM_Emoji_stddev        0.059 ns        0.059 ns            5
...
BM_Worst_mean           18.5 ns         18.5 ns            5
BM_Worst_median         18.5 ns         18.5 ns            5
BM_Worst_stddev        0.077 ns        0.077 ns            5

So an incredibly tiny improvement in the ASCII case, which is
reproducible across multiple runs and not just a fluke (I assume
because the code is smaller now and has better cache line
characteristics?). A ~90% slowdown for the emoji case, and a ~50%
slowdown for the worst-performing characters. That seems perfectly
reasonable considering we're talking about dozens of nanoseconds.

> 9% looks too high, but in absolute time it is 400ms for 99K lines and
> very untypical data, or 2ms for more typical results., 2ms are
> nothing (for interactive work). More - this is from a pspg
> perspective. In psql there can be overhead of network, protocol
> processing, formatting, and more and more, and psql doesn't need to
> calculate display width of decorations (borders), what is the reason
> for slowdowns in pspg.

Yeah. Considering the alignment code is for user display, the absolute
performance is going to dominate, and I don't see any red flags so far.
If you're regularly dealing with unbelievably huge amounts of emoji, I
think the amount of extra time we're seeing here is unlikely to be a
problem. If it is, you can always turn alignment off. (Do you rely on
horizontal alignment for lines with millions of characters, anyway?)

Laurenz, Michael, what do you think?

--Jacob

[1] https://github.com/google/benchmark
/*
 * Copyright 2021 VMware, Inc.
 * SPDX-License-Identifier: PostgreSQL
 */

#include <benchmark/benchmark.h>

#include <libpq-fe.h>

static void BM_Ascii(benchmark::State& state)
{
    int enc = pg_char_to_encoding("UTF8");

    for (auto _ : state) {
        PQdsplen("a", enc);
    }
}

BENCHMARK(BM_Ascii);

static void BM_Emoji(benchmark::State& state)
{
    int enc = pg_char_to_encoding("UTF8");

    for (auto _ : state) {
        PQdsplen(u8"\U0001F600", enc);
    }
}

BENCHMARK(BM_Emoji);

static void BM_Worst(benchmark::State& state)
{
    int enc = pg_char_to_encoding("UTF8");

    for (auto _ : state) {
        PQdsplen(u8"\u115F", enc);
    }
}

BENCHMARK(BM_Worst);

BENCHMARK_MAIN();
#
# Copyright 2021 VMware, Inc.
# SPDX-License-Identifier: PostgreSQL
#

CC := $(CXX)

CPPFLAGS := -I$(shell pg_config --includedir)
LDFLAGS := -pthread -L$(shell pg_config --libdir)
LDLIBS := -lbenchmark -lpq

main: main.o

.PHONY: clean
clean:
	rm -f main main.o

.PHONY: check
check: main
	sudo cpupower frequency-set --governor performance
	./main --benchmark_repetitions=5
	sudo cpupower frequency-set --governor powersave

Reply via email to