Re: benchmarking Flex practices

John Naylor Thu, 01 Aug 2019 01:51:57 -0700

On Mon, Jul 29, 2019 at 10:40 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> John Naylor <john.nay...@2ndquadrant.com> writes:
>
> > The lexer returns UCONST from xus and UIDENT from xui. The grammar has
> > rules that are effectively:
>
> > SCONST { do nothing}
> > | UCONST { esc char is backslash }
> > | UCONST UESCAPE SCONST { esc char is from $3 }
>
> > ...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
> > conflicts, I added UIDENT to the %nonassoc precedence list to match
> > IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
> > there's a more principled way. I also added an unsigned char type to
> > the %union, but it worked fine on my compiler without it.
>
> I think it might be better to drop the separate "Uescape" production and
> just inline that into the calling rules, exactly per your sketch above.
> You could avoid duplicating the escape-checking logic by moving that into
> the str_udeescape support function.  This would avoid the need for the
> "uchr" union variant, but more importantly it seems likely to be more
> future-proof: IME, any time you can avoid or postpone shift/reduce
> decisions, it's better to do so.
>
> I didn't try, but I think this might allow dropping the %left for
> UESCAPE.  That bothers me because I don't understand why it's
> needed or what precedence level it ought to have.


I tried this, and removing the %left still gives me a shift/reduce
conflict, so I put some effort in narrowing down what's happening. If
I remove the rules with UESCAPE individually, I find that precedence
is not needed for Sconst -- only for Ident. I tried reverting all the
rules to use the original "IDENT" token and one by one changed them to
"Ident", and found 6 places where doing so caused a shift-reduce
conflict:

createdb_opt_name
xmltable_column_option_el
ColId
type_function_name
NonReservedWord
ColLabel

Due to the number of affected places, that didn't seem like a useful
avenue to pursue, so I tried the following:

-Making UESCAPE a reserved keyword or separate token type works, but
other keyword types don't work. Not acceptable, but maybe useful info.
-Giving UESCAPE an %nonassoc precedence above UIDENT works, even if
UIDENT is the lowest in the list. This seems the least intrusive, so I
went with that for v8. One possible downside is that UIDENT now no
longer has the same precedence as IDENT. Not sure if it matters, but
could we fix that contextually with "%prec IDENT"?

> > litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
> > former had be massaged to give error messages similar to HEAD. They're
> > not quite identical, but the position info is preserved. Some of the
> > functions I moved around don't seem to have any test coverage, so I
> > should eventually do some work in that regard.
>
> I don't terribly like the cross-calls you have between gram.y and scan.l
> in this formulation.  If we have to make these functions (hexval() etc)
> non-static anyway, maybe we should shove them all into scansup.c?

I ended up making them static inline in scansup.h since that seemed to
reduce the performance impact (results below). I cribbed some of the
surrogate pair queries from the jsonpath regression tests so we have
some coverage here. Diff'ing from HEAD to patch, the locations are
different for a couple cases (a side effect of the differen error
handling style from scan.l). The patch seems to consistently point at
an escape sequence, so I think it's okay to use that. HEAD, on the
other hand, sometimes points at the start of the whole string:

 select U&'\de04\d83d'; -- surrogates in wrong order
-psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair at
or near "U&'\de04\d83d'"
+psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair
 LINE 1: select U&'\de04\d83d';
-               ^
+                  ^
 select U&'\de04X'; -- orphan low surrogate
-psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair at
or near "U&'\de04X'"
+psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair
 LINE 1: select U&'\de04X';
-               ^
+                  ^

> > -Performance is very close to v6 with the information_schema and
> > pgbench-like queries with standard strings, which is to say also very
> > close to HEAD. When the latter was changed to use Unicode escapes,
> > however, it was about 15% slower than HEAD. That's a big regression
> > and I haven't tried to pinpoint why.
>
> I don't quite follow what you changed to produce the slower test case?
> But that seems to be something we'd better run to ground before
> deciding whether to go this way.

So "pgbench str" below refers to driving the parser with this set of
queries repeated a couple hundred times in a string:

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 'foobarbaz' WHERE
aid = 'foobarbaz';
SELECT abalance FROM pgbench_accounts WHERE aid = 'foobarbaz';
UPDATE pgbench_tellers SET tbalance = tbalance + 'foobarbaz' WHERE tid
= 'foobarbaz';
UPDATE pgbench_branches SET bbalance = bbalance + 'foobarbaz' WHERE
bid = 'foobarbaz';
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
('foobarbaz', 'foobarbaz', 'foobarbaz', 'foobarbaz',
CURRENT_TIMESTAMP);
END;

and "pgbench uesc" is the same, but the string is

U&'d!0061t!+000061'
uescape
'!'

Now that I think of it, the regression in v7 was largely due to the
fact that the parser has to call the lexer 3 times per string in this
case, and that's going to be slower no matter what we do. I added a
separate test with ordinary backslash escapes ("pgbench unicode"),
rebased v6-8 onto the same commit on master, and reran the performance
tests. The runs are generally +/- 1%:

                master    v6        v7        v8
info-schema     1.49s    1.48s    1.50s    1.53s
pgbench str     1.12s    1.13s    1.15s    1.17s
pgbench unicode 1.29s    1.29s    1.40s    1.36s
pgbench uesc    1.42s    1.44s    1.64s    1.58s

Inlining hexval() and friends seems to have helped somewhat for
unicode escapes, but I'd have to profile to improve that further.
However, v8 has regressed from v7 enough with both simple strings and
the information schema that it's a noticeable regression from HEAD.
I'm guessing getting rid of the "Uescape" production is to blame, but
I haven't tried reverting just that one piece. Since inlining the
rules didn't seem to help with the precedence hacks, it seems like the
separate production was a better way. Thoughts?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v8-draft-handle-uescapes-in-parser.patch
Description: Binary data

Re: benchmarking Flex practices

Reply via email to