On Mon, Jul 29, 2019 at 10:40 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > John Naylor <john.nay...@2ndquadrant.com> writes: > > > The lexer returns UCONST from xus and UIDENT from xui. The grammar has > > rules that are effectively: > > > SCONST { do nothing} > > | UCONST { esc char is backslash } > > | UCONST UESCAPE SCONST { esc char is from $3 } > > > ...where UESCAPE is now an unreserved keyword. To prevent shift-reduce > > conflicts, I added UIDENT to the %nonassoc precedence list to match > > IDENT, and for UESCAPE I added a %left precedence declaration. Maybe > > there's a more principled way. I also added an unsigned char type to > > the %union, but it worked fine on my compiler without it. > > I think it might be better to drop the separate "Uescape" production and > just inline that into the calling rules, exactly per your sketch above. > You could avoid duplicating the escape-checking logic by moving that into > the str_udeescape support function. This would avoid the need for the > "uchr" union variant, but more importantly it seems likely to be more > future-proof: IME, any time you can avoid or postpone shift/reduce > decisions, it's better to do so. > > I didn't try, but I think this might allow dropping the %left for > UESCAPE. That bothers me because I don't understand why it's > needed or what precedence level it ought to have.
I tried this, and removing the %left still gives me a shift/reduce conflict, so I put some effort in narrowing down what's happening. If I remove the rules with UESCAPE individually, I find that precedence is not needed for Sconst -- only for Ident. I tried reverting all the rules to use the original "IDENT" token and one by one changed them to "Ident", and found 6 places where doing so caused a shift-reduce conflict: createdb_opt_name xmltable_column_option_el ColId type_function_name NonReservedWord ColLabel Due to the number of affected places, that didn't seem like a useful avenue to pursue, so I tried the following: -Making UESCAPE a reserved keyword or separate token type works, but other keyword types don't work. Not acceptable, but maybe useful info. -Giving UESCAPE an %nonassoc precedence above UIDENT works, even if UIDENT is the lowest in the list. This seems the least intrusive, so I went with that for v8. One possible downside is that UIDENT now no longer has the same precedence as IDENT. Not sure if it matters, but could we fix that contextually with "%prec IDENT"? > > litbuf_udeescape() and check_uescapechar() were moved to gram.y. The > > former had be massaged to give error messages similar to HEAD. They're > > not quite identical, but the position info is preserved. Some of the > > functions I moved around don't seem to have any test coverage, so I > > should eventually do some work in that regard. > > I don't terribly like the cross-calls you have between gram.y and scan.l > in this formulation. If we have to make these functions (hexval() etc) > non-static anyway, maybe we should shove them all into scansup.c? I ended up making them static inline in scansup.h since that seemed to reduce the performance impact (results below). I cribbed some of the surrogate pair queries from the jsonpath regression tests so we have some coverage here. Diff'ing from HEAD to patch, the locations are different for a couple cases (a side effect of the differen error handling style from scan.l). The patch seems to consistently point at an escape sequence, so I think it's okay to use that. HEAD, on the other hand, sometimes points at the start of the whole string: select U&'\de04\d83d'; -- surrogates in wrong order -psql:test_unicode.sql:10: ERROR: invalid Unicode surrogate pair at or near "U&'\de04\d83d'" +psql:test_unicode.sql:10: ERROR: invalid Unicode surrogate pair LINE 1: select U&'\de04\d83d'; - ^ + ^ select U&'\de04X'; -- orphan low surrogate -psql:test_unicode.sql:12: ERROR: invalid Unicode surrogate pair at or near "U&'\de04X'" +psql:test_unicode.sql:12: ERROR: invalid Unicode surrogate pair LINE 1: select U&'\de04X'; - ^ + ^ > > -Performance is very close to v6 with the information_schema and > > pgbench-like queries with standard strings, which is to say also very > > close to HEAD. When the latter was changed to use Unicode escapes, > > however, it was about 15% slower than HEAD. That's a big regression > > and I haven't tried to pinpoint why. > > I don't quite follow what you changed to produce the slower test case? > But that seems to be something we'd better run to ground before > deciding whether to go this way. So "pgbench str" below refers to driving the parser with this set of queries repeated a couple hundred times in a string: BEGIN; UPDATE pgbench_accounts SET abalance = abalance + 'foobarbaz' WHERE aid = 'foobarbaz'; SELECT abalance FROM pgbench_accounts WHERE aid = 'foobarbaz'; UPDATE pgbench_tellers SET tbalance = tbalance + 'foobarbaz' WHERE tid = 'foobarbaz'; UPDATE pgbench_branches SET bbalance = bbalance + 'foobarbaz' WHERE bid = 'foobarbaz'; INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ('foobarbaz', 'foobarbaz', 'foobarbaz', 'foobarbaz', CURRENT_TIMESTAMP); END; and "pgbench uesc" is the same, but the string is U&'d!0061t!+000061' uescape '!' Now that I think of it, the regression in v7 was largely due to the fact that the parser has to call the lexer 3 times per string in this case, and that's going to be slower no matter what we do. I added a separate test with ordinary backslash escapes ("pgbench unicode"), rebased v6-8 onto the same commit on master, and reran the performance tests. The runs are generally +/- 1%: master v6 v7 v8 info-schema 1.49s 1.48s 1.50s 1.53s pgbench str 1.12s 1.13s 1.15s 1.17s pgbench unicode 1.29s 1.29s 1.40s 1.36s pgbench uesc 1.42s 1.44s 1.64s 1.58s Inlining hexval() and friends seems to have helped somewhat for unicode escapes, but I'd have to profile to improve that further. However, v8 has regressed from v7 enough with both simple strings and the information schema that it's a noticeable regression from HEAD. I'm guessing getting rid of the "Uescape" production is to blame, but I haven't tried reverting just that one piece. Since inlining the rules didn't seem to help with the precedence hacks, it seems like the separate production was a better way. Thoughts? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
v8-draft-handle-uescapes-in-parser.patch
Description: Binary data