Hi hackers, When error "trailing junk after numeric literal" occurs at a number followed by a symbol that is presented by more than one byte, that symbol in the error message is not displayed correctly. Instead of that symbol there is only its first byte. That makes the error message an invalid UTF-8 (or whatever encoding is set). The whole log file where this error message goes also becomes invalid. That could lead to problems with reading logs. You can see an invalid message by trying "SELECT 123ä;".
Rejecting trailing junk after numeric literals was introduced in commit 2549f066 to prevent scanning a number immediately followed by an identifier without whitespace as number and identifier. All the tokens that made to catch such cases match a numeric literal and the next byte, and that is where the problem comes from. I thought that it could be fixed just by using tokens that match a numeric literal immediately followed by an identifier, not only one byte. This also improves error messages in cases with English letters. After these changes, for "SELECT 123abc;" the error message will say that the error appeared at or near "123abc" instead of "123a". I've attached the patch. Are there any pitfalls I can't see? It just keeps bothering me why wasn't it done from the beginning. Matching the whole identifier after a numeric literal just seems more obvious to me than matching its first byte. Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
From 281303312471b4ef831f57d124208ce1699aed54 Mon Sep 17 00:00:00 2001 From: Karina Litskevich <litskevichkar...@gmail.com> Date: Tue, 27 Aug 2024 17:01:49 +0300 Subject: [PATCH v1] Improve error message for rejecting trailing junk after numeric literals Rejecting trailing junk after numeric literals was introduced in commit 2549f066 to prevent scanning a number immediately followed by an identifier without whitespace as number and identifier. Unfortunately, all the tokens made to catch such numeric literals followed by non-digits match a numeric literal and the next byte. The lexemes found by these tokens are broken in case the next symbol after a numeric literal is presented by several bytes as only the first byte of the symbol gets to the lexeme. When this lexeme is then printed as a part of an error message that message became broken too along with the whole log file where it goes. This commit fixes the problem by using tokens that match a numeric literal immediately followed by an identifier, not only one byte. This also improves error messages in cases with English letters. For 123abc the error message now will say that the error appeared at or near "123abc" instead of "123a". --- src/backend/parser/scan.l | 14 +++++++------- src/fe_utils/psqlscan.l | 14 +++++++------- src/interfaces/ecpg/preproc/pgc.l | 14 +++++++------- src/test/regress/expected/numerology.out | 14 +++++++++----- src/test/regress/sql/numerology.sql | 1 + 5 files changed, 31 insertions(+), 26 deletions(-) diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index f74059e7b0..448f9b5afe 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -412,16 +412,16 @@ numericfail {decinteger}\.\. real ({decinteger}|{numeric})[Ee][-+]?{decinteger} realfail ({decinteger}|{numeric})[Ee][-+] -decinteger_junk {decinteger}{ident_start} -hexinteger_junk {hexinteger}{ident_start} -octinteger_junk {octinteger}{ident_start} -bininteger_junk {bininteger}{ident_start} -numeric_junk {numeric}{ident_start} -real_junk {real}{ident_start} +decinteger_junk {decinteger}{identifier} +hexinteger_junk {hexinteger}{identifier} +octinteger_junk {octinteger}{identifier} +bininteger_junk {bininteger}{identifier} +numeric_junk {numeric}{identifier} +real_junk {real}{identifier} /* Positional parameters don't accept underscores. */ param \${decdigit}+ -param_junk \${decdigit}+{ident_start} +param_junk \${decdigit}+{identifier} other . diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l index ddc4658b92..282f117b81 100644 --- a/src/fe_utils/psqlscan.l +++ b/src/fe_utils/psqlscan.l @@ -348,16 +348,16 @@ numericfail {decinteger}\.\. real ({decinteger}|{numeric})[Ee][-+]?{decinteger} realfail ({decinteger}|{numeric})[Ee][-+] -decinteger_junk {decinteger}{ident_start} -hexinteger_junk {hexinteger}{ident_start} -octinteger_junk {octinteger}{ident_start} -bininteger_junk {bininteger}{ident_start} -numeric_junk {numeric}{ident_start} -real_junk {real}{ident_start} +decinteger_junk {decinteger}{identifier} +hexinteger_junk {hexinteger}{identifier} +octinteger_junk {octinteger}{identifier} +bininteger_junk {bininteger}{identifier} +numeric_junk {numeric}{identifier} +real_junk {real}{identifier} /* Positional parameters don't accept underscores. */ param \${decdigit}+ -param_junk \${decdigit}+{ident_start} +param_junk \${decdigit}+{identifier} /* psql-specific: characters allowed in variable names */ variable_char [A-Za-z\200-\377_0-9] diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l index f363a34659..7887773277 100644 --- a/src/interfaces/ecpg/preproc/pgc.l +++ b/src/interfaces/ecpg/preproc/pgc.l @@ -381,16 +381,16 @@ numericfail {decinteger}\.\. real ({decinteger}|{numeric})[Ee][-+]?{decinteger} realfail ({decinteger}|{numeric})[Ee][-+] -decinteger_junk {decinteger}{ident_start} -hexinteger_junk {hexinteger}{ident_start} -octinteger_junk {octinteger}{ident_start} -bininteger_junk {bininteger}{ident_start} -numeric_junk {numeric}{ident_start} -real_junk {real}{ident_start} +decinteger_junk {decinteger}{identifier} +hexinteger_junk {hexinteger}{identifier} +octinteger_junk {octinteger}{identifier} +bininteger_junk {bininteger}{identifier} +numeric_junk {numeric}{identifier} +real_junk {real}{identifier} /* Positional parameters don't accept underscores. */ param \${decdigit}+ -param_junk \${decdigit}+{ident_start} +param_junk \${decdigit}+{identifier} /* special characters for other dbms */ /* we have to react differently in compat mode */ diff --git a/src/test/regress/expected/numerology.out b/src/test/regress/expected/numerology.out index 717a237df9..f4bd6fbfee 100644 --- a/src/test/regress/expected/numerology.out +++ b/src/test/regress/expected/numerology.out @@ -171,9 +171,13 @@ SELECT -0x8000000000000001; -- error cases SELECT 123abc; -ERROR: trailing junk after numeric literal at or near "123a" +ERROR: trailing junk after numeric literal at or near "123abc" LINE 1: SELECT 123abc; ^ +SELECT 1ä; +ERROR: trailing junk after numeric literal at or near "1ä" +LINE 1: SELECT 1ä; + ^ SELECT 0x0o; ERROR: trailing junk after numeric literal at or near "0x0o" LINE 1: SELECT 0x0o; @@ -322,7 +326,7 @@ ERROR: trailing junk after numeric literal at or near "100_" LINE 1: SELECT 100_; ^ SELECT 100__000; -ERROR: trailing junk after numeric literal at or near "100_" +ERROR: trailing junk after numeric literal at or near "100__000" LINE 1: SELECT 100__000; ^ SELECT _1_000.5; @@ -334,7 +338,7 @@ ERROR: trailing junk after numeric literal at or near "1_000_" LINE 1: SELECT 1_000_.5; ^ SELECT 1_000._5; -ERROR: trailing junk after numeric literal at or near "1_000._" +ERROR: trailing junk after numeric literal at or near "1_000._5" LINE 1: SELECT 1_000._5; ^ SELECT 1_000.5_; @@ -342,11 +346,11 @@ ERROR: trailing junk after numeric literal at or near "1_000.5_" LINE 1: SELECT 1_000.5_; ^ SELECT 1_000.5e_1; -ERROR: trailing junk after numeric literal at or near "1_000.5e" +ERROR: trailing junk after numeric literal at or near "1_000.5e_1" LINE 1: SELECT 1_000.5e_1; ^ PREPARE p1 AS SELECT $0_1; -ERROR: trailing junk after parameter at or near "$0_" +ERROR: trailing junk after parameter at or near "$0_1" LINE 1: PREPARE p1 AS SELECT $0_1; ^ -- diff --git a/src/test/regress/sql/numerology.sql b/src/test/regress/sql/numerology.sql index 3ae491cc98..d1a42b046c 100644 --- a/src/test/regress/sql/numerology.sql +++ b/src/test/regress/sql/numerology.sql @@ -44,6 +44,7 @@ SELECT -0x8000000000000001; -- error cases SELECT 123abc; +SELECT 1ä; SELECT 0x0o; SELECT 0.a; SELECT 0.0a; -- 2.34.1