On 04/02/2021 03:50, Kyotaro Horiguchi wrote:
At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinn...@iki.fi> wrote in
On 03/02/2021 15:38, John Naylor wrote:
On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinn...@iki.fi
<mailto:hlinn...@iki.fi>> wrote:
  >
  > Hi,
  >
  > While playing with COPY FROM refactorings in another thread, I noticed
  > corner case where I think backslash escaping doesn't work correctly.
  > Consider the following input:
  >
  > \么.foo
I've seen multibyte delimiters in the wild, so it's not as outlandish
as it seems.

We don't actually support multi-byte characters as delimiters or quote
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR:  COPY delimiter must be a single one-byte character

The fix is simple enough, so +1.

Thanks, I'll commit and backpatch shortly.

I'm not sure the assumption in the second hunk always holds, but
that's fine at least with Shift-JIS and -2004 since they are two-byte
encoding.

The assumption is that a multi-byte character cannot have a special meaning, as far as the loop in CopyReadLineText is concerned. The characters with special meaning are '\\', '\n' and '\r'. That hold regardless of encoding.

Thinking about this a bit more, I think the attached patch is slightly better. Normally in the loop, raw_buf_ptr points to the next byte to consume, and 'c' is the last consumed byte. At the end of the loop, we check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd and 4th byte if necessary. The crux of the bug is that after the "raw_buf_ptr++;" to skip the character after the backslash, we left c to '\\', even though we already consumed the first byte of the next character. Because of that, the end-of-the-loop check didn't correctly treat it as a multi-byte character. So a more straightforward fix is to set 'c' to the byte we skipped over.

- Heikki
>From a9411516dcdcf4c91a9f88dd05741a1479dec7a3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 4 Feb 2021 21:36:46 +0200
Subject: [PATCH v2 1/1] Fix a corner-case in COPY FROM backslash processing.

If a multi-byte character is escaped with a backslash in TEXT mode input,
we didn't always treat the escape correctly. If:

- a multi-byte character is escaped with a backslash, and
- the second byte of the character is 0x5C, i.e. the ASCII code of a
  backslash (\), and
- the next character is a dot (.), then

CopyReadLineText function would incorrectly interpret the sequence as an
end-of-copy marker (\.). This can only happen in encodings that can
"embed" ascii characters as the second byte. One example of such sequence
is '\x5ca45c2e666f6f' in Big5 encoding. If you put that in a file, and
load it with COPY FROM, you'd incorrectly get an "end-of-copy marker
corrupt" error.

Backpatch to all supported versions.

Reviewed-by: John Naylor, Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
---
 src/backend/commands/copyfromparse.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b843d315b1..315b16fd7a 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1084,7 +1084,7 @@ CopyReadLineText(CopyFromState cstate)
 				break;
 			}
 			else if (!cstate->opts.csv_mode)
-
+			{
 				/*
 				 * If we are here, it means we found a backslash followed by
 				 * something other than a period.  In non-CSV mode, anything
@@ -1095,8 +1095,16 @@ CopyReadLineText(CopyFromState cstate)
 				 * backslashes are not special, so we want to process the
 				 * character after the backslash just like a normal character,
 				 * so we don't increment in those cases.
+				 *
+				 * Set 'c' to skip whole character correctly in multi-byte
+				 * encodings.  If we don't have the whole character in the
+				 * buffer yet, we might loop back to process it, after all,
+				 * but that's OK because multi-byte characters cannot have any
+				 * special meaning.
 				 */
 				raw_buf_ptr++;
+				c = c2;
+			}
 		}
 
 		/*
-- 
2.30.0

Reply via email to