Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

Andrew Dunstan Mon, 04 May 2026 07:19:49 -0700


On 2026-04-29 We 12:49 PM, Ayush Tiwari wrote:

Hi,

On Mon, 20 Apr 2026 at 20:31, Ayush Tiwari<[email protected]> wrote:


    Hi,


        On Mon, 20 Apr 2026 at 19:09, Tom Lane <[email protected]> wrote:

            Seems to me the correct thing here is to make it work like
            the other
            cases, ie perform pg_server_to_any().  I have exactly no
            sympathy for
            the argument about the RFC saying it must be UTF-8, not
            least because
            that's not in fact what is implemented (what if the server
            encoding
            isn't UTF-8?).


        Agreed. I initially thought rejecting the option was the safer
        route
        given the RFC, but as you pointed out, we aren't enforcing
        UTF-8 strictly on the server side anyway.


            Rejecting this option altogether doesn't improve anything, not
            functionally, not specs-compliance-wise, nor according to the
            principle of least surprise.

        Makes sense. Implementing the conversion properly
        keeps JSON format consistent with how the text and CSV formats
        behave.


            No, you don't get to punt this till later.  Once we ship
            v19 there's
            going to be a strong expectation of backwards compatibility.

            The idea of sending UTF-8 to a client that's set
            client_encoding to
            something else would be risible, if it weren't a security
            hazard.


        I agree sending unconverted bytes to a mismatched
        client encoding is clearly a security hazard that needs
        addressing. Did
        not consider the backward compatibility part, my bad.

        Was trying out adding  pg_server_to_any() to the json_buf after
        composite_to_json() returns,
        correctly covering both explicit ENCODING option
        specifications and
        implicit client_encoding mismatches.

        Let me send a patch with code and associated test cases.

    Attached patch with round trip test case. Please review and let me
    know if it's in the right direction.


I have registered this patch set in the CommitFest for tracking:
https://commitfest.postgresql.org/patch/6700/

Please let me know if the patch looks good, and if I need to add it
in the open items list for PG 19.

Basically good, I think. I have modified your test a bit, testing moredirectly for the presence of the LATIN-1 encoded character and theabsence of the UTF-8 encoded character, by reading in the file withpg_read_binary_file, and adding a test for implicit encoding by settingclient_encoding.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

From ae435abf77305a6e31fd03b11fbc1b17ea4a4c50 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Mon, 20 Apr 2026 11:16:36 +0530
Subject: [PATCH v4] Apply encoding conversion in COPY TO FORMAT JSON

CopyToJsonOneRow() sent the output of composite_to_json() directly
via CopySendData() without encoding conversion.  The text and CSV
paths convert per-attribute via pg_server_to_any() when
need_transcoding is true, but the JSON path skipped this entirely.

This meant COPY ... TO ... WITH (FORMAT json, ENCODING 'LATIN1') on
a UTF-8 server silently produced UTF-8 output, and COPY TO STDOUT
with a non-UTF-8 client_encoding would send unconverted bytes to
the client.

Apply pg_server_to_any() to the whole JSON buffer after
composite_to_json() returns, converting to the requested file
encoding when it differs from the server encoding.  Tests cover
both the explicit ENCODING option and the implicit case where
file_encoding is inherited from client_encoding.

Introduced by 7dadd38cda9 (json format for COPY TO).

Author: Ayush Tiwari <[email protected]>
Reviewed-by: Andrew Dunstan <[email protected]>
Discussion: https://postgr.es/m/cajtyswx-jslzxgrab-dwnepgyrpbdyhwce8lctve92lidfm...@mail.gmail.com
---
 src/backend/commands/copyto.c              | 20 +++++++++++++++-
 src/test/regress/expected/copyencoding.out | 27 ++++++++++++++++++++++
 src/test/regress/sql/copyencoding.sql      | 19 +++++++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 85d15353647..bb28c15339d 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -427,7 +427,25 @@ CopyToJsonOneRow(CopyToState cstate, TupleTableSlot *slot)
 		}
 	}
 
-	CopySendData(cstate, cstate->json_buf->data, cstate->json_buf->len);
+	/*
+	 * Convert the JSON output to the target encoding if needed.  Unlike the
+	 * text and CSV paths which convert per-attribute via CopyAttributeOut*,
+	 * composite_to_json() emits the whole row as one buffer, so we transcode
+	 * it here in a single call before sending.
+	 */
+	if (cstate->need_transcoding)
+	{
+		char	   *converted;
+
+		converted = pg_server_to_any(cstate->json_buf->data,
+									 cstate->json_buf->len,
+									 cstate->file_encoding);
+		CopySendData(cstate, converted, strlen(converted));
+		if (converted != cstate->json_buf->data)
+			pfree(converted);
+	}
+	else
+		CopySendData(cstate, cstate->json_buf->data, cstate->json_buf->len);
 
 	CopySendTextLikeEndOfRow(cstate);
 }
diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
index 76ea0e7cf04..1e7ebc7d0e6 100644
--- a/src/test/regress/expected/copyencoding.out
+++ b/src/test/regress/expected/copyencoding.out
@@ -50,4 +50,31 @@ COPY copy_encoding_tab FROM :'utf8_csv' WITH (FORMAT csv);
 ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
 CONTEXT:  COPY copy_encoding_tab, line 1
 RESET client_encoding;
+-- JSON format encoding conversion
+\set json_latin1 :abs_builddir '/results/copyencoding_json_latin1.json'
+COPY (SELECT E'\u00e9' AS c) TO :'json_latin1' WITH (FORMAT json, ENCODING 'LATIN1');
+-- Verify the file contains LATIN1 'é' (single byte 0xe9) and not UTF-8 (0xc3 0xa9).
+-- Done as separate position checks to stay independent of the platform's
+-- end-of-line convention.
+SELECT position('\xe9'::bytea  IN pg_read_binary_file(:'json_latin1')) > 0 AS has_latin1_e9,
+       position('\xc3a9'::bytea IN pg_read_binary_file(:'json_latin1')) > 0 AS has_utf8_e9;
+ has_latin1_e9 | has_utf8_e9 
+---------------+-------------
+ t             | f
+(1 row)
+
+-- Same with implicit encoding inherited from client_encoding (no ENCODING
+-- option).  Covers the case where a client with a non-UTF8 client_encoding
+-- runs COPY ... FORMAT json and would otherwise receive unconverted bytes.
+\set json_implicit :abs_builddir '/results/copyencoding_json_implicit_latin1.json'
+SET client_encoding TO LATIN1;
+COPY (SELECT E'\u00e9' AS c) TO :'json_implicit' WITH (FORMAT json);
+RESET client_encoding;
+SELECT position('\xe9'::bytea  IN pg_read_binary_file(:'json_implicit')) > 0 AS has_latin1_e9,
+       position('\xc3a9'::bytea IN pg_read_binary_file(:'json_implicit')) > 0 AS has_utf8_e9;
+ has_latin1_e9 | has_utf8_e9 
+---------------+-------------
+ t             | f
+(1 row)
+
 DROP TABLE copy_encoding_tab;
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
index 64718245b94..c8905da5567 100644
--- a/src/test/regress/sql/copyencoding.sql
+++ b/src/test/regress/sql/copyencoding.sql
@@ -57,4 +57,23 @@ SET client_encoding TO EUC_JP;
 COPY copy_encoding_tab FROM :'utf8_csv' WITH (FORMAT csv);
 RESET client_encoding;
 
+-- JSON format encoding conversion
+\set json_latin1 :abs_builddir '/results/copyencoding_json_latin1.json'
+COPY (SELECT E'\u00e9' AS c) TO :'json_latin1' WITH (FORMAT json, ENCODING 'LATIN1');
+-- Verify the file contains LATIN1 'é' (single byte 0xe9) and not UTF-8 (0xc3 0xa9).
+-- Done as separate position checks to stay independent of the platform's
+-- end-of-line convention.
+SELECT position('\xe9'::bytea  IN pg_read_binary_file(:'json_latin1')) > 0 AS has_latin1_e9,
+       position('\xc3a9'::bytea IN pg_read_binary_file(:'json_latin1')) > 0 AS has_utf8_e9;
+
+-- Same with implicit encoding inherited from client_encoding (no ENCODING
+-- option).  Covers the case where a client with a non-UTF8 client_encoding
+-- runs COPY ... FORMAT json and would otherwise receive unconverted bytes.
+\set json_implicit :abs_builddir '/results/copyencoding_json_implicit_latin1.json'
+SET client_encoding TO LATIN1;
+COPY (SELECT E'\u00e9' AS c) TO :'json_implicit' WITH (FORMAT json);
+RESET client_encoding;
+SELECT position('\xe9'::bytea  IN pg_read_binary_file(:'json_implicit')) > 0 AS has_latin1_e9,
+       position('\xc3a9'::bytea IN pg_read_binary_file(:'json_implicit')) > 0 AS has_utf8_e9;
+
 DROP TABLE copy_encoding_tab;
-- 
2.43.0

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

Reply via email to