question regarding copyData containers
Hello,

I have been working on a node.js streaming client for different COPY scenarios. Usually, during CopyOut, clients tend to buffer network chunks until they have gathered a full copyData message and pass that to the user. In some cases, this can lead to very large copyData messages: when there are very long text or bytea fields, a lot of memory is required to handle them (up to 1GB in the worst case scenario, I think).

In COPY TO, I managed to relax that requirement by considering that copyData is simply a transparent container. For each network chunk, the relevant message content is forwarded, which makes for 64KB chunks at most.

If that makes things clearer, here is an example scenario, with 4 network chunks received and the way they are forwarded to the client:

in:  CopyData Int32Len Byten1
in:  Byten2
in:  Byten3
in:  CopyData Int32Len Byten4

out: Byten1
out: Byten2
out: Byten3
out: Byten4

We lose the semantics of the "row" that copyData has according to the documentation
https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY

>The backend sends a CopyOutResponse message to the frontend, followed by
>zero or more CopyData messages (**always one per row**), followed by
>CopyDone

but that is not a problem, because the raw bytes are still parsable (rows + fields) in both text mode (tsv) and binary mode.

Now I have started working on copyBoth and logical decoding scenarios. In this case, the server sends a series of copyData messages.
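To make the forwarding scheme above concrete, here is a minimal sketch of the unwrapping state machine (the class and callback names are mine, not from any existing client library): it tracks how many payload bytes remain in the current copyData frame and forwards each network chunk's payload immediately, without assembling the full message.

```javascript
// Sketch: forward CopyData payload bytes as chunks arrive, never
// assembling a full copyData message in memory.
class CopyDataUnwrapper {
  constructor(onPayload) {
    this.onPayload = onPayload; // called with each payload slice
    this.remaining = 0;         // payload bytes left in current frame
    this.header = Buffer.alloc(0);
  }
  write(chunk) {
    let off = 0;
    while (off < chunk.length) {
      if (this.remaining > 0) {
        // inside a frame: forward whatever part of it this chunk holds
        const take = Math.min(this.remaining, chunk.length - off);
        this.onPayload(chunk.subarray(off, off + take));
        this.remaining -= take;
        off += take;
      } else {
        // accumulate the 5-byte header: 'd' tag + Int32 length
        const need = 5 - this.header.length;
        const take = Math.min(need, chunk.length - off);
        this.header = Buffer.concat([this.header, chunk.subarray(off, off + take)]);
        off += take;
        if (this.header.length === 5) {
          // the Int32 length counts itself (4 bytes), not the tag
          this.remaining = this.header.readInt32BE(1) - 4;
          this.header = Buffer.alloc(0);
        }
      }
    }
  }
}
```

The key point is that peak client memory is bounded by the network chunk size (64KB here), not by the copyData length.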
1 copyData contains 1 message. At the network chunk level, in the case of large fields, we can observe:

in:  CopyData Int32 XLogData Int64 Int64 Int64 Byten1
in:  Byten2
in:  CopyData Int32 XLogData Int64 Int64 Int64 Byten3
in:  CopyData Int32 XLogData Int64 Int64 Int64 Byten4

out: XLogData Int64 Int64 Int64 Byten1
out: Byten2
out: XLogData Int64 Int64 Int64 Byten3
out: XLogData Int64 Int64 Int64 Byten4

But at the XLogData level, the message does not describe its own length, so there is no real way of knowing where the first XLogData ends apart from:
 - knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2)
 - knowing the internals of the output plugin, and benefiting from a plugin that self-describes its span

When a network chunk contains several copyData messages

in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 CopyData Int32 XLogData Int64 Int64 Int64 Byten2

we have

out: XLogData Int64 Int64 Int64 Byten1 XLogData Int64 Int64 Int64 Byten2

and with test_decoding, for example, it is impossible to know where the test_decoding output ends without remembering the original length of the copyData.

Now my question is the following: is it OK to consider that, over the long term, copyData is simply a transport container that exists only to allow the multiplexing of events in the protocol, but that messages inside could be chunked over several copyData events?

Leaving test_decoding aside, do you consider that output plugins' XLogData should be self-aware of their length? I suppose (but have not fully verified yet) that this is the case for pgoutput? I suppose that wal2json could also be parsed by balancing the brackets.

I am wondering because, when a client sends copyData to the server, the documentation says

>The message boundaries are not required to have anything to do with row boundaries,
>although that is often a reasonable choice.

I hope that my message will ring a bell on the list. I tried the best I could to describe my very specific research.
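For reference, here is a sketch of how a client carves an XLogData out of an assembled copyData payload (the field layout is from the protocol documentation; the helper name is mine). It shows exactly the dependence described above: the plugin output's length can only be derived from the enclosing copyData frame.

```javascript
// Sketch: parse an XLogData ('w') message from a copyData payload.
// Layout: 'w', Int64 walStart, Int64 walEnd, Int64 sendTime, then the
// output plugin's bytes. The plugin output length is
// (copyData length - 4 - 1 - 3*8): it is NOT self-described.
function parseXLogData(payload) {
  if (payload[0] !== 0x77) { // 'w'
    throw new Error('not an XLogData message');
  }
  return {
    walStart: payload.readBigInt64BE(1),
    walEnd: payload.readBigInt64BE(9),
    sendTime: payload.readBigInt64BE(17),
    data: payload.subarray(25), // runs to the end of the copyData frame
  };
}
```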
Thank you for your help,
--
Jérôme
Re: question regarding copyData containers
Hello,

Thank you for your feedback. I agree that modifying the COPY subprotocols is hard to do, because it would have an impact on the client ecosystem.

My understanding (which seems to be confirmed by what Tom Lane said) is that the server discards the framing and manages to make sense of the underlying data.

> the expectation is that clients can send CopyData messages that are
> split up however they choose; the message boundaries needn't correspond
> to any semantic boundaries in the data stream.

So I thought that a client could decide to have the same behavior and could start parsing the payload of a copyData message without assembling it first. It works perfectly with COPY TO, but I hit a roadblock on copyBoth during logical replication with test_decoding, because the subprotocol doesn't have any framing.

> Right now all 'w' messages should be contained in one CopyData/'d' that
> doesn't contain anything but the XLogData/'w'.

The current format of the XLogData/'w' message is

w lsn lsn time byten

and even if it is maybe too late now, I was wondering why it was not decided to be

w lsn lsn time n byten

because it seems to me that the missing n ties the XLogData to the copyData framing.

>The input data exists in a linear
>buffer already, so you're not going to reduce peak memory usage by
>sending smaller CopyData chunks.

That is very surprising to me. Do you mean that on the server, in COPY TO mode, a full row is prepared in a linear buffer in memory before being sent as a copyData/'d'? I found the code around
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2153
and indeed the whole row seems to be buffered in memory.

Good thing or bad thing, users tend to use bigger fields (text, jsonb, bytea), and that can be very memory hungry. Do you know of a case in postgres (other than large objects, I suppose) where the server can flush data from a field without buffering it in memory?

And then, as you noted, there is the multiplexing of events.
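To illustrate what the extra n would buy: with a purely hypothetical self-framed layout (w lsn lsn time n byten), a client could split consecutive messages out of a raw byte stream without tracking copyData frame lengths at all. A sketch, under that assumed layout:

```javascript
// Hypothetical: split self-framed XLogData messages out of a byte stream.
// Assumed layout: 'w' + 3 * Int64 + Int32 payload length + payload,
// i.e. a 29-byte header. This format does NOT exist in postgres today.
function splitSelfFramedXLogData(stream) {
  const messages = [];
  let off = 0;
  while (off < stream.length) {
    const n = stream.readInt32BE(off + 25); // the missing length field
    messages.push(stream.subarray(off + 29, off + 29 + n));
    off += 29 + n;
  }
  return messages;
}
```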
A very long copyData makes communication between the client and the server impossible during the transfer.

I briefly looked at
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c
and I found

/*
 * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
 *
 * We don't have a good idea of what a good value would be; there's some
 * overhead per message in both walsender and walreceiver, but on the other
 * hand sending large batches makes walsender less responsive to signals
 * because signals are checked only between messages. 128kB (with
 * default 8k blocks) seems like a reasonable guess for now.
 */
#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)

so I thought that the maximum copyData/'d' I would receive during logical replication was MAX_SEND_SIZE, but it seems that this is not used for logical decoding. The whole output of the output plugin seems to be prepared in memory, so for an insert like

insert into mytable (col) values (repeat('-', pow(2, 27)::int))

a 128MB linear buffer will be created on the server and sent as 1 copyData over many network chunks.

So I understand that, in the long term, copyData framing should not carry any semantics, in order to keep messages small enough to allow multiplexing, but that there are many steps to climb before that.

Would it make sense one day, in some way, to try and do streaming at the sub-field level? I guess that is a huge undertaking, since most of the field unit interfaces are probably based on a buffer/field one-to-one mapping.

Greetings,
Jérôme

On Thu, Jun 4, 2020 at 12:08 AM Andres Freund wrote:
> Hi,
>
> On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote:
> > I have been working on a node.js streaming client for different COPY
> > scenarios.
> > usually, during CopyOut, clients tend to buffer network chunks until they
> > have gathered a full copyData message and pass that to the user.
> >
> > In some cases, this can lead to very large copyData messages.
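The sizes quoted above check out: with the default XLOG_BLCKSZ of 8kB, MAX_SEND_SIZE is 128kB, while the repeat('-', pow(2, 27)::int) value is 2^27 bytes, i.e. a single 128MB copyData, three orders of magnitude larger.

```javascript
// Arithmetic behind the sizes mentioned in the text.
const XLOG_BLCKSZ = 8 * 1024;             // default 8k blocks
const MAX_SEND_SIZE = XLOG_BLCKSZ * 16;   // 131072 bytes = 128kB
const pluginOutput = 2 ** 27;             // repeat('-', pow(2,27)::int)
console.log(MAX_SEND_SIZE / 1024);        // 128 (kB)
console.log(pluginOutput / (1024 * 1024)); // 128 (MB)
```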
> > when there
> > are very long text fields or bytea fields it will require a lot of memory
> > to be handled (up to 1GB I think in the worst case scenario)
> >
> > In COPY TO, I managed to relax that requirement, considering that copyData
> > is simply a transparent container. For each network chunk, the relevent
> > message content is forwarded which makes for 64KB chunks at most.
>
> Uhm.
>
> > We loose the semantics of the "row" that copyData has according to the
> > documentation
> > https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY
> > >The backend sends a CopyOutResponse message to the frontend, followed by
> > >zero or more CopyData messages (**always one per row**), followed by
> > >CopyDone
COPY, bytea streaming and memory footprint
Hello,

I am trying to understand/optimize how a COPY operation behaves when transferring a bytea from the database to a client. For simplicity, I'll consider that I have only one bytea _image_ in the _images_ table.

Starting with

COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that:
 - the server will create a linear buffer on the server side holding the whole image, and then start sending it over the network in one big copyData message chunked in 64KB network chunks
 - the client can manage to extract this copyData payload by re-assembling those chunks in memory, or by streaming the relevant data parts of the chunks elsewhere

So the problem I see in a streaming situation is that the server actually needs to buffer the whole image in memory. Now, the image is already compressed, so if I

ALTER TABLE images ALTER image SET STORAGE EXTERNAL

I can use the fact that substring on non-compressed toasted values will fetch only the needed parts, and do

COPY (
  SELECT (SELECT substring(image from n for 65536) FROM images)
  FROM generate_series(1, (SELECT length(image) FROM images), 65536) n
) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side if the server starts sending rows before all rows of the subselect are built, because it would only need to prepare a sequence of 65536-byte-long buffers for the rows it decides to have in memory.

But is there a way to know if such a COPY/SELECT statement will indeed start sending rows before they are all prepared on the server? Does it depend on the request, and is there a difference if I add an ORDER BY on the select versus the natural order of the table? How many rows will be needed in memory before the sending begins?

I hope my explanation was clear. I am looking for help in better understanding how the server decides to stream the COPY data out of the server versus the internal retrieval of the COPY'd subselect.

Thank you
Jérôme
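On the client side, reassembling the image from the chunked query is then a matter of walking the documented COPY BINARY format and concatenating the single bytea field of each row. A sketch, assuming the documented layout (11-byte signature, Int32 flags, Int32 header-extension length, then per tuple an Int16 field count and Int32-length-prefixed fields, with an Int16 -1 trailer); the generator name is mine:

```javascript
// Sketch: yield each field of an assembled COPY BINARY stream, so the
// 64KB substring slices can be concatenated back into the original image.
function* binaryCopyFields(buf) {
  // 11-byte signature + Int32 flags, then Int32 header-extension length
  let off = 11 + 4;
  off += 4 + buf.readInt32BE(off); // skip the extension area
  for (;;) {
    const nfields = buf.readInt16BE(off); // per-tuple field count
    off += 2;
    if (nfields === -1) return; // file trailer
    for (let i = 0; i < nfields; i++) {
      const len = buf.readInt32BE(off);
      off += 4;
      if (len === -1) { yield null; continue; } // NULL field
      yield buf.subarray(off, off + len);
      off += len;
    }
  }
}
```

(Of course, a truly streaming client would run this state machine incrementally over network chunks rather than over an assembled buffer, as discussed in the earlier thread.)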