Hi Ashutosh, > TEXT copy may be useful for cross platform client side data > generation. BINARY might be useful for same platform client side > generation or server side generation. Just a thought, use TEXT or > BINARY automatically based on where it's cross-platform or same > platform setup.
It is true that BINARY format is not as flexible as TEXT. Postgres expects data in wire to arrive in "network byte order". AFAIK only Solaris can send its data without byte reordering. I support such exception via #ifdef __sparc__ I don't see an easy way to make decision within pgbench on which COPY mode to use TEXT or BINARY except specifically asking for one via command line parameter. This is why I left flag "g" for TEXT mode and added BINARY as "C" (for "Copy" and upper case as faster). I guess, we can add alias "c" for old client-side generation as "copy", but slower version of it. While we are on topic of client- vs server-side generation. IMHO these are quite misleading terms. Both of them are executed by RDBMS on server side, but "server" one gets quite short query (and quite slow in execution) and "client" one gets quite big network transfer (and quite fast in execution). The reason is difference in data path that needs to reflected in documentation. On top of it server-side thrashes DB cache, while client-side works via ring buffer that doesn't allocate more than 1/8 of shared_buffers, > I would review the commit a386942bd29b0ef0c9df061392659880d22cdf43 and > the discussion thread > https://postgr.es/m/alpine.DEB.2.21.1904061826420.3678@lancre > mentioned in the commit message to find that out. At first glance it > looks like an oversight, but I haven't reviewed the commit and thread > myself. That thread might reveal why generate_series() was used > instead of BINARY COPY for server side data generation. If it needs to > change it's better to start a separate thread and separate patch for > that discussion. Thank you for this hint. I went through whole thread and there they discuss how to reflect certain behavior of init-steps and nothing about COPY BINARY. Major point of generate_series() introduction is to send short query to DB and not to worry about network performance. It is quite true that COPY sends tons of data over network and it might be an issue for slow network. They also touched on topic of "one huge transaction" for whole generated dataset or few smaller transaction. Allow me to repost my benchmarks here (as it was lost for pgsql-hasckers because I just used Reply instead of Reply-To-All) Tests: Test | Binary | Init mode | Query and details -----|----------|-----------|------------------------------------------------------- 1 | original | G | INSERT FROM generate_series in single huge transaction 2 | enhanced | G | INSERT FROM generate_series in single huge transaction 3 | enhanced | i | INSERT FROM generate_series in one transaction per scale 4 | enhanced | I | INSERT FROM unnest in one transaction per scale 5 | original | g | COPY FROM STDIN TEXT in single transaction 6 | enhanced | g | COPY FROM STDIN TEXT in single transaction 7 | enhanced | C | COPY FROM STDIN BINARY in single transaction Test | Scale and seconds to complete data generation | 1 | 2 | 10 | 20 | 100 | 200 | 1000 | 2000 -----|------|------|------|------|-------|-------|--------|-------- 1 | 0.19 | 0.37 | 2.01 | 4.34 | 22.58 | 46.64 | 245.98 | 525.99 2 | 0.30 | 0.47 | 2.18 | 4.37 | 25.38 | 56.66 | 240.89 | 482.63 3 | 0.18 | 0.39 | 2.14 | 4.19 | 23.78 | 47.63 | 240.91 | 483.19 4 | 0.18 | 0.38 | 2.17 | 4.39 | 23.68 | 47.93 | 242.63 | 487.33 5 | 0.11 | 0.22 | 1.46 | 2.95 | 15.69 | 32.86 | 154.16 | 311.00 6 | 0.11 | 0.22 | 1.43 | 2.89 | 16.01 | 29.41 | 158.10 | 307.54 7 | 0.14 | 0.12 | 0.56 | 1.16 | 6.22 | 12.70 | 64.70 | 135.58 "Original" binary is pgbench v17.6. "Enhanced" binary is pgbench 19-devel with proposed patch. As we can see another point of discussion in mentioned earlier thread on pgsql-hackers said that multi transactions for init-step do NOT bring any benefit. My numbers show some increase in performance by simply INSERT-ing data in loop with one COMMIT per "scale" on lower scales. On higher scales benefit dissapears. My guess here is quite active process WAL archiver. COPY TEXT is 36% faster than INSERT with multiple transactions. COPY BINARY is ~72% faster than INSERT with multiple transactions. At this point I'm torn between keeping old modes and logic for backward compatibility and introduction of new modes for INSERT & COPY versus simply replacing old less efficient logic with new one. Sorry for quite long response. Best regards, Boris ________________________________ From: Ashutosh Bapat <[email protected]> Sent: November 16, 2025 11:58 PM To: Boris Mironov <[email protected]> Cc: [email protected] <[email protected]> Subject: Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY) On Fri, Nov 14, 2025 at 8:51 PM Boris Mironov <[email protected]> wrote: > > Hi Ashutosh, > > > If there is one method that is better than all others, community will > > be more willing to accept implementation of that one method than > > multiple implementations so as to reduce maintenance burden. > > Ok then. I'll leave "COPY FROM STDIN BINARY" implementation out of 3 only. > Would you prefer to replace original COPY FROM STDIN TEXT by this > code or add it as new "init-step" (e.g., with code "c")? > TEXT copy may be useful for cross platform client side data generation. BINARY might be useful for same platform client side generation or server side generation. Just a thought, use TEXT or BINARY automatically based on where it's cross-platform or same platform setup. > I also have noted that current code doesn't prevent pgbench parameter > like "--init-steps=dtgG". It allows to run data generation step twice. > Each of these "g" and "G" will present own timing in status line. Is this > an oversight or intentional? > I would review the commit a386942bd29b0ef0c9df061392659880d22cdf43 and the discussion thread https://postgr.es/m/alpine.DEB.2.21.1904061826420.3678@lancre mentioned in the commit message to find that out. At first glance it looks like an oversight, but I haven't reviewed the commit and thread myself. That thread might reveal why generate_series() was used instead of BINARY COPY for server side data generation. If it needs to change it's better to start a separate thread and separate patch for that discussion. -- Best Wishes, Ashutosh Bapat
