Hello, On Mon, Nov 13, 2023 at 6:01 PM Andres Freund <and...@anarazel.de> wrote:
> Hi, > > On 2023-11-13 11:55:07 -0600, Adam Hendel wrote: > > Currently, pgbench will log individual transactions to a logfile when the > > `--log` parameter flag is provided. The logfile, however, does not > include > > column header. It has become a fairly standard expectation of users to > have > > column headers present in flat files. Without the header in the pgbench > log > > files, new users must navigate to the docs and piece together the column > > headers themselves. Most industry leading frameworks have tooling built > in > > to read column headers though, for example python/pandas read_csv(). > > The disadvantage of doing that is that a single pgbench run with --log will > generate many files when using -j, to avoid contention. The easiest way to > deal with that after the run is to cat all the log files to a larger one. > If > you do that with headers present, you obviously have a few bogus rows (the > heades from the various files). > > I guess you could avoid the "worst" of that by documenting that the > combined > log file should be built by > cat {$log_prefix}.${pid} {$log_prefix}.${pid}.* > and only outputting the header in the file generated by the main thread. > > > > We can improve the experience for users by adding column headers to > pgbench > > logfiles with an optional command line flag, `--log-header`. This will > keep > > the API backwards compatible by making users opt-in to the column > headers. > > It follows the existing pattern of having conditional flags in pgbench’s > > API; the `--log` option would have both –log-prefix and –log-header if > this > > work is accepted. > > > The implementation considers the column headers only when the > > `--log-header` flag is present. The values for the columns are taken > > directly from the “Per-Transaction Logging” section in > > https://www.postgresql.org/docs/current/pgbench.html and takes into > account > > the conditional columns `schedule_lag` and `retries`. > > > > > > Below is an example of what that logfile will look like: > > > > > > pgbench postgres://postgres:postgres@localhost:5432/postgres --log > > --log-header > > > > client_id transaction_no time script_no time_epoch time_us > > Independent of your patch, but we imo ought to combine time_epoch time_us > in > the log output. There's no point in forcing consumers to combine those > fields, > and it makes logging more expensive... And if we touch this, we should > just > switch to outputting nanoseconds instead of microseconds. > > It also is quite odd that we have "time" and "time_epoch", "time_us", where > time is the elapsed time of an individual "transaction" and time_epoch + > time_us together are a wall-clock timestamp. Without differentiating > between > those, the column headers seem not very useful, because one needs to look > in > the documentation to understand the fields. > > > I don't think there's all that strong a need for backward compatibility in > pgbench. We could just change the columns as I suggest above and then > always > emit the header in the "main" log file. > I updated the patch to always log the header and only off the main thread. As for the time headers, I will work on renaming/combining those in a separate commit as: time -> time_elapsed time_epoch + time_us -> time_completion_us > Greetings, > > Andres Freund >
v2-0001-Adds-colum-headers-to-the-log-file-produced-by-pg.patch
Description: Binary data