Hi, On Thu, 11 Jan 2024 at 08:01, Michael Paquier <mich...@paquier.xyz> wrote: > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote: > > I have code review feedback as well, but I've saved that for my next email. > > Ah, cool. > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavu...@gmail.com> wrote: > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <mich...@paquier.xyz> wrote: > >> Oh, I understand it now. Yes, that makes sense. > >> I thought removing op_bytes completely ( as you said "This patch > >> extends it with two more operation sizes, and there are even cases > >> where it may be a variable" ) from pg_stat_io view then adding > >> something like {read | write | extend}_bytes and {read | write | > >> extend}_calls could be better, so that we don't lose any information. > > > > Upthread, Michael says: > > > >> I find the use of 1 in this context a bit confusing, because when > >> referring to a counter at N, then it can be understood as doing N > >> times a operation, > > > > I didn't understand this argument, so I'm not sure if I agree or > > disagree with it. > > Nazir has mentioned upthread one thing: what should we do for the case > where a combination of (io_object,io_context) does I/O with a > *variable* op_bytes, because that may be the case for the WAL > receiver? For this case, he has mentioned that we should set op_bytes > to 1, but that's something I find confusing because it would mean that > we are doing read, writes or extends 1 byte at a time. My suggestion > would be to use op_bytes = -1 or NULL for the variable case instead, > with reads, writes and extends referring to a number of bytes rather > than a number of operations.
I agree but we can't do this only for the *variable* cases since B_WAL_RECEIVER and other backends use the same pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean is, if two backends use the same pgstat_count_io_op_time() function call in the code; they must count the same thing (number of calls, bytes, etc.). It could be better to count the number of bytes for all the IOOBJECT_WAL IOs. > > I think these are the three proposals for handling WAL reads: > > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the > > number of calls to pg_pread() or similar > > 3) setting op_bytes to NULL and the number of reads is the number of > > calls to pg_pread() or similar > > 3) could be a number of bytes, actually. One important point is that we can't change only reads, if we decide to count the number of bytes for the reads; writes and extends should be counted as a number of bytes as well. > > Looking at the patch, I think it is still doing 2. > > The patch disables stats for the WAL receiver, while the startup > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to > discard the variable case. > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for > > all those IOOps for which it would be relevant. It kind of stinks but > > it would give us the freedom to document exactly what a single IOOp > > means for each combination of BackendType, IOContext, IOObject, and > > IOOp (as relevant) and still have an accurate number in the *bytes > > columns. Everyone will probably hate us if we do that, though. > > Especially because having bytes for the existing IOObjects is an > > existing feature. > > An issue I have with this one is that having multiple tuples for > each (object,context) if they have multiple op_bytes leads to > potentially a lot of bloat in the view. That would be up to 8k extra > tuples just for the sake of op_byte's variability. Yes, that doesn't seem applicable to me. > > A separate question: suppose [1] goes in (to read WAL from WAL buffers > > directly). Now, WAL reads are not from permanent storage anymore. Are > > we only tracking permanent storage I/O in pg_stat_io? I also had this > > question for some of the WAL receiver functions. Should we track any > > I/O other than permanent storage I/O? Or did I miss this being > > addressed upthread? > > That's a good point. I guess that this should just be a different > IOOp? That's not a IOOP_READ. A IOOP_HIT is also different. I think different IOContext rather than IOOp suits better for this. > > In terms of what I/O we should track in a streaming/asynchronous > > world, the options would be: > > > > 1) track read/write syscalls > > 2) track blocks of BLCKSZ submitted to the kernel > > 3) track bytes submitted to the kernel > > 4) track merged I/Os (after doing any merging in the application) > > > > I think the debate was largely between 2 and 4. There was some > > disagreement, but I think we landed on 2 because there is merging that > > can happen at many levels in the storage stack (even the storage > > controller). Distinguishing between whether or not Postgres submitted > > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO, > > but I think it might be confusing for the Postgres user trying to > > determine why their query is slow. It probably makes the most sense to > > still track in block size. > > > > No matter what solution we pick, you should get a correct number if > > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or, > > rather, there should be some way of getting an accurate number in > > bytes of the amount of a particular kind of I/O that has been done. > > Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads, > writes or extends are counted as bytes, because we don't have a fixed > operation size for some (object,context) cases. Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL IOs and 2 for the other IOs. -- Regards, Nazir Bilal Yavuz Microsoft