On 5/28/23 08:48, Thomas Munro wrote:
Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize.
I've gone through this patch and it looks pretty good to me. A few things:
+ * rel_setment_size, we will truncate the K+1st segment
to 0 length
rel_setment_size -> rel_segment_size
+ * We used a phony GUC with a custome show function, because we don't
custome -> custom
+ if (strcmp(endptr, "kB") == 0)
Why kB here instead of KB to match MB, GB, TB below?
+ int64 relseg_size; /* blocks per segment of large relation
*/
This will require PG_CONTROL_VERSION to be bumped -- but you are
probably waiting until commit time to avoid annoying conflicts, though I
don't think it is as likely as with CATALOG_VERSION_NO.
Some random thoughts:
Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.
I feel like temp file segsize should be separately configurable for the
same reason that we are leaving it as 1GB for now.
Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).
Maybe not, but it is the way these things are done in general, .e.g.
wal_segment_size, so I don't think it will be too controversial.
Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that. That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.
Right -- not sure we should be troubling ourselves with trying to
optimize away ops that are very fast, unless they are computed trillions
of times.
I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one). Other ideas would be to
store the shift value instead of the size, or store the max block
number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it). The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.
A custom show function seems like a reasonable solution here.
A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers. It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent. We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...). We can address that
separately (perhaps by finding my old patch...)
I think addressing this separately is fine, though maybe enforcing some
reasonable minimum in initdb would be a good idea for this patch. For my
2c SEGSIZE == BLOCKSZ just makes very little sense.
Lastly, I think the blockno_to_segno(), blockno_within_segment(), and
blockno_to_seekpos() functions add enough readability that they should
be committed regardless of how this patch proceeds.
Regards,
-David