pgbench logging broken by time logic changes
Back on March 10 Thomas Munro committed, after wrestling with multiple reworks, the pgbench changes from Fabien and the crew. The feature to synchronize startup is one I'm looking forward to testing now that I have a packaged beta. Variations on that problem have bitten me so many times I added code last year to my pgbench processing pipeline to just throw out the first and last 10% of every data set.

Before I could get to startup timing I noticed the pgbench logging output was broken by commit 547f04e7 "Improve time logic": https://www.postgresql.org/message-id/E1lJqpF-00064e-C6%40gemulon.postgresql.org

A lot of things are timed in pgbench now, so I appreciate the idea. Y'all started that whole big thread about sync on my birthday though, and I didn't follow the details of what that was reviewed against. For the logging use case I suspect it's just broken everywhere. The two platforms I tested were a PGDG Ubuntu beta1 apt install and a Mac git build. Example:

$ createdb pgbench
$ pgbench -i -s 1 pgbench
$ pgbench -S -T 1 -l pgbench
$ head pgbench_log.*
0 1 1730 0 1537380 70911
0 2 541 0 1537380 71474

The epoch time is the 5th column in the output, and this week it should look like this:

0 1 1411 0 1623767029 732926
0 2 711 0 1623767029 733660

If you're not an epoch guru who recognizes what's wrong already, you might grab https://github.com/gregs1104/pgbench-tools/ and party like it's 1970 to see it:

$ ~/repos/pgbench-tools/log-to-csv 1 local < pgbench_log* | head
1970-01-18 14:03:00.070911,0,1.73,1,local
1970-01-18 14:03:00.071474,0,0.541,1,local

I have a lot of community oriented work backed up behind this right now, so I'm gonna be really honest. This time rework commit in its current form makes me uncomfortable at this point in the release schedule. The commit has already fought through two rounds of platform specific bug fixes. But since the buildfarm doesn't test the logging feature, that whole process is suspect.

My take on the PostgreSQL way to proceed: this bug exposes that pgbench logging is a feature we finally need to design testing for. We need a new buildfarm test and then a march through a full release phase to see how it goes. Only then should we start messing with the time logic. Even if we fixed the source today on both my test platforms, I'd still be nervous that beta 2 could ship and more performance testing could fall over from this modification. And that's cutting things a little close.

The fastest way to get me back to comfortable would be to unwind 547f04e7 and its associated fixes and take it back to review. I understand the intent and value, and I appreciate the work so far. The big industry architecture shift from Intel to ARM has me worried about time overhead again, the old code is wonky, and in the PG15 release cycle I already have resources planned around this area.

# PG15 Plans

I didn't intend to roll back in after time away and go right to a revert review. But I also really don't want to start my public PG14 story documenting the reality that I had to use PG13's pgbench to generate my examples either. I can't fight much with this logging problem while also doing my planned public performance testing of PG14. I already had to push back a solid bit of Beta 1 PR from this week, some "community PG is great!" promotional blogging.

Let me offer what I can commit to from Crunchy corporate. I'm about to submit multiple pgbench feature changes to the open CF starting July, with David Christensen.
We and the rest of Crunchy will happily help re-review this time change idea, its logging issues, and its testing, rejoin the study of platform time call overhead, and bash the whole mess into shape for PG15. I personally am looking forward to it.

The commit made a functional change to the way connection time is displayed; that I can take or leave as committed. I'm not sure it can be decoupled from the rest of the changes. It did cause a small breaking pgbench output parsing problem for me, just a trivial regex adjustment. That break would fit in fine with my upcoming round of submissions.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy, Crunchy Data
Re: pgbench logging broken by time logic changes
On Wed, Jun 16, 2021 at 2:59 PM Fabien COELHO wrote:

> I'm unhappy because I already added tap tests for time-sensitive features
> (-T and others, maybe logging aggregates, cannot remember), which have
> been removed because they could fail under some circonstances (eg very
> very very very slow hosts), or required some special handling (a few lines
> of code) in pgbench, and the net result of this is there is not a single
> test in place for some features:-(

I understand your struggle, and I hope I was clear about two things:

- I am excited by all the progress made in pgbench, and this problem is an integration loose end rather than a developer failure at any level.
- Doing better in this messy area takes a team that goes from development to release management, and I had no right to complain unless I brought resources to improve the specific areas of the process that I want to be better.

I think the only thing you and I disagree on is that you see a "first issue in a corner case" where I see a process failure that is absolutely vital for me to improve. Since the reality is that I might be the best positioned person to actually move said process forward in a meaningful long-term way, I have every intention of applying pressure to the area you're frustrated at.

Crunchy has a whole parallel review team to the community one now, focused on what our corporate and government customers need for software process control and procedure compliance. The primary business problem I'm working on now is how to include performance review in that mix. I already know I need to re-engage with you over how I need min/max numbers in the aggregate logging output to accomplish some valuable goals. When I get around to that this summer, I'd really enjoy talking with you a bit, video call or something, about really any community topic you're frustrated with. I have a lot riding now on the productivity of the PostgreSQL hacker community and I want everyone to succeed at the best goals.

> There is no problem with proposing tests, the problem is that they are
> accepted, or if they are accepted then that they are not removed at the
> first small issue but rather fixed, or their limitations accepted, because
> testing time-sensitive features is not as simple as testing functional
> features.

For 2020 Crunchy gave me a sort of sabbatical year to research community oriented benchmarking topics. Having a self contained project in my home turned out to be the perfect way to spend *that* wreck of a year. I made significant progress toward the idea of having a performance farm for PostgreSQL. On my laptop today is a 14GB database with 1s resolution latency traces for 663 days of pgbench time, running 4 workloads across a small bare metal farm of various operating systems and hardware classes. I can answer questions like "how long does a typical SSD take to execute an INSERT commit?" across my farm with SQL. It's at the "works for me!" stage of development, and I thought this was the right time in the development cycle to start sharing improvement ideas from my work; thus the other submissions in progress I alluded to.

The logging feature is in an intermediate spot where validating it requires light custom tooling that compares its output against known variables like the system time. It doesn't quite have a performance component to it. Since this time logic detail is a well known portability minefield, I thought demanding that particular test was a pretty easy sell.
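To make that concrete, the check I keep asking for doesn't need much. Here's a rough sketch of the idea as a shell snippet -- the log file name and the 10 second slop are placeholders I picked, not anything from an actual patch -- comparing the epoch column of the first per-transaction log line against the system clock:

# the 5th column of a per-transaction log line should be epoch seconds,
# within a few seconds of the clock, not a number from 1970
now=$(date +%s)
logged=$(awk 'NR==1 {print $5}' pgbench_log.*)
delta=$((now - logged))
if [ "$delta" -gt 10 ] || [ "$delta" -lt -10 ]; then
    echo "FAIL: log epoch $logged is ${delta}s away from system time $now"
else
    echo "ok: log epoch within 10s of system time"
fi

Wrap that sort of comparison in a TAP test and the 1970 output above would have been caught on every animal that runs it.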
That you in particular are frustrated here makes perfect sense to me. I am fresh and ready to carry this forward some distance, and I hope the outcome makes you happy.
pgbench: INSERT workload, FK indexes, filler fix
Attached is a combined diff for a set of related patches to the built-in pgbench workloads. One commit adds an INSERT workload. One fixes the long standing 0 length filler issue. A new --extra-indexes option adds the indexes needed for the lookups added by the --foreign-keys option. The commits are independent but overlap in goals. I'm grouping them here mainly to consolidate this message, covering the feedback leading to this particular combination plus a first review from me. More graphs etc. coming as my pgbench toolchain settles down again.

Code all by David Christensen based on vague specs from me, errors probably mine; the changes are also at https://github.com/pgguru/postgres/commits/pgbench-improvements

David ran through the pgbench TAP regression tests and we're thinking about how to add more for changes like this. Long term that collides with performance testing for things like CREATE INDEX, which I've done some work on myself recently in pgbench-tools.

After bouncing the possibilities around a little, David and I thought this specific set of changes might be the right amount of change for one PG version. Core development could bite on all these pgbench changes or even more [foreshadowing] as part of a themed rework of pgbench's workload that's known to adjust results a bit, so beware direct comparisons to old versions. That's what I'd prefer to do: a break-it-all-at-once strategy for these items and whatever else we can dig up this cycle. I'll do my usual thing to help with that, starting with more benchmark graphs of this patch and such once my pgbench toolchain settles again.

To me pgbench should continue to demonstrate good PostgreSQL client behavior, and all this is just modernizing polish. Row size and indexing matter of course, but none of these changes really alter the fundamentals of pgbench results. With modern hardware acceleration, the performance drag from the increased size of the filler is so much further down in the benchmark noise than where I started with PG. The $750 USD AMD retail chip in my basement lab pushes 1M TPS of prepared SELECT statements over sockets. Plus or minus 84 bytes per row in a benchmark database doesn't worry me so much anymore. Seems down there with JSON overhead as a lost micro-optimization fight nowadays.

# Background: pgbench vs. sysbench

This whole rework idea came from a performance review pass where I compared pgbench and sysbench again, as both have evolved a good bit since my last comparison. All of the software defined storage testing brewing right now is shining a brighter light on both tools lately than I've seen in a while. The goal I worked on a bit (with Joe Conway and RedHat, thank you to our sponsors) was how to make both tools closer to equal when performing similar tasks.

pgbench can duplicate the basics of the sysbench OLTP workload easily enough, running custom pgbench scripts against the generated pgbench_accounts and/or the initially empty pgbench_history. Joe and I did some work on sysbench to improve its error handling to where it reconnected automatically as part of that. How to add a reconnection feature to pgbench is a struggle because of where it fits between PG's typical connection and connection pooler abstractions; different story than this one. sysbench had the basics and just needed some error handling bug fixes, which might even have made their way upstream.

These three patches are the changes I thought core PG could use in parallel, as a mix of correctness, new features, and fair play in benchmarking.
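As a point of reference for the custom script approach mentioned above, what people hand-roll against the initially empty history table today looks something like this. It's just a sketch using the stock pgbench_history columns, not code from these patches:

\set aid random(1, 100000 * :scale)
\set bid random(1, :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);

Save it to a file and run it with something like "pgbench -n -f insert_history.sql -c 16 -T 60", where the file name and client count are whatever fits your test.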
# INSERT workload

The easiest way to measure the basic COMMIT overhead of network storage is by doing an INSERT into an empty database and seeing the latency. I've been doing that regularly since 9.1 added sync rep, and that was the easiest way to test client scaling. From my perspective as an old CRUD app writer, creating a row is the main interesting operation that's not already available in pgbench. (No one has a DELETE heavy workload for very long.)

Some chunk of pgbench users are trying to do that job now using the built-ins, and none of the options fit well. Anything that touches the accounts table becomes heavily wrapped into the checkpoint cycle, and extracting signal from checkpoint noise is so hard dudes charge for books about it. In this context I trust INSERT results more than I do the output from pg_test_fsync, which is too low level for me to recommend as a general-purpose tool. For better or worse pgbench is a primary tool in that role for PG customers, and the INSERT scaling looks great all over. I've attached an early sample comparing 5 models of SSD to show it; what looks like a PG14 regression there is a testing artifact I'm working on.

The INSERT workload is useful with or without the history indexes, which again as written here are only created if you ask for the FKs. When I do these performance studies of INSERT scaling as a new history table builds, for really no good
Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15
On Thu, Jun 8, 2023 at 6:18 PM Andres Freund wrote:

> Could you get a profile with call graphs? We need to know what leads to all
> those osq_lock calls.
> perf record --call-graph dwarf -a sleep 1
> or such should do the trick, if run while the workload is running.

I'm doing something wrong because I can't find the slow part in the perf data; I'll get back to you on this one.

> I think it's unwise to compare builds of such different vintage. The compiler
> options and compiler version can have substantial effects.
> I recommend also using -P1. Particularly when using unix sockets, the
> specifics of how client threads and server threads are scheduled plays a huge
> role.

Fair suggestions. Those graphs come out of pgbench-tools, where I profile all the latency; fast results for me are ruler flat. It's taken me several generations of water cooling experiments to reach that point, but even that only buys me 10 seconds before I can overload a CPU to higher latency with tougher workloads.

Here's a few seconds of slightly updated examples, now with matching PGDG sourced 14+15 on the 5950X and with sched_autogroup_enabled=0 too:

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5434 -P 1 pgbench
pgbench (14.8 (Ubuntu 14.8-1.pgdg23.04+1))
progress: 1.0 s, 1032929.3 tps, lat 0.031 ms stddev 0.004
progress: 2.0 s, 1051239.0 tps, lat 0.030 ms stddev 0.001
progress: 3.0 s, 1047528.9 tps, lat 0.030 ms stddev 0.008...

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5432 -P 1 pgbench
pgbench (15.3 (Ubuntu 15.3-1.pgdg23.04+1))
progress: 1.0 s, 171816.4 tps, lat 0.184 ms stddev 0.029, 0 failed
progress: 2.0 s, 173501.0 tps, lat 0.184 ms stddev 0.024, 0 failed...

On the slow runs it will even do this; watch my 5950X accomplish 0 TPS for a second!

progress: 38.0 s, 177376.9 tps, lat 0.180 ms stddev 0.039, 0 failed
progress: 39.0 s, 35861.5 tps, lat 0.181 ms stddev 0.032, 0 failed
progress: 40.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 41.0 s, 222.1 tps, lat 304.500 ms stddev 741.413, 0 failed
progress: 42.0 s, 101199.6 tps, lat 0.530 ms stddev 18.862, 0 failed
progress: 43.0 s, 98286.9 tps, lat 0.328 ms stddev 8.156, 0 failed

Gonna have to measure seconds/transaction if this gets any worse.

> I've seen such issues in the past, primarily due to contention internal to
> cgroups, when the memory controller is enabled. IIRC that could be alleviated
> to a substantial degree with cgroup.memory=nokmem.

I cannot express on-list how much I dislike everything about the cgroups code. Let me dig up the right call graph data first and we'll know more then. The thing that keeps me from chasing kernel tuning too hard is seeing the PG14 runs go perfectly every time. This is a really weird one. All the suggestions are much appreciated.
Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15
Let me start with the happy ending to this thread:

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -P 1 pgbench
pgbench (15.3 (Ubuntu 15.3-1.pgdg23.04+1))
progress: 1.0 s, 1015713.0 tps, lat 0.031 ms stddev 0.007, 0 failed
progress: 2.0 s, 1083780.4 tps, lat 0.029 ms stddev 0.007, 0 failed...
progress: 8.0 s, 1084574.1 tps, lat 0.029 ms stddev 0.001, 0 failed
progress: 9.0 s, 1082665.1 tps, lat 0.029 ms stddev 0.001, 0 failed
tps = 1077739.910163 (without initial connection time)

Which even seems a whole 0.9% faster than 14 on this hardware! The wonders never cease.

On Thu, Jun 8, 2023 at 9:21 PM Andres Freund wrote:

> You might need to add --no-children to the perf report invocation, otherwise
> it'll show you the call graph inverted.

My problem was not writing kernel symbols out; I was only getting addresses for some reason. This worked:

sudo perf record -g --call-graph dwarf -d --phys-data -a sleep 1
perf report --stdio

And once I looked at the stack trace I immediately saw the problem, fixed the config option, and this report is now closed as PEBKAC on my part. Somehow I didn't notice the 15 installs on both systems had log_min_duration_statement=0, and that's why the performance kept dropping *only* on the fastest runs.

What I've learned today, then, is that if someone sees osq_lock in simple perf top output on an oddly slow server, it's possible they are overloading a device writing out log file data. Leaving out the boring parts, the call trace you might see is:

EmitErrorReport
__GI___libc_write
ksys_write
__fdget_pos
mutex_lock
__mutex_lock_slowpath
__mutex_lock.constprop.0
71.20% osq_lock

Everyone was stuck trying to find the end of the log file to write to it, and that was the entirety of the problem. Hope that call trace and info helps out some future goofball making the same mistake. I'd wager this will come up again.

Thanks to everyone who helped out, and I'm looking forward to PG16 testing now that I have this rusty, embarrassing warm-up out of the way.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy
Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15
On Fri, Jun 9, 2023 at 4:06 AM Gurjeet Singh wrote:

> There is no mention of perf or similar utilities in pgbench-tools
> docs. I'm guessing Linux is the primary platform pgbench-tools gets
> used on most. If so, I think it'd be useful to mention these tools and
> snippets in there to make others lives easier.

That's a good idea. I've written out guides multiple times for customers who are crashing and need to learn about stack traces, to help them become self-sufficient with the troubleshooting parts it's impractical for me to do for them. If I can talk people through gdb, I can teach them perf. I have a lot of time to work on pgbench-tools set aside this summer; I'm gonna finally deprecate the gnuplot backend and make every graph as nice as the ones I shared here.

I haven't been aggressive about pushing perf because a lot of customers at Crunchy--a disproportionately larger number than typical, I suspect--have operations restrictions that just don't allow DBAs direct access to a server's command line. So perf commands are just out of reach before we even get to the permissions it requires. I may have to do something really wild to help them, like see if the right permissions setup would allow PL/python3 or similar to orchestrate a perf session in a SQL function.
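To sketch what I mean -- purely hypothetical, it assumes plpython3u is installed, a superuser creates the function, and the OS setup even lets the postgres user run perf in the first place -- something like:

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION perf_sample(seconds integer DEFAULT 10)
RETURNS text LANGUAGE plpython3u AS $$
# Capture a system-wide profile from inside the database.  Whether this
# works at all depends on kernel.perf_event_paranoid and what the
# postgres OS user is allowed to execute.
import subprocess
result = subprocess.run(
    ["perf", "record", "-g", "--call-graph", "dwarf", "-a",
     "-o", "/tmp/perf.data", "sleep", str(seconds)],
    capture_output=True, text=True)
return result.stderr or "wrote /tmp/perf.data"
$$;

Whether any of the locked down environments I'm describing would ever allow that is exactly the open question.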
Re: Use COPY for populating all pgbench tables
On Tue, May 23, 2023 at 1:33 PM Tristan Partin wrote:

> We (Neon) have noticed that pgbench can be quite slow to populate data
> in regard to higher latency connections. Higher scale factors exacerbate
> this problem. Some employees work on a different continent than the
> databases they might be benchmarking. By moving pgbench to use COPY for
> populating all tables, we can reduce some of the time pgbench takes for
> this particular step.

When latency is continent-sized, pgbench should be run with server-side table generation instead of using COPY at all, for any table. The default COPY based pgbench generation is only intended for testing where the client and server are very close on the network. Unfortunately there's no simple command line option to change just that one thing about how pgbench runs. You have to construct a command line that spells out each and every step you want instead. You probably just want this form:

$ pgbench -i -I dtGvp -s 500

That's server-side table generation with all the usual steps. I use this instead of COPY in pgbench-tools so much now, basically whenever I'm talking to a cloud system, that I have a simple 0/1 config option to switch between the modes, and this long weird one is the default now.

Try that out, and once you see the numbers my bet is you'll see extending which tables get COPY isn't needed by your use case anymore. Basically, if you are close enough to use COPY instead of server-side generation, you are close enough that the tables besides accounts won't add up to enough time to be worth optimizing.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy
Re: index prefetching
On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans

At the point Greg Stark was hacking on this, the underlying OS async I/O features were tricky to fit into PG's I/O model, and both of us did a lot of review work just to find working common ground that PG could plug into. Linux POSIX advisories were completely different from Solaris's async model, the other OS used for validation that the feature worked, with the hope being that designing against two APIs would be better than just focusing on Linux. Since that foundation was all so brittle and limited, scope was limited to just the bitmap heap scan, since it seemed to have the best return on time invested given the parts of async I/O that did and didn't scale as expected.

As I remember it, the idea was to get the basic feature out the door and gather feedback about things like whether the effective_io_concurrency knob worked as expected before moving on to other prefetching. Then that got lost in filesystem upheaval land, with so much drama around Solaris/ZFS and Oracle's btrfs work. I think it's just that no one ever got back to it.

I have all the workloads that I use for testing automated into pgbench-tools now, and this change would be easy to fit into testing on them, as I'm very heavy on block I/O tests. To get PG to reach full read speed on newer storage I've had to do some strange tests, like doing index range scans that touch 25+ pages. Here's that one as a pgbench script:

\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid LIMIT :range;

And then you use '-Dmultiplier=10' or such to crank it up. Database 4X RAM, multiplier=25 with 16 clients is my starting point on it when I want to saturate storage. Anything that lets me bring those numbers down would be valuable.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy
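If you want to try that script yourself, an invocation along the lines of what I run is below. The file name is just whatever you saved the script as, and the client count and duration are only my usual starting point, not magic numbers:

$ pgbench -n -f range_scan.sql -M prepared -Dmultiplier=25 -c 16 -j 16 -T 300 pgbench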
Re: Use COPY for populating all pgbench tables
On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh wrote:

> > $ pgbench -i -I dtGvp -s 500
>
> The steps are severely under-documented in pgbench --help output.

I agree it's not easy to find information. I just went through double checking the order recently enough to remember what I did. The man pages have this:

> Each step is invoked in the specified order. The default is dtgvp.

Which was what I wanted to read. Meanwhile the --help output says:

> -I, --init-steps=[dtgGvpf]+ (default "dtgvp")

Which has the information without a lot of context for what it's used for. I'd welcome some text expansion that adds a minimal but functional improvement to the existing help output; I don't have such text.
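For anyone who finds this thread while searching, here's roughly how the letters break down, paraphrasing the main documentation rather than proposing new --help text, so double-check against your version's docs:

$ pgbench -i -I dtGvp -s 500 pgbench
  d - drop any existing pgbench tables
  t - create the tables
  g - generate the data client-side (the COPY path)
  G - generate the data server-side
  v - run VACUUM after loading
  p - create primary key indexes
  f - create foreign key constraints (not part of the default "dtgvp")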
Re: Potential ABI breakage in upcoming minor releases
On Thu, Nov 14, 2024 at 5:41 PM Noah Misch wrote:

> I'm hearing the only confirmed impact on non-assert builds is the need to
> recompile timescaledb. (It's unknown whether recompiling will suffice for
> timescaledb. For assert builds, six PGXN extensions need recompilation.)

That matches what our build and test teams are seeing. We dug into the two lines of impacted Citus code; they are just touching columnar metadata. We dragged Marco into a late night session to double check that with the Citus columnar regression tests and look for red flags in the code. With an assert build of Citus compiled against 16.4 running against PostgreSQL 16.5, he hit the assert warnings, but the tests pass and there's no sign or suspicion of a functional impact:

CREATE TABLE columnar_table_1 (a int) USING columnar;
INSERT INTO columnar_table_1 VALUES (1);
+WARNING: problem in alloc set Stripe Write Memory Context: detected write past chunk end in block 0x563ee43a4f10, chunk 0x563ee43a6240
+WARNING: problem in alloc set Stripe Write Memory Context: detected write past chunk end in block 0x563ee4369bb0, chunk 0x563ee436acb0
+WARNING: problem in alloc set Stripe Write Memory Context: detected write past chunk end in block 0x563ee4369bb0, chunk 0x563ee436b3c8

Thanks to everyone who's jumped in to investigate here. With the PL/Perl CVE at an 8.8, sorting out how to get that fix to everyone and navigate the breakage is very important.

--
Greg Smith, Crunchy Data
Director of Open Source Strategy
Re: Increase default maintenance_io_concurrency to 16
On Tue, Mar 18, 2025 at 5:04 PM Andres Freund wrote:

> Is that actually a good description of what we assume? I don't know where that
> 90% is coming from?

That one's all my fault. It was an attempt to curve-fit backwards why the 4.0 number Tom set with his initial commit worked as well as it did, given that the underlying storage was closer to 50X as slow, and I sold the idea well enough for Bruce to follow the reasoning and commit it. Back then there was a regular procession of people who measured the actual rate and wondered why there was an order of magnitude difference between those measurements and the parameter. Pointing them toward thinking in terms of the cached read percentage did a reasonable job of deflecting them onto why the model was more complicated than it seems. I intended to follow that up with more measurements, only to lose the whole project into a non-disclosure void I have only recently escaped.

I agree with your observation that the underlying cost of a non-sequential read stall on cloud storage is not markedly better than the original random:sequential ratio of mechanical drives. And the PG17 refactoring to improve I/O chunking worked to magnify that further.

The end of this problem I'm working on again is assembling some useful mix of workloads such that I can try changing one of these magic constants with higher confidence. My main working set so far is write performance regression test sets against the Open Street Map loading workload I've been blogging about, plus the old read-only queries of the SELECT-only test spaced along a scale/client grid. My experiments so far have been around another Tom special, the maximum buffer usage count limit, which turned into another black hole full of work I have only recently escaped. I haven't really thought much yet about a workload set that would allow adjusting random_page_cost. On the query side we've been pretty heads-down on the TPC-H and ClickBench sets. I don't have buffer internals data from those yet though; will have to add that to the work queue.

--
Greg Smith
Director of Open Source Strategy, Crunchy Data
greg.sm...@crunchydata.com