Amit Kapila <amit.kapil...@gmail.com> writes:
> On Thu, Jan 9, 2020 at 11:15 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Noah Misch <n...@leadboat.com> writes:
>>> Even so, a web search for "extend_brk" led to the answer.  By default,
>>> 32-bit AIX binaries get only 256M of RAM for stack and sbrk.  The new
>>> regression test used more than that, hence this crash.
>> Hm, so
>> (1) Why did we get a crash and not some more-decipherable out-of-resources
>> error?  Can we improve that experience?
>> (2) Should we be dialing back the resource consumption of this test?

> In HEAD, we have a guc variable 'logical_decoding_work_mem' by which
> we can control the memory usage of changes and we have used that, but
> for back branches, we don't have such a control.

I poked into this a bit more by running the src/test/recovery tests
under restrictive ulimit settings.  I used

	ulimit -s 1024
	ulimit -v 250000

(At least on my 64-bit RHEL6 box, reducing ulimit -v much below this
causes initdb to fail, apparently because the post-bootstrap process
tries to load all our tsearch and encoding conversion shlibs at once,
and it hasn't got enough VM space to do so.  Someday we may have to
improve that.)

I did not manage to duplicate Noah's crash this way.  What I see in the
v10 branch is that the new 006_logical_decoding.pl test fails, but with
a clean "out of memory" error.  The memory map dump that that produces
fingers the culprit pretty unambiguously:

...
  ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used
    ReorderBufferByXid: 24576 total in 2 blocks; 11888 free (3 chunks); 12688 used
    Slab: TXN: 8192 total in 1 blocks; 5208 free (21 chunks); 2984 used
    Slab: Change: 2170880 total in 265 blocks; 2800 free (35 chunks); 2168080 used
...
Grand total: 226714720 bytes in 27327 blocks; 590888 free (785 chunks); 226123832 used

The test case is only inserting 50K fairly-short rows, so this seems
like an unreasonable amount of memory to be consuming for that; and
even if you think it's reasonable, it clearly isn't going to scale to
large production transactions.

Now, the good news is that v11 and later get through
006_logical_decoding.pl just fine under the same restriction, so we did
something in v11 to fix this excessive memory consumption.  However,
unless we're willing to back-port whatever that was, this test case is
clearly consuming excessive resources for the v10 branch.

We're not out of the woods with the newer branches, either.  I also
observe that v12 and HEAD fall over, under these same test conditions,
with a stack-overflow error in the 012_subtransactions.pl test.  This
seems to be due to somebody's decision to use a heavily recursive
function to generate a bunch of subtransactions.  Is there a good
reason for hs_subxids() to use recursion instead of a loop?  If there
is, what's the value of using 201 levels rather than, say, 10?

Anyway it remains unclear why Noah's machine got a crash instead of
something more user-friendly.  But the reason why it's only in the v10
branch seems non-mysterious.

			regards, tom lane
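
For reference, on the branches that do have the knob Amit mentions,
logical_decoding_work_mem is an ordinary GUC, so the decoding memory
cap can be set per session or in postgresql.conf.  A purely
illustrative usage (the value shown is just the shipped default; the
setting does not exist in the back branches under discussion):

	SET logical_decoding_work_mem = '64MB';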
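
As to the recursion question: in plpgsql, every BEGIN ... EXCEPTION
block runs in its own subtransaction, so a loop can also burn through
many subtransaction XIDs without growing the call stack.  A rough
sketch of that approach (hypothetical code, not the actual helper in
012_subtransactions.pl; the function and table names are made up, and
the real test may specifically need nested subtransactions rather than
the sibling ones this produces):

	CREATE FUNCTION gen_subxids(n integer) RETURNS void
	LANGUAGE plpgsql AS $$
	BEGIN
	    FOR i IN 1 .. n LOOP
	        -- Entering a block with an EXCEPTION clause starts a new
	        -- subtransaction; the INSERT then forces a subxid to be
	        -- assigned.  The loop keeps the stack depth constant.
	        BEGIN
	            INSERT INTO subxid_probe VALUES (i);
	        EXCEPTION WHEN OTHERS THEN
	            NULL;  -- the block exists only to create the subxact
	        END;
	    END LOOP;
	END;
	$$;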