I'm re-raising this issue (since I needed to track down why I'd reflexively saved a 512g backup of the filesystem where this first happened).
I'm 99% sure the "bad_alloc" is from LLVM. It happened multiple times on different servers (running a similar report) after setting jit=on during pg13 upgrade, and never happened since re-setting jit=off. I reproduced a memory leak, but I don't think you publicized a fix for it. I *think* the memory leak probably explains the bad_alloc. We were seeing postgres killed by OOM every ~16hours, although I don't see how that's possible from such a tiny leak. Our query also uses postgis (at first I said it didn't), so it may be that they're involved. I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur. On Wed, Oct 07, 2020 at 04:02:42PM -0700, Andres Freund wrote: > Hi, > > Sorry for the second send of this email, but somehow I managed to mangle > the headers in the last version I sent. > > On 2020-10-03 19:01:27 -0700, Andres Freund wrote: > > On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote: > > > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote: > > > > > I was able to make a small, apparent leak like so. This applies to > > > > > postgis3.0/postgres12/LLVM5/centos7, and > > > > > postgis3.1alpha/postgres13/LLVM9/centos8. > > > > > > > > > > Inlining is essential to see the leak, optimization is not. Showing > > > > > every 9th > > > > > query: > > > > > > > > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET > > > > > jit_optimize_above_cost=-1; SET jit_above_cost=0; SET > > > > > jit_expressions=on; SET log_executor_stats=on; SET > > > > > client_min_messages=debug; SET jit=on; explain analyze SELECT > > > > > st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts > > > > > 2>&1 |grep 'resident' |sed -n '1~9p' > > > > > ! 46024 kB max resident size > > > > > ! 46492 kB max resident size > > > > > > > > Huh, that's certainly not good. Looking. > > > > > > Reproduce like: > > > postgres=# CREATE EXTENSION postgis; > > > postgres=# CREATE TABLE geo(g geometry(point)); > > > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) > > > FROM generate_series(1,99999); > > > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET > > > client_min_messages=debug; SET log_executor_stats=on; explain analyze > > > SELECT st_x(g) FROM geo; > > > > It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically > > whenever we load an IR module to inline from, it renames its type names > > to make them not conflict with already known type names in the current > > LLVMContext. That, over time, leads to a large number of type names in > > memory. We can avoid the re-loading of the module by instead cloning it, > > but that still leads to new types being created for some internal > > reasons. It's possible to create/drop separate LLVMContext instances, > > but we'd have to figure out when to do so - it's too expensive to always > > do that. > > I spent a lot of time (probably too much time) trying to find a good way > out of that. I think I mostly figured it out, but it's gonna take me a > bit longer to nail down the loose ends. The nice part is that it > actually speeds up inlining a lot. The bad part is that it's not a small > change, so I don't yet know how to deal with existing releases. > > Details: > > The issue with the leaks is basically that when LLVM loads a bitcode > module it doesn't "unify" named types with types in other bitcode module > - they could after all be entirely independent. That means that we end > up with a lot of types after a while of running. > > That ought to be addressable by not re-loading modules we inline from > from disk every time we inline, but unfortunately that doesn't work well > either: The "cross module import" logic unfortunately modifies the types > used in the "source module", leading to subsequent imports not working > well. That last issue is why I had moved to re-loading modules from > "disk" during import. > > The partial patch I've written doesn't use the existing cross module > import code anymore, but moves to the lower level functionality it > uses. For traditional compilers, which is what that linker is mainly > written for, it makes sense to reduce memory usage by "destructively" > moving code from the source into the target - it'll only be used > once. But for the JIT case it's better to do so non-destructively. That > requires us to implement some "type resolution" between the source and > target modules, but then gains not needing to reload / copy the source > module anymore, saving a lot of the costs. > > > It's possible that for existing releases we ought to go for a simpler > approach, even if it's not as good. The only really feasible way I see > is to basically count the number of times inlining has been done, and > then just re-initialize after a certain time. But that's pretty darn > yucky. > > Greetings, > > Andres Freund