Responses inline below
Thanks,
reid

-----Original Message-----
From: Tom Lane <t...@sss.pgh.pa.us> 
Sent: Monday, April 19, 2021 9:54 AM
To: Reid Thompson <reid.thomp...@omnicell.com>
Cc: pgsql-general@lists.postgresql.org
Subject: Re: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

Reid Thompson <reid.thomp...@omnicell.com> writes:
> Hi I'm looking for some guidance related to the subject line issue.

Is this repeatable?  If so you've found a bug, I think.

Alvaro has noted at
https://www.postgresql.org/message-id/20210419155928.GA3253%40alvherre.pgsql
that the bug fix to the error reporting should correct this (with the fix, the
hash join will abort rather than fail to recognize that it needs to and then
attempt a later read).
Somewhat -- I think it depends on what else may be writing temp files (it has
happened twice in the past few days, with successful runs in the interim, and I
believe this query had been in place for a long time before the first
occurrence).

>   1.  That the error message has been updated ( i.e. SUCCESS is not 
> the proper value)

Yeah, what this really indicates is an incomplete read (file shorter than 
expected).  Since 11.8, we've improved the error reporting for that, but that 
wouldn't in itself fix whatever the underlying problem is.  

See message noted above from Alvaro.

>   2.  That the error is due to running out of temporary space either disk 
> space or maybe temp_buffers?

That could be the proximate cause, although then there would be a bug that the 
original write failure wasn't detected.  But it seems about as likely that 
there's just some inconsistency between what the temp file writing code wrote 
and what the reading code expects to read.
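To check whether temp-space pressure is a plausible proximate cause, one option is to look at the cumulative temp-file statistics PostgreSQL already tracks, and to log temp-file creation going forward. A minimal sketch of a psql session (database name is a placeholder):

```sql
-- Cumulative temp files and bytes written per database since last stats reset
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

-- Log every temporary file created (size threshold 0 = log all);
-- requires appropriate privileges, and can be set per-session or in
-- postgresql.conf for the whole cluster
SET log_temp_files = 0;
```

If the failing query's temp usage approaches the free space on the temp tablespace's filesystem, running out of disk during the write phase becomes a credible explanation.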

Is this a parallelized hash join by any chance?  That's new in v11 if memory 
serves, so it'd be interesting to see if disabling enable_parallel_hash changes 
anything.

EXPLAIN on the host does not indicate parallelization anywhere. It does show
that we can narrow the result set by optimizing our query, which we are
implementing now; hopefully that will carry us to a scheduled upgrade
timeframe.
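For reference, Tom's suggested experiment can be run in a single session without affecting other backends; the query below stands in for the real hash-join query:

```sql
-- Confirm the current setting (enable_parallel_hash was added in v11)
SHOW enable_parallel_hash;

-- Disable parallel hash join for this session only and re-check the plan
SET enable_parallel_hash = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the query that triggered the error
```

If the failure stops reproducing with the setting off, that points at the parallel hash code path; if it persists, the problem lies elsewhere in the temp-file read/write logic.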

Anyway, I'd counsel updating to current (11.11), and then if you can still 
reproduce the problem, try to reduce it to a self-contained test case.

                        regards, tom lane

