On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <n...@leadboat.com> wrote:
> On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
>> >> Added, see attached patch (based on v3.1)
>> >
>> > With this applied, I am getting a couple of errors I have not seen
>> > before after extensive crash recovery testing:
>> > ERROR: attempted to delete invisible tuple
>> > ERROR: unexpected chunk number 1 (expected 2) for toast value
>> > 100338365 in pg_toast_16425
>>
>> Huh, it seems it's not related to GIN at all... Indexes don't play with
>> the toast machinery. The single place where this error can occur is
>> heap_delete() - deleting an already-deleted tuple.
>
> Like you, I would not expect gin_alone_cleanup-4.patch to cause such an
> error. I get the impression Jeff has a test case that he had run in many
> iterations against the unpatched baseline. I also get the impression
> that a similar or smaller number of its iterations against
> gin_alone_cleanup-4.patch triggered these two errors (once apiece, or
> multiple times?). Jeff, is that right?
Because the unpatched baseline suffers from the bug that was the original
topic of this thread, I haven't been able to test against it; runs fail
from that other bug before they go long enough to hit these new ones. Both
errors start within a few minutes of each other, but do not appear to be
related beyond that. Once they start happening, they occur repeatedly.

> Could you describe the test case in sufficient detail for Teodor to
> reproduce your results?

I spawn a swarm of processes that each update a counter in a randomly
chosen row, selected via the gin index. They do this as fast as possible
until the server intentionally crashes. When it recovers, they compare
notes and check that the results are consistent. In this case, though, the
problem is not inconsistent results, but errors during the updating stage.

The attached patch (crash_REL9_6.patch) introduces a mechanism to crash
the server, a mechanism to fast-forward the XID counter, and some
additional logging that I sometimes find useful (I haven't been using it
in this case, but I don't want to rip it out).

The perl script implements the core of the test harness; a rough sketch of
its per-worker loop appears below. The shell script sets up the server
(using hard-coded paths for the data directory and the binaries, so those
will need to be changed) and then calls the perl script in a loop.

Running on an 8-core system, I've never seen it hit a problem in less than
9 hours of run time. The test produces copious amounts of logging to
stdout; this is how I look through the logs for this particular problem:

    sh do.sh >& do.err &
    tail -f do.err | fgrep ERROR

The more I think about it, the more I think gin is just an innocent
bystander for which I happen to have a particularly demanding test. I
suspect that something about snapshots and wrap-around may be broken.

Cheers,

Jeff
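P.S. For anyone who wants the gist of the perl script without opening the
attachment, each worker boils down to roughly this (a minimal sketch, not
the actual count.pl: the table name, the array column, the key range, and
the connection parameters are all made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # One worker: hold a connection and hammer the counter column,
    # reaching the row through the gin index on the array column.
    my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '',
                           { AutoCommit => 1, RaiseError => 1 });
    my $upd = $dbh->prepare(
        'UPDATE foo SET count = count + 1 WHERE tags @> ARRAY[?::text]');
    while (1) {
        # Pick a random key so the updates spread over the whole table;
        # the loop dies (RaiseError) when the server is crashed out
        # from under it.
        $upd->execute('key' . int(rand(10_000)));
    }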
Attachments:
  count.pl
  crash_REL9_6.patch
  do.sh