On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Yeah, but this would mean that now with parallel create index, it is
> possible that some tuples from the transaction would end up in index
> and others won't.

You mean some tuples from some past transaction that deleted a bunch of
tuples and committed, but not before someone acquired a still-held
snapshot that didn't see the deleter's transaction as committed yet? I
guess that that is different, but it doesn't matter. All that matters is
that, in the end, the index contains entries for all heap tuples visible
to any possible snapshot (though possibly excluding some existing old
snapshots iff we detect broken HOT chains during the build).

> In general, this makes me slightly nervous mainly
> because such a case won't be possible without the parallel option for
> create index, but if you and Robert are okay with it as there is no
> fundamental problem, then we might as well leave it as it is or maybe
> add a comment saying so.

Let me try to explain this another way, in terms of the high-level
intuition that I have about it (Robert can probably skip this part).

GetOldestXmin() returns a value that is inherently a *conservative*
cut-off. In hot standby mode, it's possible for the value it returns to
go backwards from a value previously returned within the same backend.
Even with serial builds, the exact instant that GetOldestXmin() gets
called could vary based on something like the OS scheduling of the
process that runs CREATE INDEX; it could return a different value based
on that alone. It follows that it doesn't matter if parallel CREATE
INDEX participants have slightly different values, because the cut-off
is all about the consistency of the index with what the universe of
possible snapshots could see in the heap, not the consistency of
different parts of the index with each other (the parts produced from
heap tuples read by each participant).

Look at how the pg_visibility module calls GetOldestXmin() to recheck --
it has to call GetOldestXmin() a second time, with a buffer lock held on
a heap page throughout. It does this to conclusively establish that the
visibility map is corrupt (otherwise, it could just be that the cut-off
became stale).

Putting all of this together, it would be safe for the
HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call
GetOldestXmin() again (a bit like pg_visibility does), to avoid having
to index an actually-fully-dead-by-now tuple: we could call
HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to
get HEAPTUPLE_DEAD the second time around. This optimization wouldn't
work out a lot of the time (it would only work out when an old snapshot
went away during the CREATE INDEX), and would add ProcArrayLock traffic,
so we don't do it. But AFAICT it's feasible.
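To make that a bit more concrete, here is a rough, untested sketch of
what such a recheck might look like inside the switch on
HeapTupleSatisfiesVacuum()'s result in IndexBuildHeapRangeScan(). The
variable names are just the ones that function already uses; this is
only meant to show that the optimization would be possible, not
something I'm proposing we actually do:

    case HEAPTUPLE_RECENTLY_DEAD:

        /*
         * Hypothetical recheck: refresh the cut-off, since an old
         * snapshot may have gone away while the build was running.
         * (The buffer lock is already held here, as in the existing
         * code.)
         */
        OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);

        if (HeapTupleSatisfiesVacuum(heapTuple, OldestXmin,
                                     scan->rs_cbuf) == HEAPTUPLE_DEAD)
        {
            /* Now provably dead to every possible snapshot -- skip it */
            indexIt = false;
            tupleIsAlive = false;
            break;
        }

        /* Otherwise, continue with the existing RECENTLY_DEAD handling */
        ...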
> Another point is that the information about broken hot chains
> indexInfo->ii_BrokenHotChain is getting lost. I think you need to
> coordinate this information among backends that participate in
> parallel create index. Test to reproduce the problem is as below:
>
> create table tbrokenchain(c1 int, c2 varchar);
> insert into tbrokenchain values(3, 'aaa');
>
> begin;
> set force_parallel_mode=on;
> update tbrokenchain set c2 = 'bbb' where c1=3;
> create index idx_tbrokenchain on tbrokenchain(c1);
> commit;
>
> Now, check the value of indcheckxmin in pg_index, it should be true,
> but with patch it is false. You can try with patch by not changing
> the value of force_parallel_mode;

Ugh, you're right. That's a real howler. Will fix.
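The fix will probably amount to adding one more flag to the shared
memory state that coordinates the scan, having every participant (the
leader included) report whether it saw a broken HOT chain, and having
the leader OR those reports back into its own indexInfo before
index_build() gets to look at it when it sets indcheckxmin. Very
roughly, with placeholder names ("shared" stands in for whatever struct
the patch already uses for coordination):

    /* One more flag in the shared area used to coordinate the build: */
    bool        brokenhotchain; /* did any participant see a broken HOT chain? */

    /* In each participant, after its portion of the heap scan: */
    if (indexInfo->ii_BrokenHotChain)
        shared->brokenhotchain = true;  /* under the existing shared-state mutex */

    /* In the leader, once all participants have finished scanning: */
    if (shared->brokenhotchain)
        indexInfo->ii_BrokenHotChain = true;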
Note that my stress-testing strategy has had a lot to do with verifying
that a serial build produces relfiles that are physically identical to
those from parallel builds. Obviously that couldn't have caught this,
because it only concerns the state of the pg_index catalog.

> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather. Basically, even with force_parallel_mode, the
> leader is participating in parallel build. I see there is some
> discussion above about both these parameters and still, there is not
> complete agreement on the best way forward. I think we should have
> parallel_leader_participation as that can help in testing if nothing
> else.

I think that you're quite right that parallel_leader_participation
needs to be supported for testing purposes. I had some sympathy for the
idea that we should remove leader participation as a worker from the
patch entirely, but the testing argument seems to clinch it.

I'm fine with killing force_parallel_mode, though, because it will be
possible to force the use of parallelism with the existing
parallel_workers table storage parameter in the next version of the
patch, regardless of how small the table is.

Thanks for the review.
--
Peter Geoghegan