On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > On 11/21/20 2:55 AM, James Coleman wrote: > > Over on the -bugs list we had a report [1] of a seg fault with > > incremental sort. The short of the investigation there was that a > > subplan wasn't being serialized since it wasn't parallel safe, and > > that attempting to initialize that subplan resulted in the seg fault. > > Tom pushed a change to raise an ERROR when this occurs so that at > > least we don't crash the server. > > > > There was some small amount of discussion about this on a thread about > > a somewhat similar issue [2] (where volatile expressions were the > > issue), but that thread resulted in a committed patch, and beyond that > > it seems that this issue deserves its own discussion. > > > > I've been digging further into the problem, and my first question was > > "why doesn't this result in a similar error but with a full sort when > > incremental sort is disabled?" After all, we have code to consider a > > gather merge + full sort on the cheapest partial path in > > generate_useful_gather_paths(), and the plan that I get on Luis's test > > case with incremental sort disabled has an full sort above a gather, > > which presumably it'd be cheaper to push down into the parallel plan > > (using gather merge instead of gather). > > > > It turns out that there's a separate bug here: I'm guessing that in > > the original commit of generate_useful_gather_paths() we had a > > copy/paste thinko in this code: > > > > /* > > * If the path has no ordering at all, then we can't use either > > * incremental sort or rely on implicit sorting with a gather > > * merge. > > */ > > if (subpath->pathkeys == NIL) > > continue; > > > > The comment claims that we should skip subpaths that aren't sorted at > > all since we can't possibly use a gather merge directly or with an > > incremental sort applied first. It's true that we can't do either of > > those things unless the subpath is already at least partially sorted. > > But subsequently we attempt to construct a gather merge path on top of > > a full sort (for the cheapest path), and there's no reason to limit > > that to partially sorted paths (indeed in the presence of incremental > > sort it seems unlikely that that would be the cheapest path). > > > > We still don't want to build incremental sort paths if the subpath has > > no pathkeys, but that will imply presorted_keys = 0, which we already > > check. > > > > So 0001 removes that branch entirely. And, as expected, we now get a > > gather merge + full sort the resulting error about subplan not being > > initialized. > > > > Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks > for spotting it! The 0001 patch seems fine to me. > > > With that oddity out of the way, I started looking at the actually > > reported problem, and nominally issue is that we have a correlated > > subquery (in the final target list and the sort clause), and > > correlated subqueries (not sure why exactly in this case; see [3]) > > aren't considered parallel-safe. > > > > As to why that's happening: > > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that > > expression, and arguably correctly so in the general case since one > > might (though we don't now) use this same kind of logic for > > non-parallel plans. > > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful > > to sort on that expression and sees it as safe in part due to the (1). > > 3. We create a sort node that includes that expression in its output. > > That seems pretty much the same (in terms of safety given generalized > > input paths/plans, not in terms of what Robert brought up in [4]) as > > what would happen in the prepare_sort_from_pathkeys call in > > create_gather_merge_plan. > > > > So to fix this problem I've added the option to find_em_expr_for_rel > > to restrict to parallel-safe expressions in 0002. > > > > OK, that seems like a valid fix. I wonder if we can have EC with members > mixing parallel-safe and parallel-unsafe expressions? I guess no, but > it's more a feeling than a clear reasoning and so I wonder what would > happen in such cases.
An immediately contrived case would be something like "t.x = unsafe(t.y)". As to whether that can actually result in us wanting to sort by the safe member before we can execute the unsafe member (e.g., in a filter), I'm less sure, but it seems at least plausible to me (or I don't see anything inherently preventing it). > > Given point 3 above (and the discussion with Robert earlier) above it > > seems to me that there's a bigger problem here that just hasn't > > happened to have been exposed yet: in particular the > > prepare_sort_from_pathkeys call in create_gather_merge_plan seems > > suspect to me. But it's also possible other usages of > > prepare_sort_from_pathkeys could be suspect given other seemingly > > unrelated changed to path generation, since nothing other than > > implicit knowledge about the conventions here prevents us from > > introducing paths generating sorts with expressions not in the target > > list (in fact a part of the issue here IMO is that non-var expression > > pathkeys are added to the target list after path generation and > > selection is completed). > > > > Alternatively we could "fix" this by being even more conservative in > > find_em_expr_usable_for_sorting_rel and disregard any expressions not > > currently found in the target list. But that seems unsatisfactory to > > me since, given we know that there are cases where > > prepare_sort_from_paths is explicitly intended to add pathkey > > expressions to the target list, we'd be excluding possible useful > > cases (and indeed we already have, albeit contrived, test cases that > > fail if that change is made). > > > > There exists a very similar path in the postgres fdw code (in fact the > > function name is the same: get_useful_pathkeys_for_relation) since > > find_em_expr_for_rel is much simpler (it doesn't do many safety checks > > at all like we do in find_em_expr_usable_for_sorting_rel), but I think > > that's safe (though I haven't thought about it a ton) because I > > believe those are sort keys we're going to send to the foreign server > > anyway, as opposed to sorting ourselves (but I haven't verified that > > that's the case and that we never create sort nodes there). > > > > Yeah. I think the various FDW-related restrictions may easily make that > safe, for the reasons you already mentioned. But also because IIRC FDWs > in general are not parallel-safe, so there's no risk of running foreign > scans under Gather [Merge]. There are some comments (IIRC in the parallel docs) about FDWs being able to be marked as parallel-safe, but I'm not sure if that facility is in play. Either way, it's quite a different case. > Thanks for the patches, I'll take a closer look next week. The 0002 > patch is clearly something we need to backpatch, not sure about 0001 as > it essentially enables new behavior in some cases (Sort on unsorted > paths below Gather Merge). Thanks for taking a look. I had the same question re: backporting. On the one hand it will change behavior (this is always a bit of a gray area since in one sense bugs change behavior), but IMO it's not a new feature, because the code clearly intended to have that feature in the first place (it creates gather merges on top of a full sort). So I'd lean towards considering it a bug fix, but I'm also not going to make that a hill to die on. One semi-related question: do you think we should add a comment to prepare_sort_from_pathkeys explaining that modifying it may mean find_em_expr_for_rel needs to be updated also? James