Hello, I've investigated a crash report of PG-Strom for a few days, then I doubt add_partial_path() can unexpectedly release dominated old partial path but still referenced by other Gather node, and it leads unexpected system crash.
Please check at the gpuscan.c:373 https://github.com/heterodb/pg-strom/blob/master/src/gpuscan.c#L373 The create_gpuscan_path() constructs a partial custom-path, then it is added to the partial_pathlist of the baserel. If both of old and new partial paths have no pathkeys and old-path has larger cost, add_partial_path detaches the old path from the list, then calls pfree() to release old_path itself. On the other hands, the baserel may have GatherPath which references the partial-path on its pathlist. Here is no check whether the old partial- paths are referenced by others, or not. To ensure my assumption, I tried to inject elog() before/after the call of add_partial_path() and just before the pfree(old_path) in add_partial_path(). ---------------------------------------------------------------- dbt3=# explain select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= date '1994-01-01' and l_shipdate < cast(date '1994-01-01' + interval '1 year' as date) and l_discount between 0.08 - 0.01 and 0.08 + 0.01 and l_quantity < 24 limit 1; INFO: GpuScan:389 GATHER(0x28f3c30), SUBPATH(0x28f3f88): {GATHERPATH :pathtype 44 :parent_relids (b 1) :required_outer (b) :parallel_aware false :parallel_safe false :parallel_workers 0 :rows 151810 :startup_cost 1000.00 :total_cost 341760.73 :pathkeys <> :subpath {PATH :pathtype 18 :parent_relids (b 1) :required_outer (b) :parallel_aware true :parallel_safe true :parallel_workers 2 :rows 63254 :startup_cost 0.00 :total_cost 325579.73 :pathkeys <>} :single_copy false :num_workers 2} INFO: add_partial_path:830 old_path(0x28f3f88) is removed WARNING: could not dump unrecognized node type: 2139062143 INFO: GpuScan:401 GATHER(0x28f3c30), SUBPATH(0x28f3f88): {GATHERPATH :pathtype 44 :parent_relids (b 1) :required_outer (b) :parallel_aware false :parallel_safe false :parallel_workers 0 :rows 151810 :startup_cost 1000.00 :total_cost 341760.73 :pathkeys <> :subpath {(HOGE)} :single_copy false :num_workers 2} ---------------------------------------------------------------- At the L389, GatherPath in the baresel->pathlist is healthy. Its subpath (0x28f3f88) is a valid T_Scan path node. Then, gpuscan.c adds a cheaper path-node so add_partial_path() considers the above subpath (0x28f3f88) is dominated by the new custom-path, and release it. So, elog() at L401 says subpath has unrecognized node type: 2139062143 == 0x7f7f7f7f that implies the memory region was already released by pfree(). Reference counter or other mechanism to tack referenced paths may be an idea to avoid unintentional release of path-node. On the other hands, it seems to me the pfree() at add_path / add_partial_path is not a serious memory management because other objects referenced by the path-node are not released here. It is sufficient if we detach dominated path-node from the pathlist / partial_pathlist. How about your opinions? Thanks, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei <kai...@heterodb.com>