Hi Andres, On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <and...@anarazel.de> wrote: > 1) How come partition routing is done outside of ExecInsert()? > > case CMD_INSERT: > /* Prepare for tuple routing if needed. */ > if (proute) > slot = ExecPrepareTupleRouting(node, estate, proute, > resultRelInfo, slot); > slot = ExecInsert(node, slot, planSlot, > estate, node->canSetTag); > /* Revert ExecPrepareTupleRouting's state change. */ > if (proute) > estate->es_result_relation_info = resultRelInfo; > break; > > That already seems like a layering violation,
The decision to move partition routing out of ExecInsert() came about when we encountered a bug [1] whereby ExecInsert() would fail to reset estate->es_result_relation_info back to the root table if it had to take an abnormal path out (early return), of which there are quite a few instances. The first solution I came up with was to add a goto label for the code to reset estate->es_result_relation_info and jump to it from the various places that do an early return, which was complained about as reducing readability. So, the solution we eventually settled on in 6666ee49f was to perform ResultRelInfos switching at a higher level. > but it's made worse by > ExecUpdate() having its partition handling solely inside - including another > call to ExecInsert(), including the surrounding partition setup code. > > And even worse, after all that, ExecInsert() still contains partitioning > code. AFAIK, it's only to check the partition constraint when necessary. Partition routing complexity is totally outside, but based on what you write in point 4 below there's bit more... > It seems to me that if we just moved the ExecPrepareTupleRouting() into > ExecInsert(), we could remove the duplication. I agree that there's duplication here. Given what I wrote above, I can think of doing this: move all of ExecInsert()'s code into ExecInsertInternal() and make the former instead look like this: static TupleTableSlot * ExecInsert(ModifyTableState *mtstate, TupleTableSlot *slot, TupleTableSlot *planSlot, EState *estate, bool canSetTag) { PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing; ResultRelInfo *resultRelInfo = estate->es_result_relation_info; /* Prepare for tuple routing if needed. */ if (proute) slot = ExecPrepareTupleRouting(mtstate, estate, proute, resultRelInfo, slot); slot = ExecInsertInternal(mtstate, slot, planSlot, estate, mtstate->canSetTag); /* Revert ExecPrepareTupleRouting's state change. */ if (proute) estate->es_result_relation_info = resultRelInfo; return slot; } > 2) The contents of the > /* > * If a partition check failed, try to move the row into the right > * partition. > */ > if (partition_constraint_failed) > > block ought to be moved to a separate function (maybe > ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already > complicated enough without dealing with the partition move. I tend to agree with this. Adding Amit Khandekar in case he wants to chime in about this. > 3) How come we reset estate->es_result_relation_info after partition > routing, but not the mtstate wide changes by > ExecPrepareTupleRouting()? Note that its comment says: > > * Caller must revert the estate changes after executing the insertion! > * In mtstate, transition capture changes may also need to be reverted. > > ExecUpdate() contains > > /* > * Updates set the transition capture map only when a new subplan > * is chosen. But for inserts, it is set for each row. So after > * INSERT, we need to revert back to the map created for UPDATE; > * otherwise the next UPDATE will incorrectly use the one created > * for INSERT. So first save the one created for UPDATE. > */ > if (mtstate->mt_transition_capture) > saved_tcs_map = mtstate->mt_transition_capture->tcs_map; > > but as I read the code, that's not really true? It's > ExecPrepareTupleRouting() that does so, and that's called directly in > ExecUpdate(). Calling ExecPrepareTupleRouting() is considered a part of a given INSERT operation, so anything it does is to facilitate the INSERT. In this case, which map to assign to tcs_map can only be determined after a partition is chosen and determining the partition (routing) is a job of ExecPrepareTupleRouting(). Perhaps, we need to update the comment here a bit. > 4) > /* > * If this insert is the result of a partition key update that moved the > * tuple to a new partition, put this row into the transition NEW TABLE, > * if there is one. We need to do this separately for DELETE and INSERT > * because they happen on different tables. > */ > ar_insert_trig_tcs = mtstate->mt_transition_capture; > if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture > && mtstate->mt_transition_capture->tcs_update_new_table) > { > ExecARUpdateTriggers(estate, resultRelInfo, NULL, > NULL, > slot, > NULL, > mtstate->mt_transition_capture); > > /* > * We've already captured the NEW TABLE row, so make sure any AR > * INSERT trigger fired below doesn't capture it again. > */ > ar_insert_trig_tcs = NULL; > } > > /* AFTER ROW INSERT Triggers */ > ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes, > ar_insert_trig_tcs); > > Besides not using the just defined ar_insert_trig_tcs and instead > repeatedly referring to mtstate->mt_transition_capture, wouldn't this be > a easier to understand if the were an if/else, instead of resetting > ar_insert_trig_tcs? If the block were > > /* > * triggers behave differently depending on this being a delete as > * part of a partion move, or a deletion proper. > if (mtstate->operation == CMD_UPDATE) > { > /* > * If this insert is the result of a partition key update that moved > the > * tuple to a new partition, put this row into the transition NEW > TABLE, > * if there is one. We need to do this separately for DELETE and > INSERT > * because they happen on different tables. > */ > ExecARUpdateTriggers(estate, resultRelInfo, NULL, > NULL, > slot, > NULL, > mtstate->mt_transition_capture); > > /* > * But we do want to fire plain per-row INSERT triggers on the > * new table. By not passing in transition_capture we prevent > * .... > */ > ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes, > NULL); > } > else > { > /* AFTER ROW INSERT Triggers */ > ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes, > ar_insert_trig_tcs); > } Maybe you meant to use mtstate->mt_transition_capture instead of ar_insert_trig_tcs in the else block. We don't need ar_insert_trig_tcs at all. > it seems like it'd be quite a bit clearer (although I do think the > comments also need a fair bit of polishing independent of this proposed > change). Fwiw, I agree with your proposed restructuring, although I'd let Amit Kh chime in as he'd be more familiar with this code. I wasn't aware of this partitioning-related bit being present in ExecInsert(). Would you like me to write a patch for some or all items? Thanks, Amit [1] https://www.postgresql.org/message-id/flat/0473bf5c-57b1-f1f7-3d58-455c2230bc5f%40lab.ntt.co.jp