On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro > <thomas.mu...@enterprisedb.com> wrote: > > I have also pushed a new WIP version of the lower level undo log > > storage layer patch set to a public branch[1]. I'll leave the earlier > > branch[2] there because the record-level patch posted by Dilip depends > > on it for now. > > > > I have started reading the patch and have a few assorted comments > which are mentioned below. I have been involved in the high-level > design of this module and I have also shared given some suggestions > during development, but this is mainly Thomas's work with some help > from Dilip. It would be good if other members of the community also > review the design or participate in the discussion. > > Comments > ------------------ >
Some more comments/questions on the design level choices you have made in this patch and some general comments. 1. To allocate an undo log (UndoLogAllocate()), it seems first we are creating the shared memory state for an undo log, write a WAL for it, create an actual file and segment in it and write a separate WAL for it. Now imagine the system crashed after creating a shared memory state and before actually allocating an undo log segment, then it is quite possible that after recovery we will block multiple slots for undo logs without having actual undo logs for them. Apart from that writing separate WAL for them doesn't appear to be the best way to deal with it considering that we also need to write a third WAL to attach an undo log. Now, IIUC, one advantage of arranging the things this way is that we avoid dropping the tablespaces when a particular undo log exists in it. I understand that this design kind of works, but I think we should try to think of some alternatives here. You might have already thought of making it work similar to how the interaction for regular tables or temp_tablespaces works with dropping the tablespaces but decided to do something different here. Can you explain why you have made a different design choice here? 2. extend_undo_log() { .. + /* + * Flush the parent dir so that the directory metadata survives a crash + * after this point. + */ + UndoLogDirectory(log->meta.tablespace, dir); + fsync_fname(dir, true); + + /* + * If we're not in recovery, we need to WAL-log the creation of the new + * file(s). We do that after the above filesystem modifications, in + * violation of the data-before-WAL rule as exempted by + * src/backend/access/transam/README. This means that it's possible for + * us to crash having made some or all of the filesystem changes but + * before WAL logging, but in that case we'll eventually try to create the + * same segment(s) again, which is tolerated. + */ + if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; .. } I don't understand this WAL logging action. If the crash happens before or during syncing the file, then we anyway don't have WAL to replay. If it happens after WAL writing, then anyway we are sure that the extended undo log segment must be there. Can you explain how this works? 3. +static void +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) { .. } What will happen if the transaction creating undo log segment rolls back? Do we want to have pendingDeletes stuff as we have for normal relation files? This might also help in clearing the shared memory state (undo log slots) if any. 4. +choose_undo_tablespace(bool force_detach, Oid *tablespace) { .. /* + * Take the tablespace create/drop lock while we look the name up. + * This prevents the tablespace from being dropped while we're trying + * to resolve the name, or while the called is trying to create an + * undo log in it. The caller will have to release this lock. + */ + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); .. This appears quite expensive, for selecting an undo log to attach, we might need to wait for an unrelated tablespace create/drop. Have you considered any other ideas to prevent this? How other callers of get_tablespace_oid prevent it from being dropped? If we don't find any better solution, then I think at the very least we should start a separate thread to know the opinion of others on this matter. I think this is somewhat related to point-1. 5. +static inline Oid +UndoRecPtrGetTablespace(UndoRecPtr urp) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo (urp); + UndoLogTableEntry *entry; + + /* + * Fast path, for undo logs we've seen before. This is safe because + * tablespaces are constant for the lifetime of an undo log number. + */ + entry = undologtable_lookup (undologtable_cache, logno); + if (likely(entry)) + return entry->tablespace; + + /* + * Slow path: force cache entry to be created. Raises an error if the + * undo log has been entirely discarded, or hasn't been created yet. That + * is appropriate here, because this interface is designed for accessing + * undo pages via bufmgr, and we should never be trying to access undo + * pages that have been discarded. + */ + UndoLogGet(logno, false); It seems UndoLogGet() probes hash table first, so what is the need for doing it in the caller and if you think it is better to perform in the caller, then maybe we should avoid doing it inside UndoLogGet()->get_undo_log()->undologtable_lookup(). 6. +get_undo_log(UndoLogNumber logno, bool locked) { .. + /* + * If we didn't find it, then it must already have been entirely + * discarded. We create a negative cache entry so that we can answer + * this question quickly next time. + * + * TODO: We could track the lowest known undo log number, to reduce + * the negative cache entry bloat. + */ + if (result == NULL) + { + /* + * Sanity check: the caller should not be asking about undo logs + * that have never existed. + */ + if (logno >= shared->next_logno) + elog(ERROR, "undo log %u hasn't been created yet", logno); + entry = undologtable_insert (undologtable_cache, logno, &found); + entry->number = logno; + entry->control = NULL; + entry->tablespace = 0; + } .. } Are you planning to take care of this TODO? In any case, do we have any mechanism to clear this bloat or will it stay till the end of the session? If it is later, then I think it is important to take care of TODO. 7. +void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno); +/* Redo interface. */ +extern void undolog_redo(XLogReaderState *record); You might want to add an extra line before /* Redo interface. */ following what has been done earlier in this file. 8. + * XXX For now an xl_undolog_meta object is filled in, in case it turns out + * to be necessary to write it into the WAL record (like FPI, this must be + * logged once for each undo log after each checkpoint). I think this should + * be moved out of this interface and done differently -- to review. + */ +UndoRecPtr +UndoLogAllocate(size_t size, UndoPersistence persistence) This function doesn't appear to be filling xl_undolog_meta. Am I missing something, if not, then this comments needs to be changed? 9. +static bool +choose_undo_tablespace(bool force_detach, Oid *tablespace) +{ + char *rawname; + List *namelist; + bool need_to_unlock; + int length; + int i; + + /* We need a modifiable copy of string. */ + rawname = pstrdup(undo_tablespaces); I don't see the usage of rawname outside this function, isn't it better to free it? I understand that this function won't be called frequently enough to matter, but still there is some theoretical danger if a user continuously changes undo_tablespaces. 10. +attach_undo_log(UndoPersistence persistence, Oid tablespace) { .. + /* + * For now we have a simple linked list of unattached undo logs for each + * persistence level. We'll grovel though it to find something for the Typo. /though/through 11. +attach_undo_log(UndoPersistence persistence, Oid tablespace) { .. + /* WAL-log the creation of this new undo log. */ + { + xl_undolog_create xlrec; + + xlrec.logno = logno; + xlrec.tablespace = log- >meta.tablespace; + xlrec.persistence = log->meta.persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE); + } .. } Do we need to WAL log this for temporary/unlogged persistence level? 12. +choose_undo_tablespace(bool force_detach, Oid *tablespace) { .. + oid = get_tablespace_oid(name, true); .. Do we need to check permissions to see if the current user is allowed to create in this tablespace? 13. +UndoLogAllocate(size_t size, UndoPersistence persistence) { .. + log->meta.prevlogno = prevlogno; Is it okay to update meta information without lock or we should do it few lines down after taking mutex lock? If it is okay, then it is better to write a comment for the same? 14. +static void +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) { .. + /* Flush the contents of the file to disk. */ + if (pg_fsync(fd) != 0) + elog(ERROR, "cannot fsync file \"%s\": %m", path); .. } You might want to have a wait event for this as we do have at other places where we perform fsync. 15. +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) { .. + if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.end = end; + + XLogBeginInsert(); + XLogRegisterData ((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); + XLogFlush(ptr); + } .. } Do we need it for temporary/unlogged persistence level? 16. +static void +undolog_xlog_create(XLogReaderState *record) +{ + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); + UndoLogControl *log; + UndoLogSharedData *shared = MyUndoLogState.shared; + + /* Create meta-data space in shared memory. */ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + /* TODO: assert that it doesn't exist already? */ + log = allocate_undo_log(); + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); Do we need to acquire UndoLogLock during replay? What else can be going in concurrent to this which can create a problem? 17. UndoLogAllocate() { .. + /* + * While we have the lock, check if we have been forcibly detached by + * DROP TABLESPACE. That can only happen between transactions (see + * DropUndoLogsInsTablespace()). + */ .. } Function name in above comment is wrong. 18. + { + {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the tablespace(s) to use for undo logs."), + NULL, + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &undo_tablespaces, + "", + check_undo_tablespaces, assign_undo_tablespaces, NULL + }, It seems you need to update variable_is_guc_list_quote for this variable. Till now, I have mainly reviewed undo log allocation part. This is a big patch and can take much more time to complete the review. I will review the other parts of the patch later. I have changed the status of this CF entry as "Waiting on Author", feel free to change it once you think all the comments are addressed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com