There was a previous thread[1], but I think it needs some wider discussion.
I brought up an issue where GCC in combination with FORTIFY_SOURCE[2] causes a perf regression for logical tapes after introducing LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by default on ubuntu. I have not observed the problem with clang. There is no reason why the change should trigger the regression, but it does. The slowdown is due to GCC switching to an inlined version of memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems to have little if anything to do with that. GCC's Object Size Checking[4] doc says: "There are built-in functions added for many common string operation functions, e.g., for memcpy __builtin___memcpy_chk built-in is provided. This built-in has an additional last argument, which is the number of bytes remaining in the object the dest argument points to or (size_t) -1 if the size is not known. The built-in functions are optimized into the normal string functions like memcpy if the last argument is (size_t) -1 or if it is known at compile time that the destination object will not be overflowed..." In other words, if GCC knows the size of the object it tries to either verify at compile time that it will never overflow, or it inserts a runtime check. But if it doesn't know the size of the object, there's nothing it can do so it just uses memcpy() like normal. Knowing the destination buffer size at compile time would be impossible (before or after my change) because palloc() doesn't have the alloc_size attribute[5] specified. Even if it is specified (which I tried), and if the compiler was smart enough (which it's not), it could still only come up with a maximum size because the offset changes at runtime. Regardless, I tried printing out the results of: __builtin_object_size (lt->buffer + lt->pos, 0) and the result is always -1 (unknown). I have attached a workaround patch which restores the performance, and it's isolatted to logtape.c, but it's ugly (and not a little bit). The questions are: 1. Is my analysis correct? 2. What is the scale of this problem? What about other platforms or compilers? Are there other cases in PostgreSQL that might suffer from the use of FORTIFY_SOURCE? 3. Even if this is the compiler's fault, should we still fix it? 4. Does the attached fix have any dangers of regressing on other compilers/platforms? 5. Does anyone have a suggestion for a better fix? Regards, Jeff Davis [1] https://postgr.es/m/91ca648cfd1f99bf07981487a7d81a1ec926caad.ca...@j-davis.com [2] https://fedoraproject.org/wiki/Security_Features?rd=Security/Features#Compile_Time_Buffer_Checks_.28FORTIFY_SOURCE.29 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=24d85952 [4] https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html [5] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index 48ae0de305b..48f32cbe86f 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -153,6 +153,19 @@ typedef struct LogicalTape int nbytes; /* total # of valid bytes in buffer */ } LogicalTape; +/* + * It would be more natural to simply use an ordinary pointer in + * LogicalTapeSet, and allocate many elements for it. Unfortunately, some + * compilers (notably GCC with FORTIFY_SOURCE specified) generate signficantly + * slower code when that's done, and using this dummy structure is a + * workaround for that. + */ +typedef struct LogicalTapeDummy +{ + int dummy; + LogicalTape tape_array[FLEXIBLE_ARRAY_MEMBER]; +} LogicalTapeDummy; + /* * This data structure represents a set of related "logical tapes" sharing * space in a single underlying file. (But that "file" may be multiple files @@ -192,7 +205,7 @@ struct LogicalTapeSet /* The array of logical tapes. */ int nTapes; /* # of logical tapes in set */ - LogicalTape *tapes; /* has nTapes nentries */ + LogicalTapeDummy *tapes; /* has nTapes nentries */ }; static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer); @@ -479,7 +492,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared, BufFile *file; int64 filesize; - lt = <s->tapes[i]; + lt = <s->tapes->tape_array[i]; pg_itoa(i, filename); file = BufFileOpenShared(fileset, filename); @@ -616,10 +629,12 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset, lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long)); lts->nFreeBlocks = 0; lts->nTapes = ntapes; - lts->tapes = (LogicalTape *) palloc(ntapes * sizeof(LogicalTape)); + lts->tapes = (LogicalTapeDummy *) palloc( + offsetof(LogicalTapeDummy, tape_array) + + ntapes * sizeof(LogicalTape)); for (i = 0; i < ntapes; i++) - ltsInitTape(<s->tapes[i]); + ltsInitTape(<s->tapes->tape_array[i]); /* * Create temp BufFile storage as required. @@ -656,7 +671,7 @@ LogicalTapeSetClose(LogicalTapeSet *lts) BufFileClose(lts->pfile); for (i = 0; i < lts->nTapes; i++) { - lt = <s->tapes[i]; + lt = <s->tapes->tape_array[i]; if (lt->buffer) pfree(lt->buffer); } @@ -693,7 +708,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, size_t nthistime; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(lt->writing); Assert(lt->offsetBlockNumber == 0L); @@ -779,7 +794,7 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size) LogicalTape *lt; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; /* * Round and cap buffer_size if needed. @@ -857,7 +872,7 @@ LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum) LogicalTape *lt; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(!lt->writing && !lt->frozen); lt->writing = true; @@ -886,7 +901,7 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum, size_t nthistime; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(!lt->writing); if (lt->buffer == NULL) @@ -940,7 +955,7 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share) LogicalTape *lt; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(lt->writing); Assert(lt->offsetBlockNumber == 0L); @@ -1017,11 +1032,13 @@ LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional) lts->nTapes += nAdditional; - lts->tapes = (LogicalTape *) repalloc( - lts->tapes, lts->nTapes * sizeof(LogicalTape)); + lts->tapes = (LogicalTapeDummy *) repalloc( + lts->tapes, + offsetof(LogicalTapeDummy, tape_array) + + lts->nTapes * sizeof(LogicalTape)); for (i = nTapesOrig; i < lts->nTapes; i++) - ltsInitTape(<s->tapes[i]); + ltsInitTape(<s->tapes->tape_array[i]); } /* @@ -1044,7 +1061,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size) size_t seekpos = 0; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(lt->frozen); Assert(lt->buffer_size == BLCKSZ); @@ -1118,7 +1135,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum, LogicalTape *lt; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; Assert(lt->frozen); Assert(offset >= 0 && offset <= TapeBlockPayloadSize); Assert(lt->buffer_size == BLCKSZ); @@ -1152,7 +1169,7 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum, LogicalTape *lt; Assert(tapenum >= 0 && tapenum < lts->nTapes); - lt = <s->tapes[tapenum]; + lt = <s->tapes->tape_array[tapenum]; if (lt->buffer == NULL) ltsInitReadBuffer(lts, lt);