I usually see printtup taking a noticeable share of the perf report. Taking "SELECT * FROM pg_class" as an example, we can see:

  85.65%  3.25%  postgres  postgres  [.] printtup

The high-level design of printtup is:

1. It uses a pre-allocated StringInfo, DR_printtup.buf, to store the data for each tuple.
2. For each datum in the tuple, it calls the type-specific out function and gets a cstring.
3. After getting the cstring, it figures out the "len" and adds both the len and the data to DR_printtup.buf.
4. After all the datums are handled, socket_putmessage copies them into PqSendBuffer.
5. When the usage of PqSendBuffer reaches PqSendBufferSize, the send syscall is used to send the data to the client (copying it from user space to kernel space again).

Part of the slowness is caused by memcpy, strlen and the palloc in the out function:

  8.35%  8.35%  postgres  libc.so.6  [.] __strlen_avx2
  4.27%  0.00%  postgres  libc.so.6  [.] __memcpy_avx_unaligned_erms
  3.93%  3.93%  postgres  postgres   [.] palloc (part of them caused by out function)
  5.70%  5.70%  postgres  postgres   [.] AllocSetAlloc (part of them caused by printtup.)

My high-level proposal is to define a type-specific print function like:

  oidprint(Datum datum, StringInfo buf)
  textprint(Datum datum, StringInfo buf)

This function should append both the data and the len to buf directly.

For the oidprint case, we can avoid:

- the dedicated palloc in the oid out function.
- the memcpy from that memory into DR_printtup.buf.

For the textprint case, we can avoid:

- strlen, since we can figure out the length from varlena.vl_len.

int2/4/8/timestamp/date/time are similar to oid, and numeric and varchar are similar to text. This covers almost all of the commonly used types.

Hard-coding the relationship between the commonly used types and the {type}print function OIDs does not look great, but adding a new attribute to pg_type looks too aggressive. Anyway, that is the next topic to talk about.

If a type's print function is not defined, we can still use the out function (and PrinttupAttrInfo caches FmgrInfo rather than FunctionCallInfo, so there is some optimization in this step as well).

This proposal covers steps 2 and 3.
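
To make the idea concrete, here is a minimal standalone sketch of the two print functions. It is not backend code: Buf is a hypothetical stand-in for StringInfo, the _sketch names are made up for illustration, and the text value is passed as a pointer plus a length that is assumed to come from the varlena header. It also ignores detoasting and encoding conversion, which a real implementation would have to handle. The only point is that the 4-byte length and the bytes go into the buffer directly.

```c
/* Sketch only: Buf stands in for StringInfo; nothing here is backend code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Buf
{
    unsigned char *data;
    size_t         len;   /* bytes used */
    size_t         cap;   /* bytes allocated */
} Buf;

/* Ensure at least `extra` more bytes fit in the buffer. */
static void
buf_reserve(Buf *b, size_t extra)
{
    size_t cap = b->cap ? b->cap : 64;

    if (b->len + extra <= b->cap)
        return;
    while (cap < b->len + extra)
        cap *= 2;
    b->data = realloc(b->data, cap);
    assert(b->data != NULL);
    b->cap = cap;
}

/* Write a 4-byte length in network byte order at offset `pos`. */
static void
put_len(Buf *b, size_t pos, uint32_t n)
{
    b->data[pos]     = (unsigned char) (n >> 24);
    b->data[pos + 1] = (unsigned char) (n >> 16);
    b->data[pos + 2] = (unsigned char) (n >> 8);
    b->data[pos + 3] = (unsigned char) n;
}

/* Format the oid in place: no temporary allocation, no copy, no strlen. */
static void
oidprint_sketch(uint32_t oid, Buf *b)
{
    size_t pos = b->len;
    int    n;

    buf_reserve(b, 4 + 11);     /* length word + 10 digits + NUL */
    n = snprintf((char *) b->data + pos + 4, 11, "%u", (unsigned) oid);
    put_len(b, pos, (uint32_t) n);
    b->len = pos + 4 + (size_t) n;
}

/* The length is already known, so strlen is not needed. */
static void
textprint_sketch(const char *payload, uint32_t payload_len, Buf *b)
{
    size_t pos = b->len;

    buf_reserve(b, 4 + (size_t) payload_len);
    put_len(b, pos, payload_len);
    memcpy(b->data + pos + 4, payload, payload_len);
    b->len = pos + 4 + payload_len;
}
```

In the text case one memcpy remains (from the tuple into the buffer), but the intermediate cstring and the strlen over it are gone.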

If we want to be more aggressive, we can let the xxxprint functions print to PqSendBuffer directly, but this is more complex and needs some infrastructure changes. The memcpy in step 4 accounts for "1.27% __memcpy_avx_unaligned_erms" in my case above.

What do you think?

-- 
Best Regards
Andy Fan