On Fri, 30 Aug 2024 at 03:33, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowle...@gmail.com> writes:
> > [ redesign I/O function APIs ]
> > I had planned to work on this for PG18, but I'd be happy for some
> > assistance if you're willing.
>
> I'm skeptical that such a thing will ever be practical.  To avoid
> breaking un-converted data types, all the call sites would have to
> support both old and new APIs.  To avoid breaking non-core callers,
> all the I/O functions would have to support both old and new APIs.
> That probably adds enough overhead to negate whatever benefit you'd
> get.
Scepticism is certainly good when it comes to such a large API change. I don't want to argue with you, but I'd like to state a few things about why I think you're wrong on this...

So, we currently return cstrings in our output functions. Take jsonb_out() as an example: to build that cstring, we make a *new* StringInfoData on *every call* inside JsonbToCStringWorker(). That gives you 1024 bytes before you need to enlarge it. It's maybe not all bad, as we have some size estimations there to call enlargeStringInfo(), only that's a bit wasteful, as it does a repalloc() which memcpys the 1024 bytes freshly allocated in initStringInfo() before they contain any data.

After jsonb_out() has returned, we have the cstring but we've forgotten the length of the string, so most places will immediately call strlen(), either directly or indirectly via appendStringInfoString(). For larger JSON documents, that'll likely require pulling cachelines back into L1 again. I don't know exactly how modern CPU cacheline eviction works, but if it were as simple as FIFO, then the strlen() would flush all those cachelines only for memcpy() to have to read them back again, for any output string larger than L1.

If we rewrote all of core's output functions to use the new API, then the branch to test the function signature would be perfectly predictable and amount to an extra cmp and jne/je opcode. So I just don't agree with the comment that the overheads negate the benefits. You're probably off by an order of magnitude at the minimum, and for medium/large varlena types, likely 2-3+ orders. Even a simple int4out() requires a palloc()/memcpy. If we were outputting lots of data, e.g. in a COPY operation, the output buffer would seldom need to be enlarged, as it would quickly adjust to the correct size (see the first sketch in the PS below).

For the input functions, the possible gains are extensive too. textin() is a good example: it uses cstring_to_text(), but could be changed to use cstring_to_text_with_len(). Knowing the input string length also opens the door to SIMD. Take int4in() as an example: if pg_strtoint32_safe() knew its input length, there are a bunch of prechecks that could be done with either 64-bit SWAR or with SIMD (see the second sketch in the PS). For example, if you knew you had an 8-char string of decimal digits, then converting that to an int32 is quite cheap. It's impossible to overflow an int32 with 8 decimal digits, so no overflow checks need to be done until there are at least 10 decimal digits.

ca6fde922 seems like a good enough example of the possible gains of SIMD vs byte-at-a-time processing. I saw some queries go 4x faster there, and that was with me trying to keep the JSON document sizes realistic.

Byte-at-a-time is just not enough to saturate RAM speed. Take DDR5, for example: Wikipedia says it has a bandwidth of 32–64 GB/s. Unless we discover room-temperature superconductors, we're not going to see any massive jump in clock speeds any time soon, and with 5 or 6GHz CPUs, there's just no way to get anywhere near that bandwidth by processing byte-at-a-time. For some sort of naive strcpy()-type function, you're going to need at least a cmp and a mov per byte; even if those were latency=1 (which they're not, see [1]), you can only do 2.5 billion of those two per second on a 5GHz processor. I've not tested, but hypothetically (assuming latency=1) that amounts to processing 2.5GB/s, i.e. a long way from DDR5 RAM speed, and that's not taking into account having to increment the pointer to the next byte on each loop.

David

[1] https://www.agner.org/optimize/instruction_tables.pdf
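PS: to make the output-function side a bit more concrete, below is roughly the shape I have in mind. The name int4out_buf() and the convention of handing the function a caller-owned buffer are inventions for illustration here, not a worked-out fmgr API; pg_ltoa() and enlargeStringInfo() exist in core today:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"     /* pg_ltoa() */

/*
 * Hypothetical buffer-based int4out(): append the text form directly
 * to a caller-supplied StringInfo rather than returning a freshly
 * palloc'd cstring whose length the caller must strlen() to recover.
 */
static void
int4out_buf(int32 value, StringInfo buf)
{
    /* worst case: '-' plus 10 digits plus NUL */
    enlargeStringInfo(buf, 12);

    /*
     * pg_ltoa() returns the number of bytes it wrote, so the caller
     * keeps the length for free.
     */
    buf->len += pg_ltoa(value, buf->data + buf->len);
}

A COPY TO loop would resetStringInfo() the same buffer once per row, so after a handful of rows enlargeStringInfo() becomes a predictable untaken branch instead of a palloc() and memcpy() per value.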
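And for the input side, this is the sort of SWAR precheck I mean for pg_strtoint32_safe() once it knows its input length. It's the well-known 8-digits-at-a-time trick (see e.g. the simdjson work); nothing like this exists in core today, and the conversion below assumes little-endian:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* true if all 8 bytes at 's' are ASCII decimal digits */
static inline bool
is_8_digits(const char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));

    /*
     * A digit is 0x30..0x39: every high nibble must be 3, and adding 6
     * to the low nibble must not carry into the high nibble.
     */
    return ((chunk & 0xF0F0F0F0F0F0F0F0ULL) |
            (((chunk + 0x0606060606060606ULL) & 0xF0F0F0F0F0F0F0F0ULL) >> 4))
        == 0x3333333333333333ULL;
}

/* convert exactly 8 ASCII digits to their numeric value */
static inline uint32_t
parse_8_digits(const char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));
    chunk -= 0x3030303030303030ULL;     /* '0'..'9' -> 0..9 per byte */

    /* fold digits pairwise: 2-digit, then 4-digit, then 8-digit values */
    chunk = (chunk * 10 + (chunk >> 8)) & 0x00FF00FF00FF00FFULL;
    chunk = (chunk * 100 + (chunk >> 16)) & 0x0000FFFF0000FFFFULL;
    chunk = (chunk * 10000 + (chunk >> 32)) & 0xFFFFFFFFULL;

    return (uint32_t) chunk;
}

Since 99999999 fits comfortably in an int32, a length-aware int4in() could run is_8_digits()/parse_8_digits() up front and only fall back to overflow-checked byte-at-a-time arithmetic once there are 10 or more digits.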