On Fri, 30 Aug 2024 at 03:33, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowle...@gmail.com> writes:
> > [ redesign I/O function APIs ]
> > I had planned to work on this for PG18, but I'd be happy for some
> > assistance if you're willing.
>
> I'm skeptical that such a thing will ever be practical.  To avoid
> breaking un-converted data types, all the call sites would have to
> support both old and new APIs.  To avoid breaking non-core callers,
> all the I/O functions would have to support both old and new APIs.
> That probably adds enough overhead to negate whatever benefit you'd
> get.
Scepticism is certainly good when it comes to such a large API change. I don't want to argue with you, but I'd like to state a few things about why I think you're wrong on this...

So, we currently return cstrings in our output functions. Take jsonb_out() as an example: to build that cstring, we make a *new* StringInfoData on *every call* inside JsonbToCStringWorker(). That gives you 1024 bytes before you need to enlarge it. It's maybe not all bad, as we have some size estimations there to call enlargeStringInfo(), only that's a bit wasteful, as it does a repalloc() which memcpys the 1024 bytes freshly allocated in initStringInfo() before they contain any data.

After jsonb_out() has returned, we have the cstring but we've forgotten the length of the string, so most places will immediately call strlen(), either directly or indirectly via appendStringInfoString(). For larger JSON documents, that'll likely require pulling cachelines back into L1 again. I don't know exactly how modern CPU cacheline eviction works, but if it were as simple as FIFO, then the strlen() would flush all those cachelines only for memcpy() to have to read them back again, for any output string larger than L1.

If we rewrote all of core's output functions to use the new API, then the branch to test the function signature would be perfectly predictable and amount to an extra cmp and jne/je opcode. So I just don't agree with the comment that the overheads negate the benefits. You're probably off by an order of magnitude at the minimum, and for medium/large varlena types, likely 2-3+ orders. Even a simple int4out() requires a palloc()/memcpy. If we were outputting lots of data, e.g. in a COPY operation, the output buffer would seldom need to be enlarged, as it would quickly adjust to the correct size (see the first sketch in the PS below).

For the input functions, the possible gains are extensive too. textin() is a good example: it uses cstring_to_text(), but could be changed to use cstring_to_text_with_len(). Knowing the input string length also opens the door to SIMD. Take int4in() as an example: if pg_strtoint32_safe() knew its input length, there are a bunch of prechecks that could be done with either 64-bit SWAR or with SIMD (see the second sketch in the PS). For example, if you knew you had an 8-char string of decimal digits, then converting that to an int32 is quite cheap. It's impossible to overflow an int32 with 8 decimal digits, so no overflow checks need to be done until there are at least 10 decimal digits.

ca6fde922 seems like a good enough example of the possible gains of SIMD vs byte-at-a-time processing. I saw some queries go 4x faster there, and that was with me trying to keep the JSON document sizes realistic.

Byte-at-a-time is just not enough to saturate RAM speed. Take DDR5, for example: Wikipedia says it has a bandwidth of 32–64 GB/s. Unless we discover room-temperature superconductors, we're not going to see any massive jump in clock speeds any time soon, and with 5 or 6GHz CPUs, there's just no way to get anywhere near that bandwidth by processing byte-at-a-time. For some sort of naive strcpy()-type function, you're going to need at least a cmp and a mov per byte; even if those were latency=1 (which they're not, see [1]), you can only do 2.5 billion of those two per second on a 5GHz processor. I've not tested, but hypothetically (assuming latency=1) that amounts to processing 2.5GB/s, i.e. a long way from DDR5 RAM speed, and that's not taking into account having to increment the pointer to the next byte on each loop.

David

[1] https://www.agner.org/optimize/instruction_tables.pdf
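PS: to make the output-function side a bit more concrete, below is roughly the shape I have in mind. The name int4out_buf() and the convention of handing the function a caller-owned buffer are inventions for illustration here, not a worked-out fmgr API; pg_ltoa() and enlargeStringInfo() exist in core today:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"     /* pg_ltoa() */

/*
 * Hypothetical buffer-based int4out(): append the text form directly
 * to a caller-supplied StringInfo rather than returning a freshly
 * palloc'd cstring whose length the caller must strlen() to recover.
 */
static void
int4out_buf(int32 value, StringInfo buf)
{
    /* worst case: '-' plus 10 digits plus NUL */
    enlargeStringInfo(buf, 12);

    /*
     * pg_ltoa() returns the number of bytes it wrote, so the caller
     * keeps the length for free.
     */
    buf->len += pg_ltoa(value, buf->data + buf->len);
}

A COPY TO loop would resetStringInfo() the same buffer once per row, so after a handful of rows enlargeStringInfo() becomes a predictable untaken branch instead of a palloc() and memcpy() per value.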
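And for the input side, this is the sort of SWAR precheck I mean for pg_strtoint32_safe() once it knows its input length. It's the well-known 8-digits-at-a-time trick (see e.g. the simdjson work); nothing like this exists in core today, and the conversion below assumes little-endian:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* true if all 8 bytes at 's' are ASCII decimal digits */
static inline bool
is_8_digits(const char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));

    /*
     * A digit is 0x30..0x39: every high nibble must be 3, and adding 6
     * to the low nibble must not carry into the high nibble.
     */
    return ((chunk & 0xF0F0F0F0F0F0F0F0ULL) |
            (((chunk + 0x0606060606060606ULL) & 0xF0F0F0F0F0F0F0F0ULL) >> 4))
        == 0x3333333333333333ULL;
}

/* convert exactly 8 ASCII digits to their numeric value */
static inline uint32_t
parse_8_digits(const char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));
    chunk -= 0x3030303030303030ULL;     /* '0'..'9' -> 0..9 per byte */

    /* fold digits pairwise: 2-digit, then 4-digit, then 8-digit values */
    chunk = (chunk * 10 + (chunk >> 8)) & 0x00FF00FF00FF00FFULL;
    chunk = (chunk * 100 + (chunk >> 16)) & 0x0000FFFF0000FFFFULL;
    chunk = (chunk * 10000 + (chunk >> 32)) & 0xFFFFFFFFULL;

    return (uint32_t) chunk;
}

Since 99999999 fits comfortably in an int32, a length-aware int4in() could run is_8_digits()/parse_8_digits() up front and only fall back to overflow-checked byte-at-a-time arithmetic once there are 10 or more digits.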