On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschm...@nidsa.net> wrote:

> Inspired by the effort to integrate JIT for executor acceleration I thought 
> selected simple algorithms working with array-oriented data should be 
> drastically accelerated by using SIMD instructions on modern hardware.

Hi Hans,

I have experimented with SIMD within Postgres last year, so I have
some idea of the benefits and difficulties. I do think we can profit
from SIMD more, but we must be very careful to manage complexity and
maximize usefulness. Hopefully I can offer some advice.

> - restrict to 64-bit architectures
>         These are the dominant server architectures, have the necessary data 
> formats and corresponding registers and operating instructions
> - start with Intel x86-64 SIMD instructions:
>         This is the vastly most used platform, available for development and 
> in practical use
> - don’t restrict the concept to only Intel x86-64, so that later people with 
> more experience on other architectures can jump in and implement comparable 
> algorithms
> - fall back to the established implementation in postgres in inappropriate 
> cases or on user request (GUC)

These are all reasonable goals, except GUCs are the wrong place to
choose hardware implementations -- it should Just Work.
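To make "Just Work" concrete: the usual pattern is a function pointer chosen once at startup from what the CPU actually reports, rather than a GUC. A minimal sketch, with invented function names and GCC/Clang's `__builtin_cpu_supports()` shown as one possible probe (a real patch would hide that behind configure checks):

```c
#include <assert.h>

/* Hypothetical dispatch sketch: the implementation is selected once at
 * startup based on CPU capabilities.  The hex_encode_* names are made up
 * for illustration; the return value just identifies which path we took. */
static int hex_encode_scalar_impl(void) { return 0; }   /* portable fallback */
static int hex_encode_avx2_impl(void)   { return 1; }   /* stand-in for a SIMD path */

static int (*hex_encode_impl)(void) = hex_encode_scalar_impl;

static void
choose_hex_encode_impl(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
	/* GCC/Clang extension; other compilers need their own probe. */
	if (__builtin_cpu_supports("avx2"))
		hex_encode_impl = hex_encode_avx2_impl;
#endif
}
```

Callers only ever see `hex_encode_impl()`; the user never has to know or care which variant runs.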

> - coding for maximum hardware usage instead of elegant programming
>         Once tested, the simple algorithm works as advertised and is used to 
> replace most execution parts of the standard implementation in C

-1

Maintaining good programming style is a key goal of the project. There
are certainly non-elegant parts in the code, but that has a cost and
we must consider tradeoffs carefully. I have read some of the
optimized code in glibc and it is not fun. They at least know they are
targeting one OS and one compiler -- we don't have that luxury.

> - focus optimization for the most advanced SIMD instruction set: AVX512
>         This provides the most advanced instructions and quite a lot of 
> large registers to help hide latency

-1

AVX512 is a hodge-podge of different instruction subsets and is
entirely lacking on some recent Intel server hardware. It's also only
available from a single chipmaker thus far.

> - The loops implementing the algorithm are written in NASM assembler:
>         NASM is actively maintained, has many output formats, follows the 
> Intel style, has all current instructions implemented and is fast.

> - The loops are mostly independent of operating systems, so all OS’s basing 
> on a NASM obj output format are supported:
>         This includes Linux and Windows as the most important ones

> - The algorithms use advanced techniques (constant and temporary registers) 
> to avoid most unnecessary memory accesses:
>         The assembly implementation gives you the full control over the 
> registers (unlike intrinsics)

On the other hand, intrinsics are easy to integrate into a C codebase
and relieve us from thinking about object formats. A performance
feature that happens to work only on common OS's is probably fine from
the user point of view, but if we have to add a lot of extra stuff to
make it work at all, that's not a good trade off. "Mostly independent"
of the OS is not acceptable -- we shouldn't have to think about the OS
at all when the coding does not involve OS facilities (I/O, processes,
etc).

> As an example I think of pg_dump to dump a huge amount of bytea data (not 
> uncommon in real applications). Most of these data are in toast tables, often 
> uncompressed due to their inherent structure. The dump must read the toast 
> pages into memory, decompose the page, hexdump the content, put the result in 
> an output buffer and trigger the I/O. By integrating all these steps into one, 
> big performance improvements can be achieved (but naturally not here in my 
> first implementation!).

Seems like a reasonable area to work on, but I've never measured.
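For reference, the hot loop being discussed is roughly the shape of the byte-at-a-time encoder Postgres uses today. This is a simplified sketch from memory, not the exact code (the real version lives in src/backend/utils/adt/encode.c and differs in detail):

```c
#include <stddef.h>

/* Simplified sketch of a scalar hex encoder: one table lookup per
 * nibble, two output chars per input byte. */
static const char hextbl[] = "0123456789abcdef";

static size_t
hex_encode_sketch(const char *src, size_t len, char *dst)
{
	const char *end = src + len;

	while (src < end)
	{
		unsigned char c = (unsigned char) *src++;

		*dst++ = hextbl[c >> 4];	/* high nibble first */
		*dst++ = hextbl[c & 0xF];
	}
	return len * 2;					/* output is twice the input length */
}
```

Any SIMD version has to beat this loop by enough to justify its complexity, which is why measuring the end-to-end pg_dump path matters.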

> The best result I could achieve was roughly 95 seconds for 1 Million dumps of 
> 1718 KB on a Tigerlake laptop using AVX512. This gives about 18 GB/s 
> source-hexdumping rate on a single core!
>
> In another run with postgres the time to hexdump about half a million tuples 
> with a bytea column yielding about 6 GB of output reduced the time from about 
> 68 seconds to 60 seconds, which clearly shows the postgres overhead for 
> executing the copy command on such a data set.

I don't quite follow -- is this patched vs. unpatched Postgres? I'm
not sure what's been demonstrated.

> The assembler routines should work on most x86-64 operating systems, but for 
> the moment only elf64 and WIN64 output formats are supported.
>
> The standard calling convention is followed mostly in the LINUX style, on 
> Windows the parameters are moved around accordingly. The same 
> assembler-source-code can be used on both platforms.

> I have updated the makefile to include the nasm command and the nasm flags, 
> but I need help to make these based on configure.
>
> I also have no knowledge on other operating systems (MAC-OS etc.)
>
> The calling conventions can be easily adopted if they differ but somebody 
> else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry
about obj formats and calling conventions, we'd better be getting
something *really* amazing in return.

> But I really need help by an expert to integrate it in the perl building 
> process.

> I would much appreciate if someone else could jump in for a patch to 
> configure-integration and another patch for .vcxobj integration.

It's a bit presumptuous to enlist others for specific help without
general agreement on the design, especially on the most tedious parts.
Also, here's a general engineering tip: If the non-fun part is too
complex for you to figure out, that might indicate the fun part is too
ambitious. I suggest starting with a simple patch with SSE2 (always
present on x86-64) intrinsics, one that anyone can apply and test
without any additional work. Then we can evaluate if the speed-up in
the hex encoding case is worth some additional complexity. As part of
that work, it might be good to see if some portable improved algorithm
is already available somewhere.
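As one sketch of what such an SSE2-only patch could look like, here is a way to hex-encode 16 input bytes per iteration using only intrinsics guaranteed on x86-64. The function name and structure are my own for illustration, not proposed patch code; the nibble-to-ASCII trick (add '0', then 39 more for nibbles above 9) is a well-known technique:

```c
#include <emmintrin.h>		/* SSE2, always present on x86-64 */
#include <stdint.h>

/* Hypothetical sketch: hex-encode exactly 16 bytes of src into 32
 * ASCII chars at dst, using only SSE2 instructions. */
static void
hex_encode_sse2_16(const uint8_t *src, char *dst)
{
	const __m128i mask_lo = _mm_set1_epi8(0x0F);
	const __m128i nine    = _mm_set1_epi8(9);
	const __m128i ascii0  = _mm_set1_epi8('0');
	const __m128i adjust  = _mm_set1_epi8('a' - '0' - 10);	/* +39 for a..f */

	__m128i in = _mm_loadu_si128((const __m128i *) src);

	/* split each byte into high and low nibbles */
	__m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask_lo);
	__m128i lo = _mm_and_si128(in, mask_lo);

	/* interleave so the high nibble of each byte comes out first */
	__m128i n0 = _mm_unpacklo_epi8(hi, lo);
	__m128i n1 = _mm_unpackhi_epi8(hi, lo);

	/* nibble -> ASCII: '0' + n, plus 39 more where n > 9 */
	__m128i gt0 = _mm_cmpgt_epi8(n0, nine);
	__m128i gt1 = _mm_cmpgt_epi8(n1, nine);

	n0 = _mm_add_epi8(_mm_add_epi8(n0, ascii0), _mm_and_si128(gt0, adjust));
	n1 = _mm_add_epi8(_mm_add_epi8(n1, ascii0), _mm_and_si128(gt1, adjust));

	_mm_storeu_si128((__m128i *) dst, n0);
	_mm_storeu_si128((__m128i *) (dst + 16), n1);
}
```

A real patch would wrap this in a loop with a scalar tail for lengths that aren't a multiple of 16, but even this core is enough to benchmark against the byte-at-a-time version.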

> There is much room for other implementations (checksum verification/setting, 
> aggregation, numeric datatype, merging, generate_series, integer and floating 
> point output …) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums
already have a hardware implementation on x86 and Arm. I don't know of
any practical workload where generate_series() is too slow.
Aggregation is an interesting case, but I'm not sure what the current
bottlenecks are.

-- 
John Naylor
EDB: http://www.enterprisedb.com