On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschm...@nidsa.net> wrote:
> Inspired by the effort to integrate JIT for executor acceleration I thought
> selected simple algorithms working with array-oriented data should be
> drastically accelerated by using SIMD instructions on modern hardware.

Hi Hans,

I have experimented with SIMD within Postgres last year, so I have some idea of the benefits and difficulties. I do think we can profit from SIMD more, but we must be very careful to manage complexity and maximize usefulness. Hopefully I can offer some advice.

> - restrict to 64-bit architectures:
>       These are the dominant server architectures, have the necessary data
>       formats and corresponding registers and operating instructions
> - start with Intel x86-64 SIMD instructions:
>       This is the vastly most used platform, available for development and
>       in practical use
> - don't restrict the concept to only Intel x86-64, so that later people with
>   more experience on other architectures can jump in and implement comparable
>   algorithms
> - fall back to the established implementation in Postgres in inappropriate
>   cases or on user request (GUC)

These are all reasonable goals, except that GUCs are the wrong place to choose hardware implementations -- it should Just Work.

> - coding for maximum hardware usage instead of elegant programming:
>       Once tested, the simple algorithm works as advertised and is used to
>       replace most execution parts of the standard implementation in C

-1. Maintaining good programming style is a key goal of the project. There are certainly non-elegant parts in the code, but that has a cost, and we must consider the tradeoffs carefully. I have read some of the optimized code in glibc, and it is not fun. They at least know they are targeting one OS and one compiler -- we don't have that luxury.
> - focus optimization on the most advanced SIMD instruction set, AVX512:
>       This provides the most advanced instructions and quite a lot of
>       large registers to aid in latency avoidance

-1. AVX512 is a hodge-podge of different instruction subsets and is entirely lacking on some recent Intel server hardware. It is also only available from a single chipmaker thus far.

> - The loops implementing the algorithm are written in NASM assembler:
>       NASM is actively maintained, has many output formats, follows the
>       Intel style, has all current instructions implemented, and is fast.
> - The loops are mostly independent of operating systems, so all OS's based
>   on a NASM obj output format are supported:
>       This includes Linux and Windows as the most important ones
> - The algorithms use advanced techniques (constant and temporary registers)
>   to avoid most unnecessary memory accesses:
>       The assembly implementation gives you full control over the
>       registers (unlike intrinsics)

On the other hand, intrinsics are easy to integrate into a C codebase and relieve us from thinking about object formats. A performance feature that happens to work only on common OS's is probably fine from the user's point of view, but if we have to add a lot of extra machinery to make it work at all, that's not a good tradeoff. "Mostly independent" of the OS is not acceptable -- we shouldn't have to think about the OS at all when the code does not involve OS facilities (I/O, processes, etc.).

> As an example I think of pg_dump dumping a huge amount of bytea data (not
> uncommon in real applications). Most of these data are in toast tables, often
> uncompressed due to their inherent structure. The dump must read the toast
> pages into memory, decompose the page, hexdump the content, put the result in
> an output buffer and trigger the I/O. By integrating all these steps into one,
> big performance improvements can be achieved (but naturally not here in my
> first implementation!).
Seems like a reasonable area to work on, but I've never measured it.

> The best result I could achieve was roughly 95 seconds for 1 million dumps of
> 1718 KB on a Tigerlake laptop using AVX512. This gives about 18 GB/s
> source-hexdumping rate on a single core!
>
> In another run with postgres the time to hexdump about half a million tuples
> with a bytea column yielding about 6 GB of output was reduced from about
> 68 seconds to 60 seconds, which clearly shows the postgres overhead for
> executing the copy command on such a data set.

I don't quite follow -- is this patched vs. unpatched Postgres? I'm not sure what's been demonstrated.

> The assembler routines should work on most x86-64 operating systems, but for
> the moment only elf64 and WIN64 output formats are supported.
>
> The standard calling convention is followed mostly in the Linux style; on
> Windows the parameters are moved around accordingly. The same assembler
> source code can be used on both platforms.
> I have updated the makefile to include the nasm command and the nasm flags,
> but I need help to make these based on configure.
>
> I also have no knowledge of other operating systems (macOS etc.)
>
> The calling conventions can be easily adapted if they differ, but somebody
> else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry about object formats and calling conventions, we'd better be getting something *really* amazing in return.

> But I really need help from an expert to integrate it in the perl building
> process.
> I would much appreciate it if someone else could jump in for a patch for
> configure integration and another patch for .vcxobj integration.

It's a bit presumptuous to enlist others for specific help without general agreement on the design, especially on the most tedious parts. Also, here's a general engineering tip: if the non-fun part is too complex for you to figure out, that might indicate the fun part is too ambitious.
I suggest starting with a simple patch using SSE2 (always present on x86-64) intrinsics, one that anyone can apply and test without any additional work. Then we can evaluate whether the speed-up in the hex encoding case is worth some additional complexity. As part of that work, it might be good to see if some portable improved algorithm is already available somewhere.

> There is much room for other implementations (checksum verification/setting,
> aggregation, numeric datatype, merging, generate_series, integer and floating
> point output …) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums already have a hardware implementation on x86 and Arm. I don't know of any practical workload where generate_series() is too slow. Aggregation is an interesting case, but I'm not sure what the current bottlenecks are.

--
John Naylor
EDB: http://www.enterprisedb.com