INTENTION

Inspired by the effort to integrate JIT for executor acceleration, I thought
that selected simple algorithms working on array-oriented data could be
drastically accelerated by using SIMD instructions on modern hardware.

I want to introduce this style of programming with the example of hex_encode:
- it operates on arrays (bytea)
- it is a simple algorithm
- in some situations it partially limits performance (e.g. pg_dump)
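
As a baseline, here is a minimal scalar version in C of the algorithm to be
accelerated (an illustrative sketch, not the actual PostgreSQL source):

    #include <stddef.h>

    /* encode len bytes from src as lowercase hex digits into dst;
       dst must provide 2 * len bytes; returns the number of bytes written */
    static size_t
    hex_encode_scalar(const unsigned char *src, size_t len, char *dst)
    {
        static const char hextbl[] = "0123456789abcdef";
        size_t      i;

        for (i = 0; i < len; i++)
        {
            *dst++ = hextbl[src[i] >> 4];   /* high nibble */
            *dst++ = hextbl[src[i] & 0x0f]; /* low nibble */
        }
        return len * 2;
    }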

IMPLEMENTATION GUIDELINES

The main goal is to accelerate common cases on the most common hardware by
exploiting all the resources the hardware delivers.
The following guidelines led me to a first implementation:

- restrict to 64-bit architectures:
        These are the dominant server architectures and have the necessary
data formats, registers and instructions
- start with Intel x86-64 SIMD instructions:
        This is by far the most widely used platform, available for
development and in practical use
- don't restrict the concept to Intel x86-64, so that people with more
experience on other architectures can later jump in and implement comparable
algorithms
- fall back to the established implementation in PostgreSQL in unsuitable
cases or on user request (GUC)
- implement the leaf functions/procedures in assembly language:
        These consist mostly of a central loop, without calling subroutines or
doing additional branching

- code for maximum hardware usage instead of elegant programming:
        Once tested, the simple algorithm works as advertised and is used to
replace most execution parts of the standard implementation in C

- keep the footprint isolated by integrating it only into the specific
subroutine (here hex_encode):
        This ensures that the requirements for fast execution are met (e.g.
buffer sizes) and no repeated checks are needed, as they would be in a library
use case.

- try to keep both vector execution ports doing useful work at all times by
avoiding latency stalls

- try to access memory in a linear fashion (reading from the input buffer,
writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512:
        It provides the most advanced instructions and quite a lot of large
registers to aid in latency hiding

- if possible, provide fallback implementations for older SIMD standards (e.g.
AVX2 or SSE2):
        This is useful on many older server and client processors, but due to
their smaller number of registers, latency hiding and fully loaded execution
queues cannot always be achieved.
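
As a sketch of how this runtime selection could look in C (the function names
are made up, and GCC/Clang's __builtin_cpu_supports() stands in for whatever
CPUID handling the final patch uses):

    #include <stddef.h>

    typedef size_t (*hex_encode_fn) (const unsigned char *, size_t, char *);

    /* assembly loops, one per instruction set (hypothetical names) */
    extern size_t hex_encode_avx512(const unsigned char *, size_t, char *);
    extern size_t hex_encode_avx2(const unsigned char *, size_t, char *);
    extern size_t hex_encode_sse2(const unsigned char *, size_t, char *);

    static hex_encode_fn
    choose_hex_encode(void)
    {
        if (__builtin_cpu_supports("avx512bw"))
            return hex_encode_avx512;
        if (__builtin_cpu_supports("avx2"))
            return hex_encode_avx2;
        /* SSE2 is part of the x86-64 baseline, so always available */
        return hex_encode_sse2;
    }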


IMPLEMENTATION DETAILS

- The loops implementing the algorithm are written in NASM assembler:
        NASM is actively maintained, has many output formats, follows the
Intel style, has all current instructions implemented, and is fast.

- The loops are mostly independent of the operating system, so every OS for
which NASM has an object output format is supported:
        This includes Linux and Windows as the most important ones

- The algorithms use advanced techniques (constant and temporary registers) to
avoid most unnecessary memory accesses:
        The assembly implementation gives you full control over the registers
(unlike intrinsics)

- Multiple dependency chains work interleaved to minimize latencies:
        The code for the chains is often interspersed and uses almost all
available registers.
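
        Transposed to plain C for illustration (the real interleaving happens
in the assembly loops): a single accumulator forms one long dependency chain,
while several independent accumulators let the hardware overlap the latencies:

    #include <stddef.h>

    /* one chain: every addition has to wait for the previous one */
    static long
    sum1(const long *a, size_t n)
    {
        long        s = 0;

        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* four chains: the four additions per round are independent and can
       be in flight at the same time (n assumed divisible by 4) */
    static long
    sum4(const long *a, size_t n)
    {
        long        s0 = 0, s1 = 0, s2 = 0, s3 = 0;

        for (size_t i = 0; i < n; i += 4)
        {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }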

- Some instructions (moves, zeroing) are executed outside the processor's
execution ports:
        These don't consume execution cycles on a port, but their latency
still has to be considered.

- Some vector instructions (e.g. multiply-add) have a latency of 5 cycles:
        This means that after the instruction is issued, the processor has to
wait 5 cycles until the result can be used in the same dependency chain. To
avoid this stall and keep both vector execution ports (p0 and p5) busy, you
need 9 other instructions in between doing work on other streams of the
algorithm: 2 ports times 5 cycles gives 10 issue slots, one of which is taken
by the instruction itself.

- All loops are implemented as separate C-callable functions (according to the
OS calling convention):
        They are all leaf functions, calling no other subroutines.
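
        On the C side such a loop then appears as an ordinary external
declaration; the NASM source has to follow the respective convention (System V
passes src/len/dst in rdi/rsi/rdx, Win64 in rcx/rdx/r8). The prototype below
is an illustrative sketch:

    /* assembly leaf loop: consumes up to len input bytes and returns the
       number of input bytes it actually completed */
    extern size_t hex_encode_avx512(const unsigned char *src, size_t len,
                                    char *dst);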

- The decision which implementation is chosen is made at the caller side by a
special dispatcher routine:
        The caller handles the architectural capabilities (available
instruction sets) and knows the required work: there is often a minimum amount
of work required for calling a provided implementation efficiently.
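
        A possible shape of such a dispatcher (all names are made up for
illustration; enable_simd stands in for the GUC mentioned above,
HEX_SIMD_MIN_INPUT for the minimum work count discussed in the next point, and
hex_encode_scalar for the standard C implementation):

    #include <stdbool.h>
    #include <stddef.h>

    static bool enable_simd = true;     /* stands in for a GUC */

    static size_t
    hex_encode_dispatch(const unsigned char *src, size_t len, char *dst)
    {
        if (enable_simd && len >= HEX_SIMD_MIN_INPUT)
        {
            /* the SIMD loop may stop early (error or tail handling);
               it reports how many input bytes it completed */
            size_t      done = hex_encode_avx512(src, len, dst);

            return done * 2 +
                hex_encode_scalar(src + done, len - done, dst + done * 2);
        }
        return hex_encode_scalar(src, len, dst);
    }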

- Loops should run at least 2-4 times to compensate for the initialization
overhead:
        This implies a certain minimum work count, depending on the specific
SIMD implementation.
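
        For example (with hypothetical numbers, since the real per-iteration
width depends on the implementation): if one AVX512 round reads 32 input bytes
and writes 64 hex digits, requiring at least four rounds yields a threshold
like:

    /* hypothetical sizing for the dispatcher check shown above */
    #define HEX_AVX512_BYTES_PER_ITER   32
    #define HEX_SIMD_MIN_INPUT  (4 * HEX_AVX512_BYTES_PER_ITER) /* 128 */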

- The loops terminate after detecting an error (e.g. wrong input data) and
return the successfully completed amount of work:
        The standard linear implementation then takes over, with its already
established error handling (as the dispatcher sketch above shows).

- The loops work optimally with some extra output buffer space at the end, to
be able to overshoot in the last round:
        Nonetheless, the correct amount of work is returned to the caller, and
one vector's worth of output buffer following the real result is zeroed out
(currently disabled!)
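
        In allocation terms this means the caller reserves one extra vector
width, so the final full-width store cannot run past the logical result (an
illustrative sketch using PostgreSQL's palloc):

    /* room for the exact result plus one 64-byte vector of slack,
       so the last full-width store in the loop may overshoot safely */
    char *dst = palloc(len * 2 + 64);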

- The loop may preload some data after the end of the input buffer, but
ensures that the following page boundary is never crossed, to avoid any access
violation:
        This does no harm to the memory system because the output buffer has
supplemental space at the end, but it could be changed to leave the tail
handling to the standard implementation if deemed unsupportable (as for now).
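
        The safety condition, expressed in C terms for illustration (the
constant assumes 4 KB pages; the real loops do the equivalent check in
assembly):

    #include <stddef.h>
    #include <stdint.h>

    /* a full vector load starting at p cannot fault as long as it does
       not extend into the next 4 KB page */
    static int
    safe_to_overread(const unsigned char *p, size_t vecsize)
    {
        uintptr_t   page_off = (uintptr_t) p & 4095;

        return page_off + vecsize <= 4096;
    }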


(to be continued...)
