Hi Folks,
This is an edited version of a message posted on the LLVM Discourse.
I want to share what I have been working on as I feel it may be of
interest to the GCC compiler developers, specifically concerning alias
analysis and optimizations for iteration of sparse block-based
multi-arrays. I also have questions about optimization related to this
implementation, specifically the observability of alias analysis
pessimization and memory to register optimizations.
I have been working on _zip_vector_. _zip_vector_ is a compressed
variable length array that uses vectorized block codecs to compress and
decompress integers using dense variable bit width deltas as well as
compressing constant values and sequences. _zip_vector_ employs integer
block codecs optimized for vector instruction sets using the Google
Highway C++ library for portable SIMD/vector intrinsics.
The high-level class supports 32-bit and 64-bit compressed integer arrays:
- `zip_vector<i32>`
  - { 8, 16, 24 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24 } bit signed deltas with per-block IV.
  - constants and sequences using per-block IV and delta.
- `zip_vector<i64>`
  - { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24, 32, 48 } bit signed deltas with per-block IV.
  - constants and sequences using per-block IV and delta.
Here is a link to the implementation:
- https://github.com/metaparadigm/zvec/
The README has a background on the delta encoding scheme. If you read
the source, "zvec_codecs.h" contains the low-level vectorized block
codecs while "zvec_block.h" contains a high-level interface to the block
codecs using cpuid-based dynamic dispatch. The high-level sparse integer
vector class leveraging the block codecs is in "zip_vector.h". It has
been tested with GCC and LLVM on x86-64 using SSE3, AVX, and AVX-512.
The principle of operation is to employ simplified block codecs dedicated
solely to compressing fixed-width integers, which makes them extremely
fast, unlike typical compression algorithms: they run _on the order of
30-150 GiB/sec_ on a single core when operating within the L1 cache on
Skylake AVX-512. zip_vector achieves its
performance by reducing global memory bandwidth because it fetches and
stores compressed data to and from RAM and then uses extremely fast
vector codecs to pack and unpack compressed blocks within the L1 cache.
From this perspective, it is similar to texture compression codecs, but
the specific use case is closer to storage for index arrays because the
block codecs are lossless integer codecs. The performance is striking in
that it can be faster for in-order read-only traversal than a regular
array, while the primary goal is footprint reduction.
The design use case is an offsets array that might contain 64-bit values
but usually contains smaller values. With that in mind, we wanted
the convenience of simply using `zip_vector<i64>` or `zip_vector<i32>`
while benefiting from the space advantages of storing data using 8, 16,
24, 32, and 48-bit deltas.
Q. Why is it specifically of interest to GCC developers?
I think the best way to answer this is with a question: how can we model
a block-based iterator for a sparse array that is amenable to vectorization?
Some aspects of the zip_vector iterator design are *not done yet* in the
current implementation. Iteration has two phases. There is an
inter-block phase at the boundary of each block (the logic inside
`switch_page`) that scans and compresses the previously active block,
updates the page index, and decompresses the next block. Then there is a
_broad phase_ for intra-block accesses, which is amenable to
vectorization due to the use of fixed-size blocks.
*Making 1D iteration as fast as 2D iteration*
Firstly, there is a lot of analysis concerning optimization of the
iterator that I would like to discuss. One issue is hoisting the
inter-block boundary test out of the fast path, so that at each block
crossing the next block ending is calculated in advance, and the broad
phase then only requires a pointer increment and a comparison against
addresses held in registers.
The challenge is getting past compiler alias analysis. Alias analysis
seems to prevent the sum of the slab base address and active area offset
from being cached in a register; instead it is demoted to memory
accesses. These member variables hold the location of the slab and the
offset to the uncompressed page, both of which are on the critical path.
When these values are in memory, _it adds 4 or more cycles of latency
for the base address calculation on every access_. There is also the
possibility of hoisting and folding the active page check, as we can
make constructive proofs concerning changes to that value.
Benchmarks compare the performance of 1D and 2D style iterators. At
certain times the compiler would hoist the base and offset pointers from
member variable accesses into registers in the 1D version making a
noticeable difference in performance. Crucially, in single-threaded
code, the only way the pointer to the active region can change is inside
`switch_page(size_t y)`.
The potential payoff is huge because one may be able to access data ~
0.9X - 3.5X faster than simply accessing integers in RAM when combining
the reduction in global memory bandwidth with auto-vectorization, but
the challenge is generating safe code for the simpler 1D iteration case
that is as efficient as explicit 2D iteration.
1D iteration:

    for (auto x : vec) x2 += x;

2D iteration:

    for (size_t i = 0; i < n; i += decltype(vec)::page_interval) {
        i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
        while (cur != end) x2 += *cur++;
    }
Note: In this example, I avoid having a different size loop tail but
that is also a consideration.
I trialled several techniques using a simplified version of the
`zip_vector` class where `switch_page` was substituted with simple logic
so that it was possible to get the compiler to coalesce the slab base
pointer and active area offset into a single calculation upon page
crossings. There is also hoisting of the active_page check
(_y-parameter_) to only occur on block crossings. I found that when the
`switch_page` implementation became more complex, likely once it gained
an extern call to `malloc`, the compiler would fall back to
conservatively fetching the base pointer and offset through a pointer to
a member variable. See here:
https://github.com/metaparadigm/zvec/blob/756e583472028fcc36e94c0519926978094dbb00/src/zip_vector.h#L491-L496
So I got to the point where I thought it would help to get input from
compiler developers to figure out how to observe which internal
constraints are violated by `switch_page`, preventing the base pointer
and offset address calculation from being cached in registers.
"slab_data" and "active_area" are neither volatile nor atomic, so
threads should not expect their updates to be atomic or go through memory.
I tried a large number of small experiments, e.g. collapsing
`slab_data` and `active_area` into one pointer at the end of
`switch_page` so that only one pointer needs to be accessed. Also, the
`active_page` test does not necessarily need to be in the broad phase. I
attempted to manually hoist these variables by modifying the iterators
but found it was necessary to keep them where they were to avoid
introducing stateful invariants to the iterators that could become
invalidated by read accesses.
Stack/register-based coroutines could help due to the two distinct
states in the iterator.
It is not as simple as it might seem on the surface. I tried several
approaches to coalesce address calculations and move them into the page
switch logic, all leading to a performance fall-off, almost as if the
compiler was carrying some pessimization that forced touched member
variables to be accessed via memory instead of registers. At one stage
the 1D form was fast with GCC, but after adding support for
`zip_vector<i32>` and `zip_vector<i64>`, performance fell off. So I
would like to observe exactly which code causes pessimization of
accesses to member variables, preventing them from being held in
registers and forcing accesses to go through memory instead. It seems it
should be possible to make 1D iteration faster than _std::vector_, as I
did witness this with GCC, but the optimization does not seem to be
stable.
So that's what I would like help with...
Regarding the license, zip vector and its block codecs are released
under "PLEASE LICENSE", a permissive ISC-derived license with a notice
about implied copyright. The license removes the ISC restriction that
all copies must include the copyright message, so while it is still
copyrighted material, i.e. it is not public domain, it is, in fact,
compatible with the Apache Software License.
Please have a look at the benchmarks.
Regards,
Michael J. Clark
Twitter: @larkmjc