Hi Folks,
This is an edited version of a message posted on the LLVM Discourse.
I want to share what I have been working on as I feel it may be of
interest to the GCC compiler developers, specifically concerning alias
analysis and optimizations for iteration of sparse block-based
multi-arrays. I also have questions about optimization related to this
implementation, specifically the observability of alias analysis
pessimization and memory to register optimizations.
I have been working on _zip_vector_. _zip_vector_ is a compressed
variable length array that uses vectorized block codecs to compress and
decompress integers using dense variable bit width deltas as well as
compressing constant values and sequences. _zip_vector_ employs integer
block codecs optimized for vector instruction sets using the Google
Highway C++ library for portable SIMD/vector intrinsics.
The high-level class supports 32-bit and 64-bit compressed integer arrays:
- `zip_vector<i32>`
  - { 8, 16, 24 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24 } bit signed deltas with per-block IV.
  - constants and sequences using per-block IV and delta.
- `zip_vector<i64>`
  - { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24, 32, 48 } bit signed deltas with per-block IV.
  - constants and sequences using per-block IV and delta.
Here is a link to the implementation:
- https://github.com/metaparadigm/zvec/
The README has a background on the delta encoding scheme. If you read
the source, "zvec_codecs.h" contains the low-level vectorized block
codecs while "zvec_block.h" contains a high-level interface to the block
codecs using cpuid-based dynamic dispatch. The high-level sparse integer
vector class leveraging the block codecs is in "zip_vector.h". It has
been tested with GCC and LLVM on x86-64 using SSE3, AVX, and AVX-512.
The principle of operation is to employ simplified block codecs dedicated
solely to compressing fixed-width integers, which makes them extremely
fast, unlike typical compression algorithms: they run _on the order of
30-150 GiB/sec_ on a single core when operating within the L1 cache on
Skylake AVX-512. zip_vector achieves its
performance by reducing global memory bandwidth because it fetches and
stores compressed data to and from RAM and then uses extremely fast
vector codecs to pack and unpack compressed blocks within the L1 cache.
From this perspective, it is similar to texture compression codecs, but
the specific use case is closer to storage for index arrays because the
block codecs are lossless integer codecs. The performance is striking in
that it can be faster for in-order read-only traversal than a regular
array, while the primary goal is footprint reduction.
The design use case is an offsets array that might contain 64-bit values
but usually contains smaller values. With that in mind, we wanted
the convenience of simply using `zip_vector<i64>` or `zip_vector<i32>`
while benefiting from the space advantages of storing data using 8, 16,
24, 32, and 48-bit deltas.
Q. Why is it specifically of interest to GCC developers?
I think the best way to answer this is with a question: how can we model
a block-based iterator for a sparse array that is amenable to vectorization?
Some aspects of the zip_vector iterator design are *not done yet* in the
current implementation. Iteration has two phases. There is an
inter-block phase at the boundary of each block (the logic inside
`switch_page`) that scans and compresses the previously active block,
updates the page index, and decompresses the next block. Then there is a
_broad phase_ for intra-block accesses, which is amenable to
vectorization due to the use of fixed-size blocks.
*Making 1D iteration as fast as 2D iteration*
Firstly, there is a lot of analysis concerning optimization of the
iterator that I would like to discuss. One issue is hoisting the
inter-block boundary test out of the fast path, so that at each block
crossing the next block ending is calculated in advance, and the broad
phase then only requires a pointer increment and a comparison against
addresses held in registers.
The challenge is getting past compiler alias analysis. Alias analysis
seems to prevent the sum of the slab base address and active area offset
from being cached in a register; instead it is demoted to memory
accesses. These member variables hold the location of the slab and the
offset to the uncompressed page, both of which are on the critical path.
When these values are in memory, _it adds 4 or more cycles of latency
for the base address calculation on every access_. There is also the
possibility of hoisting and folding the active page check, as we can
make constructive proofs concerning changes to that value.
Benchmarks compare the performance of 1D and 2D style iterators. At
certain times the compiler would hoist the base and offset pointers from
member variable accesses into registers in the 1D version making a
noticeable difference in performance. Crucially, in single-threaded
code, the only way the pointer to the active region can change is inside
`switch_page(size_t y)`.
The potential payoff is huge because one may be able to access data ~
0.9X - 3.5X faster than simply accessing integers in RAM when combining
the reduction in global memory bandwidth with auto-vectorization, but
the challenge is generating safe code for the simpler 1D iteration case
that is as efficient as explicit 2D iteration.
1D iteration:

    for (auto x : vec) x2 += x;

2D iteration:

    for (size_t i = 0; i < n; i += decltype(vec)::page_interval) {
        i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
        while (cur != end) x2 += *cur++;
    }
Note: In this example, I avoid having a different size loop tail but
that is also a consideration.
I trialled several techniques using a simplified version of the
`zip_vector` class where `switch_page` was substituted with simple logic
so that it was possible to get the compiler to coalesce the slab base
pointer and active area offset into a single calculation upon page
crossings. There is also hoisting of the active_page check
(_y-parameter_) to only occur on block crossings. I found that when the
`switch_page` implementation became more complex, likely once it gained
an extern call to `malloc`, the compiler would fall back to
conservatively fetching the base pointer and offset through a pointer to
a member variable. See here:
https://github.com/metaparadigm/zvec/blob/756e583472028fcc36e94c0519926978094dbb00/src/zip_vector.h#L491-L496
So I got to the point where I thought it would help to get input from
compiler developers to figure out how to observe which internal
constraints are violated by `switch_page`, preventing the base pointer
and offset address calculation from being cached in registers.
"slab_data" and "active_area" are neither volatile nor atomic, so
threads should not expect their updates to be atomic or go through memory.
I tried a large number of small experiments, e.g. collapsing
`slab_data` and `active_area` into one pointer at the end of
`switch_page` so that only one pointer needs to be accessed. Also, the
`active_page` test does not necessarily need to be in the broad phase. I
attempted to manually hoist these variables by modifying the iterators
but found it was necessary to keep them where they were to avoid
introducing stateful invariants to the iterators that could become
invalidated by read accesses.
Stack/register-based coroutines could help due to the two distinct
states in the iterator.
It is not as simple as it might seem on the surface. I tried several
approaches to coalesce address calculations and move them into the page
switch logic, all leading to a performance fall-off, almost as if the
compiler was carrying some pessimization that forced touched member
variables to be accessed via memory instead of registers. At one stage
the 1D form was fast with GCC, but after adding support for
`zip_vector<i32>` and `zip_vector<i64>`, performance fell off. So I
would like to observe exactly which code causes pessimization of
accesses to member variables, preventing them from being held in
registers and forcing accesses to go through memory instead. It seems it
should be possible to make 1D iteration faster than _std::vector_, as I
did witness this with GCC, but the optimization does not seem to be
stable.
So that's what I would like help with...
Regarding the license, zip vector and its block codecs are released
under "PLEASE LICENSE", a permissive ISC-derived license with a notice
about implied copyright. The license removes the ISC restriction that
all copies must include the copyright message, so while it is still
copyrighted material, i.e. it is not public domain, it is, in fact,
compatible with the Apache Software License.
Please have a look at the benchmarks.
Regards,
Michael J. Clark
Twitter: @larkmjc