On 2/16/24 10:06, Patrick Palka wrote:
On Thu, 15 Feb 2024, Patrick Palka wrote:
One would expect consecutive calls to bytes_in/out::b for streaming
adjacent bits, as we do for tree flag streaming, to at least be
optimized by the compiler into individual bit operations using
statically known bit positions (and ideally merged into larger sized
reads/writes).
Did you have any thoughts about how feasible it would be to use
data-streamer.h? I didn't see a response to richi's mail.
Unfortunately this doesn't happen because the compiler has trouble
tracking the values of this->bit_pos and this->bit_val across such
calls, likely because the compiler doesn't know 'this' and so it's
treated as global memory. This means for each consecutive bit stream
operation, bit_pos and bit_val are loaded from memory, checked if
buffering is needed, and finally the bit is extracted from bit_val
according to the (unknown) bit_pos, even though relative to the previous
operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
is just 1 larger. This ends up being quite slow, with tree_node_bools
taking 10% of time when streaming in parts of the std module.
This patch optimizes this by making tracking of bit_pos and bit_val
easier for the compiler. Rather than bit_pos and bit_val being members
of the (effectively global) bytes_in/out objects, this patch factors out
the bit streaming code/state into separate classes bits_in/out that get
constructed locally as needed for bit streaming. Since these objects
are now clearly local, the compiler can more easily track their values.
Please add this rationale to the bits_in comment.
And since bit streaming is intended to be batched it's natural for these
new classes to be RAII-enabled such that the bit stream is flushed upon
destruction.
In order to make the most of this improved tracking of bit position,
this patch changes parts where we conditionally stream a tree flag
to unconditionally stream (the flag or a dummy value). That way
the number of bits streamed and the respective bit positions are as
statically known as reasonably possible. In lang_decl_bools and
lang_type_bools we flush the current bit buffer at the start so that
subsequent bit positions are statically known. And in core bools, we
can add explicit early exits utilizing invariants that the compiler
can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
it doesn't have TS_DECL_WITH_VIS). Finally if we're streaming fewer
than 4 bits, it's more space efficient to stream them as individual
bytes rather than as packed bits (due to the 32-bit buffer).
Oops, this last sentence is wrong. Although the size of the bit buffer
is 32 bits, upon flushing we rewind unused bytes within the buffer,
which means streaming 2-8 bits ends up using only one byte not all four.
So v2 below undoes this pessimization.
This patch also moves the definitions of the relevant streaming classes
into anonymous namespaces so that the compiler can make more informed
decisions about inlining their member functions.
I'm curious why you decided to put namespace { } around each class
rather than a larger part of the file? Not asking for a change, just
curious.
I'm also surprised that this would make a difference for inline
functions? Is this just to allow the compiler to inline member
functions defined outside the class body without marking them inline?
In any case, please add a rationale comment to the (first) anonymous
namespace.
After this patch, compile time for a simple Hello World using the std
module is reduced by 7% with a release compiler. The on-disk size of
the std module increases by 0.7% (presumably due to the extra flushing
done in lang_decl_bools and lang_type_bools).
The on-disk std module now only grows 0.4% instead of 0.7%.
The bit stream out performance isn't improved as much as the stream in
due to the spans/lengths instrumentation performed on stream out (which
probably should be e.g. removed for release builds?)
Based on CHECKING_P, sure.
diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 0291d456ff5..2ecee007e8a 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -694,13 +656,126 @@ protected:
/* Instrumentation. */
static unsigned spans[4];
static unsigned lengths[4];
- static int is_set;
+ friend struct bits_out;
};
+} // anon namespace
+
+/* Finish bit packet. Rewind the bytes not used. */
Missing blank line.
+static unsigned
+bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
+{
+ gcc_assert (bit_pos);
+ unsigned bytes = (bit_pos + 7) / 8;
+ bits.unuse (4 - bytes);
+ bit_pos = 0;
+ bit_val = 0;
+ return bytes;
+}
+
@@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
if (TREE_CODE_CLASS (code) != tcc_type)
/* This is TYPE_CACHED_VALUES_P for types. */
WB (t->base.public_flag);
+ else
+ WB (false);
Can we simplify the pattern for conditionally writing/reading? It looks
easy to forget to add the else. Perhaps a COND_WB macro with rationale
comment?
Jason