Do you want to use a ubyte instead of a byte here?
Yes, that was a silly mistake. Fixing it removed the need for
all the masking operations, which gave the biggest speedup.
Also, for your alpha channel:
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;
If fg[3] = 0 then inverseAlpha = 256, which is out of the range
that can be stored in a ubyte.
I think my logic is correct. The calculations are done with
ints, so inverseAlpha is never stored in a ubyte; only the final
result is cast to a byte. The reason for the +1 is the >> 8,
which divides by 256.
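To check the range argument above, here is a minimal sketch that runs the same arithmetic over every 8-bit combination. The helper name blendChannel is hypothetical and not part of the original code; it only mirrors the alpha / inverseAlpha expressions from the post.

```d
import std.stdio : writeln;

// Hypothetical helper (not from the original post): blends one 8-bit
// channel using the same +1 / 257 - alpha / >> 8 arithmetic.
ubyte blendChannel(ubyte fg, ubyte bg, ubyte a)
{
    immutable int alpha = a + 1;               // 1 .. 256
    immutable int inverseAlpha = 257 - alpha;  // 1 .. 256, kept in an int
    return cast(ubyte)((alpha * fg + inverseAlpha * bg) >> 8);
}

void main()
{
    // Exhaustively find the largest value produced before the cast.
    int maxSeen = 0;
    foreach (a; 0 .. 256)
        foreach (fg; 0 .. 256)
            foreach (bg; 0 .. 256)
            {
                immutable int v = ((a + 1) * fg + (256 - a) * bg) >> 8;
                if (v > maxSeen)
                    maxSeen = v;
            }
    writeln(maxSeen); // prints 255

    assert(maxSeen == 255);
    assert(blendChannel(255, 0, 255) == 255); // opaque foreground wins
    assert(blendChannel(255, 0, 0) == 0);     // transparent foreground vanishes
    assert(blendChannel(0, 255, 0) == 255);   // background kept when alpha is 0
}
```

The two weights always sum to 257, so the largest intermediate value is 257 * 255 = 65535, which is still 255 after the shift. The cast therefore never truncates anything.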
class Framebuffer
{
    uint[] framebufferData;
    uint framebufferWidth;
    uint framebufferHeight;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y,
                   uint width, uint height, uint color)
{
    // View the color as four bytes; fg[3] is the alpha channel
    // (this assumes a little-endian layout with alpha in the top byte).
    const ubyte* fg = cast(const ubyte*)&color;
    immutable uint alpha = fg[3] + 1;
    immutable uint invAlpha = 257 - alpha;
    // The foreground terms do not change per pixel, so compute them once.
    immutable uint afg0 = alpha * fg[0];
    immutable uint afg1 = alpha * fg[1];
    immutable uint afg2 = alpha * fg[2];
    foreach (i; y .. y + height)
    {
        // Index of the first pixel of this row inside the rectangle.
        uint start = x + i * framebuffer.framebufferWidth;
        foreach (j; 0 .. width)
        {
            ubyte* bg = cast(ubyte*)(&framebuffer.framebufferData[start + j]);
            bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
            bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
            bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
            bg[3] = 0xff;
        }
    }
}
Can this be made faster with SIMD? (I don't know much about it;
maybe the data and algorithm don't fit it?)
Can this be parallelized with any real gains?
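On the parallelization question, one option is to split the rows across threads, since each row writes to its own pixels. Below is a rough sketch using std.parallelism, not a measured result. The function name drawRectangleParallel is made up for illustration, the field names follow the Framebuffer class above, and the byte layout is assumed to be little-endian with alpha in the top byte.

```d
import std.parallelism : parallel;
import std.range : iota;

class Framebuffer
{
    uint[] framebufferData;
    uint framebufferWidth;
    uint framebufferHeight;
}

// Hypothetical variant of drawRectangle: same blend, rows handed to the
// default task pool. Rows never overlap, so no locking is needed.
void drawRectangleParallel(Framebuffer framebuffer, uint x, uint y,
                           uint width, uint height, uint color)
{
    const ubyte* fg = cast(const ubyte*)&color;
    immutable uint alpha = fg[3] + 1;
    immutable uint invAlpha = 257 - alpha;
    immutable uint afg0 = alpha * fg[0];
    immutable uint afg1 = alpha * fg[1];
    immutable uint afg2 = alpha * fg[2];
    immutable uint stride = framebuffer.framebufferWidth;
    uint[] data = framebuffer.framebufferData;

    foreach (i; parallel(iota(y, y + height)))
    {
        immutable uint start = x + i * stride;
        foreach (j; 0 .. width)
        {
            ubyte* bg = cast(ubyte*)(&data[start + j]);
            bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
            bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
            bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
            bg[3] = 0xff;
        }
    }
}

void main()
{
    auto fb = new Framebuffer;
    fb.framebufferWidth = 8;
    fb.framebufferHeight = 8;
    fb.framebufferData = new uint[](64);
    fb.framebufferData[] = 0xFF000000; // opaque black background

    // Fully opaque color over a 4x4 rectangle at (2, 2).
    drawRectangleParallel(fb, 2, 2, 4, 4, 0xFFFF0000);

    assert(fb.framebufferData[2 * 8 + 2] == 0xFFFF0000); // inside the rectangle
    assert(fb.framebufferData[0] == 0xFF000000);         // outside, untouched
}
```

Whether this gives a real gain would need benchmarking. For small rectangles, the cost of handing work to the task pool may well outweigh the per-row work.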