Do you want to use a ubyte instead of a byte here?

Yes, that was a silly mistake. Fixing it removed the need for all the masking operations, which gave the biggest speedup.

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range
that can be stored in a ubyte.

I think my logic is correct. The calculations are done with ints, and only the final result is cast to a byte. That result always fits: the largest possible sum is 255 * 257 = 65535, and 65535 >> 8 = 255. The reason for the +1 is the >> 8, which divides by 256.
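To make the range argument concrete, here is a minimal sketch of the same arithmetic for a single channel. The helper name blendChannel is hypothetical and only serves to isolate the calculation; it is not part of the code in this thread.

```d
// Hypothetical helper: blends one channel with the same integer
// arithmetic as the drawing loop (alpha + 1, 257 - alpha, >> 8).
ubyte blendChannel(ubyte fg, ubyte bg, ubyte a)
{
  immutable uint alpha = a + 1;          // 1 .. 256
  immutable uint invAlpha = 257 - alpha; // 256 .. 1

  // alpha + invAlpha is always 257, so the sum is at most
  // 255 * 257 = 65535 and the shifted result is at most 255.
  return cast(ubyte)((alpha * fg + invAlpha * bg) >> 8);
}
```

With a = 0 the background comes through unchanged, with a = 255 the foreground does, and the intermediate invAlpha of 256 never has to be stored in a ubyte.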

class Framebuffer
{
  // Field names match the accesses in drawRectangle (framebuffer.data, framebuffer.width).
  uint[] data;
  uint width;
  uint height;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color)
{
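  // View the packed color as four bytes; fg[3] is the most significant byte on a little-endian machine.
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;      // 1 .. 256
  immutable uint invAlpha = 257 - alpha; // 256 .. 1
  // Premultiply the foreground channels once, outside the pixel loop.
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];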

  foreach (i; y .. y + height)
  {
    uint start = x + i * framebuffer.width;

    foreach(j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}

Can this be made faster with SIMD? (I don't know much about it; maybe the data and algorithm don't fit it?)

Can this be parallelized with any real gains?
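For reference, this is roughly what a row-parallel version could look like with std.parallelism. It is only a sketch: drawRectangleParallel is a hypothetical name, the Framebuffer field names are assumed from the accesses in drawRectangle, and the byte indexing assumes a little-endian machine. Whether it gains anything would have to be measured, since small rectangles may not cover the cost of handing work to the task pool.

```d
import std.parallelism : parallel;
import std.range : iota;

// Assumed layout: field names follow framebuffer.data / framebuffer.width.
class Framebuffer
{
  uint[] data;
  uint width;
  uint height;
}

// Hypothetical variant of drawRectangle: same blend, but the rows are
// distributed over the default task pool. Each row touches a disjoint
// slice of the framebuffer, so the rows do not race with each other.
void drawRectangleParallel(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color)
{
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;
  immutable uint invAlpha = 257 - alpha;
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];

  foreach (i; parallel(iota(y, y + height)))
  {
    immutable uint start = x + i * framebuffer.width;

    foreach (j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}
```

The inner loop is unchanged, so any difference in timing comes only from how the rows are scheduled.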
