Hi,

We have been profiling FFmpeg at Microsoft and found that ff_xyz12Torgb48 has a high sample count (sampled every 1 ms).
ff_xyz12Torgb48 appears to pay a performance penalty in three places:

1. Unaligned read and write accesses
2. Repeated access to xyz2rgb_matrix
3. Multiplication

I would like to optimize this code. Is there an existing optimized version of this function, or a recommended approach to improving it? I could move the repeated accesses to xyz2rgb_matrix out of the inner loop and load a full cache line at once to extract the X, Y, and Z values more efficiently, but I wanted to get some initial feedback before proceeding further.

File: FFmpeg/libswscale/swscale.c at a79720e10f30e9fd18bd78242ce96dde06461343
<https://github.com/FFmpeg/FFmpeg/blob/a79720e10f30e9fd18bd78242ce96dde06461343/libswscale/swscale.c#L739>

    void ff_xyz12Torgb48(const SwsInternal *c, uint8_t *dst, int dst_stride,
                         const uint8_t *src, int src_stride, int w, int h)
    {
    .......... Unaligned read .......................
        x = AV_RL16(src16 + xp + 0);
        y = AV_RL16(src16 + xp + 1);
        z = AV_RL16(src16 + xp + 2);

    .......... DRAM Access and multiply .......................
        // convert from XYZlinear to sRGBlinear
        r = c->xyz2rgb_matrix[0][0] * x +
            c->xyz2rgb_matrix[0][1] * y +
            c->xyz2rgb_matrix[0][2] * z >> 12;
        g = c->xyz2rgb_matrix[1][0] * x +
            c->xyz2rgb_matrix[1][1] * y +
            c->xyz2rgb_matrix[1][2] * z >> 12;
        b = c->xyz2rgb_matrix[2][0] * x +
            c->xyz2rgb_matrix[2][1] * y +
            c->xyz2rgb_matrix[2][2] * z >> 12;

    .......... RMW Access .......................
        AV_WL16(dst16 + xp + 0, c->rgbgamma[r] << 4);
        AV_WL16(dst16 + xp + 1, c->rgbgamma[g] << 4);
        AV_WL16(dst16 + xp + 2, c->rgbgamma[b] << 4);

Regards,
Chitra
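
P.S. To make the proposal concrete, here is a rough, standalone C sketch of hoisting the matrix coefficients out of the inner loop. It is not the FFmpeg code: XyzCtx, rl16, wl16, clip12 and xyz12_to_rgb48_sketch are hypothetical names, the byte-wise helpers stand in for AV_RL16/AV_WL16, and the input linearisation that the excerpt elides is skipped. I am also assuming 12-bit samples in the upper bits of each 16-bit word, a 3x3 matrix in 4.12 fixed point, and a 4096-entry output gamma table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SwsInternal fields used by the conversion. */
typedef struct XyzCtx {
    int16_t  xyz2rgb_matrix[3][3]; /* assumed 4.12 fixed point (4096 == 1.0) */
    uint16_t rgbgamma[4096];       /* assumed 12-bit output gamma table */
} XyzCtx;

/* Byte-wise little-endian 16-bit load/store; valid for unaligned pointers. */
static inline unsigned rl16(const uint8_t *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

static inline void wl16(uint8_t *p, unsigned v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
}

/* Clamp to 12 bits so the gamma table lookup stays in bounds. */
static inline int clip12(int v)
{
    return v < 0 ? 0 : (v > 4095 ? 4095 : v);
}

static void xyz12_to_rgb48_sketch(const XyzCtx *c, uint8_t *dst, int dst_stride,
                                  const uint8_t *src, int src_stride, int w, int h)
{
    /* Read the nine coefficients once, so the inner loop does not have to
     * reload them through c after every byte store to dst. */
    const int m00 = c->xyz2rgb_matrix[0][0], m01 = c->xyz2rgb_matrix[0][1],
              m02 = c->xyz2rgb_matrix[0][2];
    const int m10 = c->xyz2rgb_matrix[1][0], m11 = c->xyz2rgb_matrix[1][1],
              m12 = c->xyz2rgb_matrix[1][2];
    const int m20 = c->xyz2rgb_matrix[2][0], m21 = c->xyz2rgb_matrix[2][1],
              m22 = c->xyz2rgb_matrix[2][2];
    const uint16_t *gamma = c->rgbgamma;

    assert(w >= 0 && h >= 0);

    for (int yp = 0; yp < h; yp++) {
        const uint8_t *s = src + (ptrdiff_t)yp * src_stride;
        uint8_t       *d = dst + (ptrdiff_t)yp * dst_stride;

        for (int xp = 0; xp < w; xp++, s += 6, d += 6) {
            /* Three 16-bit samples per pixel, reduced to 12 bits. */
            const int x = (int)(rl16(s + 0) >> 4);
            const int y = (int)(rl16(s + 2) >> 4);
            const int z = (int)(rl16(s + 4) >> 4);

            /* 3x3 fixed-point matrix multiply, then clamp to 12 bits. */
            const int r = clip12((m00 * x + m01 * y + m02 * z) >> 12);
            const int g = clip12((m10 * x + m11 * y + m12 * z) >> 12);
            const int b = clip12((m20 * x + m21 * y + m22 * z) >> 12);

            /* Gamma lookup, then scale back up to 16 bits. */
            wl16(d + 0, (unsigned)gamma[r] << 4);
            wl16(d + 2, (unsigned)gamma[g] << 4);
            wl16(d + 4, (unsigned)gamma[b] << 4);
        }
    }
}
```

Whether this helps is something only a fresh profile can tell. My guess is that the byte-typed stores to dst may alias *c as far as the compiler is concerned, which would force the coefficients to be reloaded on every pixel in the current code, but that is an assumption I have not verified against the generated assembly.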