Hi Jerry,
> With these changes, OK for trunk?
Just going over this with a fine-tooth comb... One thing just struck me: the loop variables should be index_type, so

  const index_type m = xcount, n = ycount, k = count;

[...]

  index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset,
    i1, i2, i3, i4, i5, i6;

  /* Local variables */
  GFC_REAL_4 t1[65536], /* was [256][256] */
    f11, f12, f21, f22, f31, f32, f41, f42,
    f13, f14, f23, f24, f33, f34, f43, f44;
  index_type i, j, l, ii, jj, ll;
  index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit separately. When we do that, we should think about -Os. With the buffering, the library function uses much more memory, so if -Os is in force, we should also consider raising the limit for inlining.

Since I was involved in the development, I would like to give others a few days to raise further comments. If there are none, OK to commit with the above change within a few days. Of course, somebody else might also OK this patch :-)

Regards

Thomas