Junchao Zhang <[email protected]> writes:
> I expected TACO was better since its website says "It uses novel compiler
> techniques to get performance competitive with hand-optimized kernels"
The TACO generated code is basically the same, modulo cosmetic partitioning
into groups of 32 rows.
#pragma omp parallel for schedule(runtime)
for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
for (int32_t i1 = 0; i1 < 32; i1++) {
int32_t i = i0 * 32 + i1;
if (i >= A1_dimension)
continue;
double tjy_val = 0.0;
for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
int32_t j = A2_crd[jA];
tjy_val += A_vals[jA] * x_vals[j];
}
y_vals[i] = tjy_val;
}
}
As Matt says, this is mostly a bandwidth test. You can do somewhat better for
matrices with similar row lengths using SELL (which enables vector
instructions), but it's secondary to bandwidth for matrix values and cache
locality for the vector.
With a graph and a row-based algorithm, I would partition it using METIS or
similar, then apply an RCM ordering within each block. This will give better
locality of access to the vector x, which has been known (since at least the
PETSc-FUN3D work in the late 90s) to be a major factor in SpMV performance.
Working in naive orderings is one of the reasons the OSKI work was able to
claim significant performance benefits while having virtually no impact on
well-optimized applications (which choose a good ordering before forming their
matrices).