cpumemory.pdf - cache optimized matrix multiplication
I'm reading cpumemory.pdf from Ulrich Drepper and I'm unable to understand
following part about optimizing cache access in matrix multiplication from
chapter 6.2.1 (page 49-50):
First naive method for matrix multiplication is shown:
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
for (k = 0; k < N; ++k)
res[i][j] += mul1[i][k] * mul2[k][j];
mul2 is accessed by columns so for each column one cache is wasted. Ulrich
says:
With sizeof(double) being 8 this means that, to fully utilize the cache
line, we should unroll the middle loop 8 times.
For brevity I unrolled middle loop only 2 times.
for (i = 0; i < N; ++i)
for (j = 0; j < N; j += 2)
for (k = 0; k < N; ++k) {
res[i][j+0] += mul1[i][k] * mul2[k][j+0];
res[i][j+1] += mul1[i][k] * mul2[k][j+1];
}
Now it's obvious that if cache line is 2 double values wide it'll be fully
utilized. But then Ulrich continues:
Continuing this thought, to effectively use the res matrix as well, i.e.,
to write 8 results at the same time, we should unroll the outer loop 8
times as well.
For brevity I unrolled outer loop only 2 times again.
for (i = 0; i < N; i += 2)
for (j = 0; j < N; j+=2)
for (k = 0; k < N; ++k) {
res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
res[i+1][j+0] += mul1[i+1][k] * mul2[k][j+0];
res[i+1][j+1] += mul1[i+1][k] * mul2[k][j+1];
}
To me it seems even worse than previous version because now mul1 is
accessed by columns. Please explain what Ulrich meant.
No comments:
Post a Comment