Name: Cholesky
Contact Person: OmpSs@FPGA Team, [email protected]
License Agreement: GPL 3.0
Platform: OmpSs@FPGA+IMP+OMPIF
The Cholesky benchmark decomposes a symmetric positive-definite square matrix $A$ into a lower-triangular matrix $L$ that satisfies the following equation: $A = LL^T$.
The matrix is stored in column-major order for two reasons.
First, the reference kernel is potrf from the LAPACK library, which is implemented in Fortran and therefore expects column-major data.
Second, this layout is useful for the FPGA implementation because it lets us optimize the code better; more details in the respective section.
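As a quick illustration of what column-major means at the block level (the helper name here is just for the example, not code from the benchmark):

```c
// Column-major layout: the elements of one column are contiguous in memory,
// which is what Fortran (and therefore LAPACK's potrf) expects.
// Element (row, col) of a bs x bs block lives at index col * bs + row.
static inline float block_get(const float *block, int bs, int row, int col) {
    return block[col * bs + row];
}
```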
To parallelize Cholesky using tasks, we distribute the matrix into multiple square blocks, and each task operates at the block level. There are 4 kernels in total:
```c
static const int BS = ...;

#pragma oss task inout(A)
void potrf(float A[BS][BS]);

#pragma oss task in(A) inout(B)
void trsm(float A[BS][BS], float B[BS][BS]);

#pragma oss task in(A) in(B) inout(C)
void gemm(float A[BS][BS], float B[BS][BS], float C[BS][BS]);

#pragma oss task in(A) inout(B)
void syrk(float A[BS][BS], float B[BS][BS]);

void cholesky_blocked(const int nt, float *A[nt][nt])
{
    for (int k = 0; k < nt; k++) {
        potrf( A[k][k] );
        for (int i = k+1; i < nt; i++) {
            trsm( A[k][k],
                  A[k][i] );
        }
        for (int i = k+1; i < nt; i++) {
            for (int j = k+1; j < i; j++) {
                gemm( A[k][i],
                      A[k][j],
                      A[j][i] );
            }
            syrk( A[k][i],
                  A[i][i] );
        }
    }
}
```
You will see that this code is not an exact copy of the one in the cholesky.c file; it is simplified to make it more readable and easier to understand.
cholesky_blocked takes two parameters: the number of tiles (blocks) nt and an array of pointers A.
All kernels take as parameters two-dimensional arrays of a fixed size BS, the block size, so they expect each block to be contiguous in memory.
Array A is a matrix of pointers to the blocks.
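To make this layout concrete, here is a minimal sketch (hypothetical helper, not code from the benchmark) of how such a matrix of block pointers could be built, with every block stored as one contiguous buffer:

```c
#include <stdlib.h>

// Hypothetical helper: build an nt x nt matrix of block pointers, where every
// block is one contiguous bs x bs buffer (accessed in column-major order).
float **alloc_blocked_matrix(int nt, int bs) {
    float **A = malloc((size_t)nt * nt * sizeof(float *));
    for (int i = 0; i < nt * nt; i++)
        A[i] = calloc((size_t)bs * bs, sizeof(float)); // one contiguous block
    return A; // block (i, j) is A[i * nt + j]
}
```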
potrf is the Cholesky decomposition of a single block, defined in the LAPACK library.
The others, trsm, gemm and syrk, are defined in the BLAS spec.
Short descriptions from the BLAS webpage:

- trsm: solves a triangular matrix equation with multiple right-hand sides
- gemm: matrix-matrix multiply
- syrk: symmetric rank-k update of a matrix
The following animation shows, on the right, the task graph generated by a 4x4 block matrix (for any block size).
On the left is the 4x4 block matrix: the colored block is the output (the inout parameter), and the grey highlighted blocks are the inputs (the input parameters).
And here is the final task graph:
The idea is simple: assign each block to a rank, and each task is executed by the rank that has its inout block assigned to it.
If there are input blocks assigned to other ranks, IMP takes care of moving the data.
However, in this benchmark there is a problem with this straightforward implementation: it appears when a single block is copied multiple times to the same rank.
There is a cache implementation of IMP that memorizes previous copies of blocks and optimizes the ones that hit in the cache.
The amount of optimization depends on several cache parameters, like its size and replacement policy.
In this repository there is another implementation that optimizes all copies in the user code.
The reasoning behind these optimizations is explained in a later section.
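To illustrate the idea behind that cache (a conceptual sketch only, with made-up names; it is not IMP's actual implementation):

```c
#include <stdbool.h>

// Conceptual sketch: remember which (block, destination rank) copies were
// already performed so that a repeated copy can be skipped. A real cache has
// a bounded size and a replacement policy, which is what limits how many of
// the repeated copies actually hit and get optimized.
#define SENT_ENTRIES 64

typedef struct { const void *block; int dest_rank; bool valid; } sent_entry_t;
static sent_entry_t sent_cache[SENT_ENTRIES];

static bool already_sent(const void *block, int dest_rank) {
    for (int e = 0; e < SENT_ENTRIES; e++)
        if (sent_cache[e].valid && sent_cache[e].block == block &&
            sent_cache[e].dest_rank == dest_rank)
            return true; // hit: this block was already copied to that rank
    return false;
}
```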
Assuming that a block is only sent once, the following image shows the task graph of a 3x3 block matrix, and the same task graph when distributing it between two ranks.
The rightmost figure is the sequence of sends/receives between rank 0 (left column of matrices) and rank 1 (right column of matrices).
The numbers in the matrices indicate the rank assigned to each block, and the figure also shows which blocks end up present in both ranks.
We use a cyclic distribution for both rows and columns.
For example, the block produced by the trsm task labeled 1 in the graphs, which is assigned to rank 1, is needed later by a syrk task on rank 0, so it has to be sent from rank 1 to rank 0.
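The exact assignment formula is not reproduced here; as an assumption consistent with the cyclic distribution and with the per-rank anti-diagonal storage described later, a block $(i, j)$ could be mapped to a rank like this:

```c
// ASSUMPTION: one plausible cyclic block-to-rank mapping, in which consecutive
// blocks of a row or column land on consecutive ranks and a whole
// anti-diagonal belongs to a single rank. It is not guaranteed to be the
// exact formula used by the benchmark.
static inline int block_owner(int i, int j, int nranks) {
    return (i + j) % nranks;
}
```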
As we explained before, this implementation optimizes the copies in the user code. It is easier to do it this way because the user knows how the tasks are created and how the blocks are distributed in memory. The latter is essential to get the optimal number of sends/receives. Let's go in order through every possible copy in each kernel:
There are no input blocks, so we will never have communication with this kernel.
There is one input block, the diagonal block produced by the potrf task, and one output block per iteration of the trsm loop.
The numbers on each position of the column represent the rank assigned to that block once the trsm tasks have been created.
We can see that rank 1 repeats after creating 3 tasks, because there are 3 ranks and the distribution is cyclic.
The same reasoning applies to every rank, so we know that the assigned rank repeats after creating as many tasks as there are ranks.
This one is a little trickier, but we use the same reasoning as before.
In fact, all the optimizations are based on the fact that the same block is sent to consecutive ranks, so after as many trsm tasks as there are ranks, every rank already holds its copy of the block.
First, let's look at the first input dependence, over a block produced by a trsm task.
It is the same case as in trsm, but with the loop variable of the gemm loop, since that is the variable that changes between consecutive tasks.
The same reasoning applies to the extra variable of the gemm loop, because several gemm tasks are created for the same input block.
The second dependence is over a block that is also used by other gemm and syrk tasks.
The value of the condition is constant for the whole inner loop.
Here we see all the tasks that have as input the same block of the second dependence of a gemm
task for a 4x4 block matrix with 4 ranks (showing only the relevant blocks).
First, that block is also the first dependence of other gemm tasks, and it is the input of a syrk task as well.
In summary, since we know how the blocks are distributed, we can count for each gemm task how many of its dependencies may actually need communication:
- If $i-(k+1) < n$, we know that $j-(k+1) < n$, so both dependencies may communicate (2 data owners).
- If $i-(k+1) \geq n$ and $j-(k+1) < n$, only the first dependence may communicate (1 data owner).
- If $i-(k+1) \geq n$ and $j-(k+1) \geq n$, there is no communication (0 data owners).
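A small sketch of these three cases (assuming here that $n$ is the number of ranks and $i$, $j$, $k$ are the loop indices of cholesky_blocked; the helper name is made up):

```c
// Number of dependencies of a gemm task that may still need communication
// (0, 1 or 2 "data owners"), following the three cases above. Since j < i,
// the second dependence can only communicate when the first one does too.
static inline int gemm_data_owners(int i, int j, int k, int n) {
    int first  = (i - (k + 1)) < n; // first input block may still be remote
    int second = (j - (k + 1)) < n; // second input block may still be remote
    return first + second;
}
```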
This one is as easy as trsm.
syrk has one input block, the one produced by the trsm task of the same row, which is also the first dependence of the gemm tasks with the same row index, so the same reasoning applies.
The idea we use is very simple, but complex to implement: every rank allocates in memory only its proportional part of the matrix.
I.e. each rank stores just the blocks it owns. In the figure, the numbers represent matrix element coordinates.
This memory layout is much better for the Cholesky algorithm. Every block is stored contiguously in memory, by columns. Therefore, with a single pointer to the block, we can access it as a small isolated 2x2 matrix without knowing the real dimensions of the original matrix. Now, let's see how every block is assigned to each rank. From now on, we only represent blocks in the figures, so each square is a full block, not an element of the matrix.
In this example, we have the lower triangle of a 4x4 block matrix. The numbers represent the rank that owns each block, as do the colors. The first question is how to lay out the blocks in each rank's memory: after the first block, what should come next, the block to the right, below, to the left? The general idea is to store one anti-diagonal of blocks after the other, from top-left to bottom-right, and to store each anti-diagonal from bottom-left to top-right.
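As a sketch of this ordering (assuming anti-diagonal $d$ contains the lower-triangle blocks $(i, j)$ with $i + j = d$; this only illustrates the described order, it is not the benchmark's code):

```c
#include <stdio.h>

// Enumerate the lower-triangle blocks of an n x n block matrix in the storage
// order described above: one anti-diagonal after the other (top-left to
// bottom-right), each anti-diagonal from bottom-left (largest row) to
// top-right (closest to the main diagonal).
void print_block_order(int n) {
    int offset = 0;
    for (int d = 0; d <= 2 * (n - 1); d++) {      // anti-diagonal index
        int i_first = d < n ? d : n - 1;          // bottom-left block of d
        int i_last  = (d + 1) / 2;                // block touching the diagonal
        for (int i = i_first; i >= i_last; i--) {
            int j = d - i;                        // i + j == d
            printf("offset %2d -> block (%d,%d)\n", offset++, i, j);
        }
    }
}
```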
This figure represents a bit better the memory layout of the previous example with 3 ranks.
Now, the numbers on each block represent the block coordinates, and the color represents the owner rank.
As you can see, in memory each rank only stores the blocks it owns, in the aforementioned order.
This memory layout is very useful because it allows us to compute the memory address of any block from its block coordinates with a generic formula.
For example, for rank 0, with just a few arithmetic operations we can compute where any of its blocks is stored.
For that, we start with how to calculate the number of blocks in a given anti-diagonal.
This is a bigger example.
The numbers represent the anti-diagonal index of each block.
One solution to this problem is to subtract the extra part we don't want to count.
We will call this part the mirror triangle, because you can see it as a projection of the matrix on a mirror (although it is not an exact projection).
In this case, the number of extra blocks we are counting grows by 1 between consecutive anti-diagonals.
Therefore, the resulting formula is:
- $d/2 + 1$ if $d \leq n$.
- $d/2 + 1 - (d-n+1)$ if $d > n$.
Now, we can use this formula to calculate how many blocks are stored from anti-diagonal 0 up to a given anti-diagonal.
In this summation,
However, this simplification introduces a new problem.
This fraction doesn't truncate the decimals of the divisions inside the summation.
I.e. even when truncating the division by 4, the formula is still counting decimals that we don't want to include.
For example, when
Now, we can go for the final formula when
Result after simplification is:
The only part left is taking into account that there are multiple ranks. We apply the same idea, but we have to change some parts of the formula. Imagine the previous example with 3 ranks; on rank 1 we would have:
In this case, we can see that the summation doesn't start at anti-diagonal 0; it starts at the rank index $r$.
This time,
- If $r$ is even and $s$ is even, we are only adding numbers that are multiples of 2, so the correction factor is $0$.
- If $r$ is even and $s$ is odd, we have the first explained case and the correction factor is $(d+1)/4$.
- If $r$ is odd and $s$ is even, we are always adding numbers that are not multiples of 2, so each adds a $0.5$. The correction factor is $(d+1)/2$.
- If $r$ is odd and $s$ is odd, we have the second explained case and the correction factor is $(d+2)/4$.
From now on, we call the correction factor
We only have one case left, when
- If $r \geq (n \bmod s)$, the distance is $r - (n \bmod s)$.
- If $r < (n \bmod s)$, the distance is $r + s - (n \bmod s)$.
We call this distance, or the mirror triangle offset,
- If $r \geq (n \bmod s)$, the number of anti-diagonals in rank $r$ is $n/s$.
- If $r < (n \bmod s)$, the number of anti-diagonals in rank $r$ is $n/s + 1$.
Now we can have the final formula, taking into account that
And the final formula without summations:
With this, we can get from the global block coordinates
- If $d \geq n$ (with both in global coordinates), the offset is $n-1-i$.
- If $d < n$ (with both in global coordinates), the offset is $j$.
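A minimal sketch of this last step (assuming $n$ is the number of block rows/columns, $i$ and $j$ are global block coordinates and $d = i + j$; the function name is made up):

```c
// Offset of block (i, j) inside its anti-diagonal d = i + j, following the
// two cases above (each anti-diagonal is stored from bottom-left to top-right).
static inline int antidiag_offset(int i, int j, int n) {
    int d = i + j;
    return (d >= n) ? (n - 1 - i) : j;
}
```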
The potrf kernel FPGA implementation is not very interesting.
It doesn't give much margin for optimization due to the access patterns and the dependencies between iterations.
There are pipelining directives, but the initiation interval can't reach 1.
To give it a little boost, the loops are manually unrolled by a factor of 2.
However, the total execution time of potrf compared to the rest is very small, so we didn't dedicate much effort to optimizing this kernel.
The trsm kernel solves a triangular matrix equation, storing the result in the output block.
The second loop is a nested loop.
It is pipelined with an initiation interval of 4 (it could be 1, but this way we reduce the resource usage of this kernel), and the inner loop is unrolled implicitly by the pipeline directive, working on a temporary array tmp_row.
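To illustrate the point about the pipeline directive implicitly unrolling the inner loop, here is a generic HLS-style sketch (not the benchmark's actual trsm code; names, bounds and the loop body are placeholders):

```c
#define BS 64 // block size, value chosen just for this example

// When the outer loop is pipelined, fixed-bound loops nested inside it are
// fully unrolled by the HLS tool, so all BS elements of the inner loop are
// processed within each outer-loop iteration.
void scale_columns(float B[BS][BS], const float factors[BS]) {
    for (int j = 0; j < BS; j++) {
#pragma HLS pipeline II=4
        for (int i = 0; i < BS; i++) { // implicitly unrolled
            B[j][i] *= factors[j];
        }
    }
}
```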
This is a regular matrix multiply in column-major order; its input blocks are the ones produced by the trsm tasks.
This one is very similar to gemm, but it multiplies a matrix by its own transpose, producing a symmetric result.
Also, the code only updates the lower-triangular half of the block.
However, you will see that the innermost loop always runs over the full block size ts instead of starting at the index that would skip the upper half.
Sadly, there is no official support in the clang compiler for OMPIF and IMP, so we have to split the FPGA and host parts manually. The host code is under src.
In cholesky.c you will find the definitions of all kernels, the FPGA code, and a call to OpenBLAS or MKL depending on some defines.
That code is not used; it is there because the compiler needs a definition of the functions.
At execution time, the code searches for the FPGA accelerators, which are implemented in HLS under the hls directory.
So, first you need an OmpSs@FPGA installation with broadcaster support in both Nanos6 and clang. Also, you need the AIT version that implements the Ethernet subsystem and OMPIF support. For the moment these versions are neither official nor public; you can ask for access at [email protected].
In summary, to compile the host executable run make.
To generate the bitstream run make ait.
By default, it generates the bitstream starting from the HLS phase, but you can change the start and finish phases with the Makefile variables FROM_STEP and TO_STEP.
Besides that, there are other variables:
- FPGA_CLOCK: Frequency in MHz at which the accelerators will run.
- FPGA_MEMPORT_WIDTH: Data bit-width of the memory port of all the accelerators. A wider port may provide more bandwidth (depending on the FPGA memory path), at the cost of more resource usage. IMPORTANT: This variable is intended to be used in the task pragma. Since there is no compiler support, if you want to change the port width you have to modify the HLS code of each kernel individually. By default it is 256.
- BLOCK_SIZE: Size of one of the dimensions of the block matrices, which are square. IMPORTANT: The block size affects both the host and the HLS code, so if you modify the variable, you have to apply the changes manually in all the HLS codes.
- SYRK_NUM_ACCS: Number of syrk accelerators. IMPORTANT: If you want to change this variable, change the num_instances field of the ait_extracted.json file. Usually this file is generated by clang, but since we do not have that support with OMPIF or IMP, we have to modify the file manually.
- GEMM_NUM_ACCS: Number of gemm accelerators. IMPORTANT: Same as SYRK_NUM_ACCS.
- TRSM_NUM_ACCS: Number of trsm accelerators. IMPORTANT: Same as SYRK_NUM_ACCS.
- POTRF_SMP: This variable must always be 0 for this version of the benchmark. It is used to offload the potrf kernel to the CPU from the FPGA, but this is not supported in the distributed version with many FPGAs.
- FPGA_GEMM_II: The initiation interval of the gemm kernel. Setting it to 1 achieves the best performance but uses the most resources, because it fully unrolls the innermost loop, which depends on the block size (an entire block column). By default it is 2, which halves performance but also reduces resource usage by approximately the same factor. IMPORTANT: This variable is intended to be used in the FPGA code of cholesky.c, but since we don't use it, you have to change the corresponding variable in omp_gemm.cpp.
- FPGA_OTHER_II: The initiation interval of the trsm and syrk kernels. The same reasoning as for FPGA_GEMM_II applies to this variable for trsm and syrk. We keep them separate because the most critical kernel is gemm, so we prefer to give more resources to that kernel and use a higher II for the others. IMPORTANT: As with FPGA_GEMM_II, you have to change the corresponding variable in omp_trsm.cpp and omp_syrk.cpp.