Mpi matrix determinant
Contents |
Description
For this project I implemented a parallel version to calculate the determinant of a nxn matrix. The program creates a random array of nxn numbers, distributes the array across all the nodes of the cluster and then calculates the determinant. The program also calculates the processing time, communication time and overall time to find the determinant. It outputs the results in either a human readable form, or in a tab separated output.
Source code
You can download the source code of the application here: https://github.com/mvictoras/MPIDeterminant
git clone https://github.com/mvictoras/MPIDeterminant
Compilation
I have included a Makefile to compile the project.
make
Running
The project can be run on ARGO, that is a University of Illinois at Chicago cluster available for students and staff. If you are on ARGO (or any cluster that processes are submitted using TORQUE), you need to run:
./submit.sh -n <# of numbers> -p <# of nodes> -k <# of processors per node> -t <formulation: ring, mesh>
If your cluster allows you to run directly programs (without process queues, then you run:
mpirun -np <num_of_nodes> determinant -n <# of numbers> -p <# of nodes> -t <formulation: ring, mesh> [-c]
All the arguments are mandatory. The -c option is optional. By default the program prints the results into a human readable format, -c outputs in a tab separated format, so that the data can be easily read and processed by other processes.
Examples for running on ARGO
Create 2520x2520 random numbers and use 1 processor per 1 node (total of 1 node). The formulation used is ring.
./submit.sh -n 2520 -p 1 -k 1 -t ring
Create 5040x5040 random numbers and use 3 nodes with 3 processors per node (total 9 nodes). The formulation used is mesh.
./submit.sh -n 5040 -p 3 -k 3 -t mesh
Implementation Documentation
To find the determinant, the program initial performs LU decomposition. After the LU decomposition, the determinant is the product of the U(i,i) elements. I have implemented two formulations to find the determinant: Ring, Mesh
Ring Formulation (1-D partitioning)
In this formulation I have implemented the 1-D partitioning algorithm for LU decomposition.
void one_d_partitioning(MPI_Comm *comm_new, float *A, int local_rank, int num_procs)
Random number generation and Distribution
The number of rows of the matrix is specified using the -n argument. A matrix of size nxn is created with randomly generated numbers (1 to 21). The array is created by a process and distributed to all processes. The whole nxn array is distributed.
float *generate_2d(MPI_Comm *comm_new, int local_rank, int num_procs, char *proc_name, int *elem_per_node);
Topology Definition
Since the matrix is divided row-wise, the topology I used is ring.
Initially create the topology:
int dims[1], periods[1]; dims[0] = *num_procs; periods[0] = 1; MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, comm_new);
Parameter Ranges and Results
| n | Total nodes | p | k | Generate Array | Processing | Communication | Total |
|---|---|---|---|---|---|---|---|
| 2520 | 1 | 1 | 1 | 0.349549 | 65.355899 | 0.000022 | 65.70551 |
| 2520 | 2 | 2 | 1 | 0.374359 | 65.602441 | 58.433397 | 124.168989 |
| 2520 | 3 | 3 | 1 | 0.350964 | 66.063733 | 113.249997 | 179.056975 |
| 2520 | 4 | 4 | 1 | 0.376253 | 66.24056 | 165.235964 | 231.001601 |
| 2520 | 5 | 5 | 1 | 0.351781 | 67.03272 | 221.333539 | 287.641028 |
| 2520 | 6 | 6 | 1 | 0.355156 | 67.555373 | 275.556066 | 342.141924 |
| 2520 | 7 | 7 | 1 | 0.377779 | 67.227621 | 325.870324 | 391.912334 |
| 5040 | 1 | 1 | 1 | 1.402526 | 519.777896 | 0.000022 | 521.180486 |
| 5040 | 2 | 2 | 1 | 1.50208 | 521.905913 | 460.704307 | 983.147034 |
| 5040 | 3 | 3 | 1 | 1.497728 | 522.088315 | 882.995594 | 1404.194193 |
| 5040 | 4 | 4 | 1 | 1.502224 | 523.633422 | 1293.019398 | 1814.801965 |
| 5040 | 5 | 5 | 1 | 1.660403 | 524.930325 | 1705.690324 | 2227.902069 |
| 5040 | 6 | 6 | 1 | 1.563033 | 529.768179 | 2131.055101 | 2657.114536 |
| 5040 | 7 | 7 | 1 | 1.579967 | 529.385093 | 2532.406406 | 3057.130927 |
| 10080 | 1 | 1 | 1 | 5.997568 | 4155.797147 | 0.000014 | 4161.794754 |
| 10080 | 2 | 2 | 1 | 6.15315 | 4154.163903 | 3655.624943 | 7810.248996 |
| 10080 | 3 | 3 | 1 | 5.987953 | 4199.441065 | 7049.770938 | 11245.641297 |
| 10080 | 4 | 4 | 1 | 5.600826 | 4202.801754 | 10316.96662 | 14511.936271 |
| 10080 | 5 | 5 | 1 | 5.993899 | 4162.393357 | 13416.567312 | 17567.626904 |
| 10080 | 6 | 6 | 1 | 6.3127 | 4170.364882 | 16624.589147 | 20780.218372 |
| 10080 | 7 | 7 | 1 | 6.194211 | 4177.857107 | 19824.677576 | 23983.717865 |
Analysis
Total execution time



Total Aggregated Time



The problem with this formulation is that each node needs to wait for the upper node to finish the processing. This leads to a serial execution of the nodes. By looking at the charts it is obvious that the processing time does not change as more nodes are needed, instead the communication time increases.
Mesh Formulation (2-D partitioning)
In this formulation I have implemented the 2-D partitioning algorithm for LU decomposition.
void one_d_partitioning(MPI_Comm *comm_new, float *A, int local_rank, int num_procs)
Random number generation and Distribution
The number of rows of the matrix is specified using the -n argument. A matrix of size nxn is created with randomly generated numbers (1 to 21). The array is created by the root process and divided across processes. Each process receives n / sqrt(p) rows and n / sort(p) columns. The initial generated array is deleted before the processing begins.
float *generate_mesh(MPI_Comm *comm_new, int local_rank, int num_procs, char *proc_name, int *elem_per_node);
Topology Definition
Since the matrix is divided amongst processes, mesh topology is used. Each node processes his local elements and then sends data to the right and bottom nodes only.
Initially create the topology:
int *dims, i, *periods, nodes_per_dim; MPI_Comm_size(MPI_COMM_WORLD, num_procs); int dimension = 2; nodes_per_dim = (int) sqrt( (double) *num_procs ); dims = (int *) malloc(sizeof(int) * dimension); periods = (int *) malloc(sizeof(int) * dimension); for( i = 0; i < dimension; i++ ) { dims[i] = nodes_per_dim; periods[i] = 0; } MPI_Cart_create(MPI_COMM_WORLD, dimension, dims, periods, 0, comm_new);
Parameter Ranges and Results
Since I am using a mesh topology, the number of nodes needs to be a perfect square. ARGO does not allow the use of more than 8 processor, so in order to use 9 and 16, I had to use logical nodes as well.
| n | Total nodes | p | k | Generate Array | Processing | Communication | Total |
|---|---|---|---|---|---|---|---|
| 2520 | 1 | 1 | 1 | 0.389633 | 2920.768867 | 0.059432 | 2921.332974 |
| 2520 | 4 | 4 | 1 | 0.389247 | 3334.076862 | 3033.499067 | 6370.000133 |
| 5040 | 1 | 1 | 1 | 1.499707 | 26317.789898 | 0.186295 | 26319.863721 |
| 5040 | 4 | 4 | 1 | 1.3995 | 21705.070296 | 25273.786826 | 46988.851786 |
| 5040 | 9 | 3 | 3 | 1.497859 | 27499.393874 | 21837.854611 | 49356.686977 |
| 5040 | 16 | 4 | 4 | 1.395927 | 41874.636627 | 28608.243295 | 70551.81067 |
| 10080 | 9 | 3 | 3 | 5.576703 | 212229.160292 | 185534.960151 | 397837.622208 |
| 10080 | 16 | 4 | 4 | 5.980275 | 368046.095901 | 671851.092841 | 1040038.610402 |
Analysis
Total execution time



Total Aggregated time



This formulation distributes the workload across the nodes. We can see that in the Total Execution charts. The processing time dominates the communication time and the more nodes we use, the faster the determinant is being computed. The processing time takes less time than in 1-D partitioning, since only parts of the array are distributed. Communication is also less, because only parts of the array are being transmitted and not the entire array. For 10080, p=4, k=4 it seems that while using more processors, it takes more time to find the determinant. This may happen because of other workload of the nodes.