MPI Matrix Determinant

Description

For this project I implemented a parallel version to calculate the determinant of a nxn matrix. The program creates a random array of nxn numbers, distributes the array across all the nodes of the cluster and then calculates the determinant. The program also calculates the processing time, communication time and overall time to find the determinant. It outputs the results in either a human readable form, or in a tab separated output.

Source code

You can download the source code of the application here: https://github.com/mvictoras/MPIDeterminant

git clone https://github.com/mvictoras/MPIDeterminant

Compilation

I have included a Makefile to compile the project.

make

Running

The project can be run on ARGO, a University of Illinois at Chicago cluster available to students and staff. If you are on ARGO (or any cluster where jobs are submitted through TORQUE), run:

./submit.sh 
-n <# of numbers> 
-p <# of nodes> 
-k <# of processors per node> 
-t <formulation: ring, mesh>

If your cluster allows you to run programs directly (without a job queue), run:

mpirun -np <num_of_nodes> determinant 
-n <# of numbers> 
-p <# of nodes> 
-t <formulation: ring, mesh> 
[-c]

All arguments are mandatory except -c. By default the program prints the results in a human-readable format; with -c it outputs a tab-separated format, so that the data can easily be read and processed by other programs.

Examples for running on ARGO

Create a 2520x2520 matrix of random numbers and use 1 processor on 1 node (1 process in total). The formulation used is ring.

./submit.sh -n 2520 -p 1 -k 1 -t ring

Create a 5040x5040 matrix of random numbers and use 3 nodes with 3 processors per node (9 processes in total). The formulation used is mesh.

./submit.sh -n 5040 -p 3 -k 3 -t mesh

Implementation Documentation

To find the determinant, the program first performs an LU decomposition. After the decomposition, the determinant is the product of the diagonal elements U(i,i). I have implemented two formulations to find the determinant: ring and mesh.
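
As a minimal illustration (not taken from the project source), once the factorization is stored in place, the determinant can be read off the diagonal:

/* Hedged sketch: A holds the packed LU factors of an n x n matrix
   (no pivoting); the determinant is the product of the U diagonal. */
double determinant_from_lu(const float *A, int n)
{
    double det = 1.0;
    int i;
    for (i = 0; i < n; i++)
        det *= A[i * n + i];   /* U(i,i) lies on the diagonal */
    return det;
}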

Ring Formulation (1-D partitioning)

In this formulation I have implemented the 1-D partitioning algorithm for LU decomposition.

void one_d_partitioning(MPI_Comm *comm_new, float *A, int local_rank, int num_procs)
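
To give an idea of what the 1-D partitioning does, here is a hedged sketch of block-row LU elimination. It broadcasts the pivot row for brevity, whereas the project's ring formulation communicates between neighbouring nodes instead; the function and variable names are illustrative, not the project's.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: each of the p processes owns n/p consecutive rows of A
   (row-major in local_A); p is assumed to divide n. */
void lu_1d_sketch(float *local_A, int n, int rank, int p, MPI_Comm comm)
{
    int rows = n / p;
    float *pivot_row = (float *) malloc(sizeof(float) * n);
    int k, i, j;

    for (k = 0; k < n; k++) {
        int owner = k / rows;                    /* process holding global row k */
        if (rank == owner)
            for (j = 0; j < n; j++)
                pivot_row[j] = local_A[(k - owner * rows) * n + j];

        MPI_Bcast(pivot_row, n, MPI_FLOAT, owner, comm);

        for (i = 0; i < rows; i++) {             /* update local rows below row k */
            int gi = rank * rows + i;
            if (gi <= k) continue;
            float l = local_A[i * n + k] / pivot_row[k];
            local_A[i * n + k] = l;              /* store the L factor in place */
            for (j = k + 1; j < n; j++)
                local_A[i * n + j] -= l * pivot_row[j];
        }
    }
    free(pivot_row);
}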

Random number generation and Distribution

The number of rows of the matrix is specified with the -n argument. A matrix of size n x n is created with randomly generated numbers (1 to 21). The array is created by one process and then distributed to all processes; every process receives the whole n x n array.

float *generate_2d(MPI_Comm *comm_new, int local_rank, int num_procs, char *proc_name, int *elem_per_node);
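
A minimal sketch of this step (the function and constant names here are illustrative, not the project's): the generating process fills the n x n array with values in the range 1 to 21, and the whole array is broadcast to every process.

#include <mpi.h>
#include <stdlib.h>

#define ROOT 0

float *generate_and_bcast(int n, int rank, MPI_Comm comm)
{
    int i;
    float *A = (float *) malloc(sizeof(float) * n * n);
    if (rank == ROOT)
        for (i = 0; i < n * n; i++)
            A[i] = (float) (rand() % 21 + 1);    /* random values 1..21 */
    MPI_Bcast(A, n * n, MPI_FLOAT, ROOT, comm);  /* every process gets the full array */
    return A;
}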

Topology Definition

Since the matrix is divided row-wise, I use a ring topology.

Initially create the topology:

int dims[1], periods[1];
 
dims[0] = *num_procs;   /* one dimension containing all processes */
periods[0] = 1;         /* periodic, so the topology wraps into a ring */
MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, comm_new);
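
Continuing from the snippet above (a hedged sketch, not from the project source), each process can then look up its ring neighbours with MPI_Cart_shift and forward rows around the ring:

int prev, next;
MPI_Cart_shift(*comm_new, 0, 1, &prev, &next);   /* dimension 0, displacement +1 */
/* because periods[0] = 1, the last process wraps around to rank 0,
   so data can be passed around the ring via prev and next */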


Parameter Ranges and Results

n      Total nodes  p  k  Generate Array  Processing   Communication  Total
2520   1            1  1  0.349549        65.355899    0.000022       65.70551
2520   2            2  1  0.374359        65.602441    58.433397      124.168989
2520   3            3  1  0.350964        66.063733    113.249997     179.056975
2520   4            4  1  0.376253        66.24056     165.235964     231.001601
2520   5            5  1  0.351781        67.03272     221.333539     287.641028
2520   6            6  1  0.355156        67.555373    275.556066     342.141924
2520   7            7  1  0.377779        67.227621    325.870324     391.912334
5040   1            1  1  1.402526        519.777896   0.000022       521.180486
5040   2            2  1  1.50208         521.905913   460.704307     983.147034
5040   3            3  1  1.497728        522.088315   882.995594     1404.194193
5040   4            4  1  1.502224        523.633422   1293.019398    1814.801965
5040   5            5  1  1.660403        524.930325   1705.690324    2227.902069
5040   6            6  1  1.563033        529.768179   2131.055101    2657.114536
5040   7            7  1  1.579967        529.385093   2532.406406    3057.130927
10080  1            1  1  5.997568        4155.797147  0.000014       4161.794754
10080  2            2  1  6.15315         4154.163903  3655.624943    7810.248996
10080  3            3  1  5.987953        4199.441065  7049.770938    11245.641297
10080  4            4  1  5.600826        4202.801754  10316.96662    14511.936271
10080  5            5  1  5.993899        4162.393357  13416.567312   17567.626904
10080  6            6  1  6.3127          4170.364882  16624.589147   20780.218372
10080  7            7  1  6.194211        4177.857107  19824.677576   23983.717865

Analysis

[Charts: total execution time and total aggregated time]

The problem with this formulation is that each node must wait for the node above it to finish processing, which leads to an essentially serial execution of the nodes. Looking at the charts, it is clear that the processing time does not change as more nodes are added; instead, the communication time increases.

Mesh Formulation (2-D partitioning)

In this formulation I have implemented the 2-D partitioning algorithm for LU decomposition.

void one_d_partitioning(MPI_Comm *comm_new, float *A, int local_rank, int num_procs)

Random number generation and Distribution

The number of rows of the matrix is specified with the -n argument. A matrix of size n x n is created with randomly generated numbers (1 to 21). The array is created by the root process and divided across the processes: each process receives n / sqrt(p) rows and n / sqrt(p) columns. The initially generated array is freed before the processing begins.

float *generate_mesh(MPI_Comm *comm_new, int local_rank, int num_procs, char *proc_name, int *elem_per_node);
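
As a hedged sketch of the bookkeeping involved (the variable names are illustrative): with a 2-D Cartesian communicator, each process can derive which block of the global matrix it owns from its grid coordinates.

int coords[2];
int block, row_start, col_start;
int nodes_per_dim = (int) sqrt( (double) num_procs );  /* processes per mesh dimension */

MPI_Cart_coords(*comm_new, local_rank, 2, coords);     /* my (row, column) in the mesh */
block     = n / nodes_per_dim;                         /* side length of my local block */
row_start = coords[0] * block;                         /* first global row I own */
col_start = coords[1] * block;                         /* first global column I own */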

Topology Definition

Since the matrix is divided amongst the processes, a mesh topology is used. Each node processes its local elements and then sends data only to the nodes to its right and below.

Initially create the topology:

int *dims, i, *periods, nodes_per_dim;
MPI_Comm_size(MPI_COMM_WORLD, num_procs);           /* total number of processes */
 
int dimension = 2;
nodes_per_dim = (int) sqrt( (double) *num_procs );  /* processes per row/column of the mesh */
dims = (int *) malloc(sizeof(int) * dimension);
periods = (int *) malloc(sizeof(int) * dimension);
for( i = 0; i < dimension; i++ ) {
  dims[i] = nodes_per_dim;
  periods[i] = 0;                                   /* non-periodic: no wrap-around */
}
 
MPI_Cart_create(MPI_COMM_WORLD, dimension, dims, periods, 0, comm_new);
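
Continuing from the snippet above (again a sketch, not the project's exact code), the neighbours to the right and below, which are the only targets of the sends described earlier, can be obtained with MPI_Cart_shift:

int up, down, left, right;
MPI_Cart_shift(*comm_new, 0, 1, &up, &down);     /* neighbours above and below   */
MPI_Cart_shift(*comm_new, 1, 1, &left, &right);  /* neighbours to the left/right */
/* with periods[i] = 0 there is no wrap-around, so processes on the mesh
   edge get MPI_PROC_NULL and the corresponding sends become no-ops */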

Parameter Ranges and Results

Since I am using a mesh topology, the number of processes needs to be a perfect square. ARGO does not allow the use of more than 8 processors, so in order to run with 9 and 16 processes I had to use logical nodes as well.

n      Total nodes  p  k  Generate Array  Processing     Communication   Total
2520   1            1  1  0.389633        2920.768867    0.059432        2921.332974
2520   4            4  1  0.389247        3334.076862    3033.499067     6370.000133
5040   1            1  1  1.499707        26317.789898   0.186295        26319.863721
5040   4            4  1  1.3995          21705.070296   25273.786826    46988.851786
5040   9            3  3  1.497859        27499.393874   21837.854611    49356.686977
5040   16           4  4  1.395927        41874.636627   28608.243295    70551.81067
10080  9            3  3  5.576703        212229.160292  185534.960151   397837.622208
10080  16           4  4  5.980275        368046.095901  671851.092841   1040038.610402

Analysis

[Charts: total execution time and total aggregated time]

This formulation distributes the workload across the nodes, which is visible in the total execution charts: the processing time dominates the communication time, and the more nodes we use, the faster the determinant is computed. Processing takes less time than in 1-D partitioning, since each process works on only part of the array. Communication is also lower, because only parts of the array are transmitted rather than the entire array. For n = 10080, p = 4, k = 4 it appears that using more processors takes longer to find the determinant; this may be due to other workload running on the nodes.