CMSC 476: Assignment 2

Overview

Write a C++ program to multiply two square matrices of size N. Use the cubic algorithm we discussed in class, but code FIVE different versions (ijk, jki, kij, block, and parallel). The ijk variants indicate the order of the loops, the block version uses whatever order you think is best and blocking, and the parallel version uses whatever order you think is best and multiple threads. Input N and the version from the user. If the version is block, also input a block size.

Use uniform_int_distribution or uniform_real_distribution, depending on your Matrix type, and a minstd_rand engine seeded with ONE. For the version you submit, use Matrix<double> and fill both A and B with double-s in the range [ -4.0, +4.0 ]. Test with other ranges and seeds, but submit the version with the parameters above. Compute the product in matrix C.

Since N will vary you will need to dynamically allocate space for each matrix. We will do this with our own templated Matrix class, with support for subscripting using the C++20 multi-dimensional subscript operator: A[row, col]. To ensure the matrix is stored on a cache block boundary, use an aligned new expression: new (std::align_val_t (64)) T[numElems]. See C++ Aligned new[]. By leveraging std::unique_ptr we ensure we have no memory leaks.

Time the appropriate multiplication function (and ONLY this function) using our standard Timer class. Use values of N of 1024, 1408, and 1792, for each of the five versions. Thus, you will have 15 times total for your timing results table. Obtain these times by averaging over at least FIVE trials, and compiling with -O3. Include this ASCII table at the TOP of your source file below the standard comment block.

We need to ensure our matrix product is correct. To verify it, we will leverage a reliable package used by NumPy, R, and Julia, namely OpenBLAS. OpenBLAS is a high-performance implementation of the BLAS and LAPACK APIs, with highly-tuned routines. Complete Blas.hpp so that it correctly invokes a suitable OpenBLAS routine.

ENSURE all versions compute the same product for the same value of N. Use equalMatrices to see if two matrices are equivalent. An incorrect multiply routine will result in a MASSIVE deduction. As usual, AVOID global variables.

To build your project, use my Makefile. For all remaining assignmments, you MUST submit a Makefile.

Input Specification

Input the size of the matrices (N) and the version (ijk, jki, kij, block, or parallel). If version is block, also input a block size (blockSize). You may assume inputs are valid, AND that blockSize divides N.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING.

N       ==> <UserInput>
Version ==> <UserInput>
B       ==> <UserInput> // Only if version is 'block'

Output Specification

Output whether the product is correct, specifically the strings “yes” or “NO!”. Then output the time it took to compute the product, using milliseconds and LIMITING the decimal places to TWO.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING. Each deviation in spacing or spelling will result in AT LEAST a ten point deduction.

Sample Output

N       ==> 50
Version ==> ijk
    
Correct: yes
Time:    0.20 ms

Required Types, Concepts, and Functions

Include one multiplication function for each version in a driver file.

std::uniform_int_distribution<>

std::uniform_real_distribution<>

// Prompt and get input. 
std::tuple<unsigned, std::string, unsigned>
getInput ();

// Output the final results. 
void
printResults (// Args are up to you); 

// Populate a matrix with values in the range [ min, max ].
// Use "if constexpr" to choose the correct distribution. 
template<typename T, typename U>
void
fillRandom (Matrix<T>& matrix, U min, U max);

// Compute C = A * B.
//   Make NO assumptions about the initial contents of C. 
template<typename T>
void
multiplyIjk (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);

template<typename T>
void
multiplyJki (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);

template<typename T>
void
multiplyKij (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);

template<typename T>
void
multiplyBlock (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B,
               unsigned blockSize);

template<typename T>
void
multiplyParallel (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);

What to Submit

Submit your C++ driver file, Matrix.hpp, Blas.hpp, Timer.hpp, and a Makefile. Do NOT rename any of these files.

Format your runtime table as below.

           Ijk       Jki       Kij     Block  Parallel
1024  xxxxx.xx  xxxxx.xx  xxxxx.xx  xxxxx.xx  xxxxx.xx
1408
1792

Hints

Ensure you are using clang-format.
Compile with -Wall and remove ALL warnings.
For your parallel version I STRONGLY suggest you use OpenMP. Use a parallel for directive via the attribute syntax to instruct the compiler to create a team of threads and divide the iteration space for the for loop that follows. The necessary attribute is [[omp::directive (parallel for)]]. You will have to think carefully about WHETHER the loop that directly follows the attribute is suitable for parallelization.

Comments

Can you beat my parallel time? Ensure you can explain every line in your solution.

How fast is the OpenBLAS multiply? Can you beat it?

Gary M. Zoppetti, Ph.D.