CMSC 476: Assignment 2
Overview
Write a C++ program to multiply two square matrices of size
N. Use the cubic algorithm we discussed in class, but code
FIVE different versions (ijk, jki,
kij, block, and parallel). The
ijk variants indicate the order of the loops, the
block version uses whatever order you think is best and
blocking, and the parallel version uses whatever order you
think is best and multiple threads. Input N and the version
from the user. If the version is block, also input a block
size.
Use uniform_int_distribution or
uniform_real_distribution, depending on your Matrix type,
and a minstd_rand engine seeded with ONE. For the version
you submit, use Matrix<double> and fill both
A and B with double-s in the range [ -4.0,
+4.0 ]. Test with other ranges and seeds, but submit the version with
the parameters above. Compute the product in matrix C.
Since N will vary you will need to dynamically allocate
space for each matrix. We will do this with our own templated Matrix class, with support for subscripting
using the C++20 multi-dimensional subscript operator:
A[row, col]. To ensure the matrix is stored on a cache
block boundary, use an aligned new expression:
new (std::align_val_t (64)) T[numElems]. See C++
Aligned new[]. By leveraging std::unique_ptr we ensure
we have no memory leaks.
Time the appropriate multiplication function (and ONLY this function)
using our standard Timer class. Use values of
N of 1024, 1408, and 1792, for each of the five versions.
Thus, you will have 15 times total for your timing results table. Obtain
these times by averaging over at least FIVE trials, and compiling with
-O3. Include this ASCII table at the TOP of your source
file below the standard comment block.
We need to ensure our matrix product is correct. To verify it, we will leverage a reliable package used by NumPy, R, and Julia, namely OpenBLAS. OpenBLAS is a high-performance implementation of the BLAS and LAPACK APIs, with highly-tuned routines. Complete Blas.hpp so that it correctly invokes a suitable OpenBLAS routine.
ENSURE all versions compute the same product for the same value of
N. Use equalMatrices to see if two matrices
are equivalent. An incorrect multiply routine will result in a MASSIVE
deduction. As usual, AVOID global variables.
To build your project, use my Makefile. For all remaining assignmments, you MUST submit a Makefile.
Input Specification
Input the size of the matrices (N) and the version
(ijk, jki, kij,
block, or parallel). If version is
block, also input a block size (blockSize).
You may assume inputs are valid, AND that blockSize divides
N.
Use PRECISELY the format below with the EXACT same SPACING and SPELLING.
N ==> <UserInput>
Version ==> <UserInput>
B ==> <UserInput> // Only if version is 'block'
Output Specification
Output whether the product is correct, specifically the strings “yes” or “NO!”. Then output the time it took to compute the product, using milliseconds and LIMITING the decimal places to TWO.
Use PRECISELY the format below with the EXACT same SPACING and SPELLING. Each deviation in spacing or spelling will result in AT LEAST a ten point deduction.
Sample Output
N ==> 50
Version ==> ijk
Correct: yes
Time: 0.20 ms
Required Types, Concepts, and Functions
Include one multiplication function for each version in a driver file.
std::uniform_int_distribution<>
std::uniform_real_distribution<>
// Prompt and get input.
std::tuple<unsigned, std::string, unsigned>
getInput ();
// Output the final results.
void
printResults (// Args are up to you);
// Populate a matrix with values in the range [ min, max ].
// Use "if constexpr" to choose the correct distribution.
template<typename T, typename U>
void
fillRandom (Matrix<T>& matrix, U min, U max);
// Compute C = A * B.
// Make NO assumptions about the initial contents of C.
template<typename T>
void
multiplyIjk (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);
template<typename T>
void
multiplyJki (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);
template<typename T>
void
multiplyKij (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);
template<typename T>
void
multiplyBlock (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B,
unsigned blockSize);
template<typename T>
void
multiplyParallel (Matrix<T>& C, Matrix<T> const& A, Matrix<T> const& B);What to Submit
Submit your C++ driver file, Matrix.hpp,
Blas.hpp, Timer.hpp, and a
Makefile. Do NOT rename any of these
files.
Format your runtime table as below.
Ijk Jki Kij Block Parallel
1024 xxxxx.xx xxxxx.xx xxxxx.xx xxxxx.xx xxxxx.xx
1408
1792
Hints
Ensure you are using
clang-format.Compile with
-Walland remove ALL warnings.For your parallel version I STRONGLY suggest you use OpenMP. Use a
parallel fordirective via the attribute syntax to instruct the compiler to create a team of threads and divide the iteration space for theforloop that follows. The necessary attribute is[[omp::directive (parallel for)]]. You will have to think carefully about WHETHER the loop that directly follows the attribute is suitable for parallelization.
Comments
Can you beat my parallel time? Ensure you can explain every line in your solution.
How fast is the OpenBLAS multiply? Can you beat it?
Gary M. Zoppetti, Ph.D.