CMSC 476: Assignment 4
Overview
Write a C++ program that computes the dot (or scalar or inner) product of two vectors using SIMD parallelism. Achieve this parallelism using AVX intrinsic functions.
Declare LOCAL variables vec1 and vec2 of
type vector<float> to hold the vectors to be
multiplied. Generate random values for each in the range [-4, 4) using a
minstd_rand generator (with seed ONE) and
uniform_real_distribution specialized for
float. Do NOT generate values in parallel for this
assignment, but you should be able to use your SERIAL
fillRandom from a previous assignment.
Write function dotProductSimd to compute the scalar
product of the two vectors. You may assume the vectors are the same
size, and the size is positive. Ensure that EIGHT FP multiplies and adds
are done in parallel with AVX instructions. Specifically, achieve this
by using the AVX Fused
Multiply-Add instruction vfmaddxxxps. I will let you
determine the intrinsic that will generate this instruction. Since there
are two FMA units, unroll your loop, using my unroll
function in the code below, by a factor of at least TWO so TWO
fmadd-s can be issued in parallel (each eight lanes wide).
In the end, you will have at least two vectors of size EIGHT that you
will need to reduce to a single scalar. To do this, research horizontal
SIMD instructions.
We will want to compare our result to a “gold” version that does NOT
use reassociation, named dotProductLibrary. This will be
what the STL produces with std::inner_product. To compare
the two results, use my equality checker. Embed the equality checker and unroller in your own
driver.
As usual, use our Timer class and time only the computational logic and NO output code.
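To illustrate what "time only the computational logic" means, here is a sketch that uses std::chrono as a stand-in. The timeMilliseconds helper is hypothetical and is not the interface of the course Timer class, which the submission must still use.

```cpp
#include <chrono>

// Hypothetical helper: runs work () once and returns the elapsed
// wall-clock time in milliseconds. Only the call itself is timed.
template<typename Func>
double
timeMilliseconds (Func&& work)
{
  auto const start = std::chrono::steady_clock::now ();
  work ();
  auto const stop = std::chrono::steady_clock::now ();
  return std::chrono::duration<double, std::milli> (stop - start).count ();
}
```

The pattern is to store the product in a variable inside the timed callable and to print it only after the timer has stopped, so that no output code falls inside the measured region.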
Lastly, write a Makefile that will build your executable
or clean up. Ensure ALL dependencies are properly listed. When I type
make, your project MUST produce a release build with no
warnings (or you get a ZERO).
Input Specification
Input the vector size N of type
unsigned.
Use PRECISELY the format below with the EXACT same SPACING and SPELLING.
N ==> <UserInput>
Output Specification
Output the two products, two times, the speedup for your SIMD version
over dotProductLibrary, and whether the two versions match
(“yes” or “NO!”). Limit products to FOUR decimal places and other values
to TWO decimal places (and use milliseconds for times).
Use PRECISELY the format shown below with the EXACT same spacing and spelling.
Sample Output
(no spaces before the following line)
N ==> 1024
Library: xx.xxxx
Time: xx.xx ms
SIMD: xx.xxxx
Time: xx.xx ms
Speedup: x.xx
Correct: yes
(no spaces after the preceding line)
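The report above could be produced along the following lines. This is a sketch: formatReport is a hypothetical helper, the speedup is assumed to be the library time divided by the SIMD time, and the spacing simply follows the sample as it appears in this handout, which remains the authority.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds the report text. Products use four
// decimal places; times and speedup use two. Times are in milliseconds.
std::string
formatReport (float libProduct, double libMs,
              float simdProduct, double simdMs, bool match)
{
  std::ostringstream out;
  out << std::fixed;
  out << "Library: " << std::setprecision (4) << libProduct << '\n';
  out << "Time: " << std::setprecision (2) << libMs << " ms\n";
  out << "SIMD: " << std::setprecision (4) << simdProduct << '\n';
  out << "Time: " << std::setprecision (2) << simdMs << " ms\n";
  // Assumption: speedup = library time / SIMD time.
  out << "Speedup: " << std::setprecision (2) << libMs / simdMs << '\n';
  out << "Correct: " << (match ? "yes" : "NO!") << '\n';
  return out.str ();
}
```

Building the text in a string first keeps all formatting outside the timed region. The result can then be written to std::cout in one statement.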
Required Types, Concepts, and Functions
// Declare these in your SIMD function.
constexpr int VEC_WIDTH = 8;
constexpr int UNROLL_FACTOR = 2;
// Generate random values, using std::ranges::generate.
template<typename T, typename U>
requires std::is_arithmetic_v<T>
void
fillRandom (std::span<T> seq, U min, U max, unsigned seed)
// Compute scalar product serially using std::inner_product.
float
dotProductLibrary (span<float const> a, span<float const> b)
// Compute parallel scalar product using SIMD.
float
dotProductSimd (span<float const> a, span<float const> b)
What to Submit
Submit DotProduct.cc, Timer.hpp, and
Makefile. Do NOT rename any of these files.
Hints
Include header file <immintrin.h> for any
intrinsic functions you may need.
Also, see Intrinsics.cc.
Comments
What’s the best unroll factor and why?
Did you test your code thoroughly? Hint: “No” is the wrong answer.
Gary M. Zoppetti, Ph.D.