CMSC 476: Assignment 4

Overview

Write a C++ program that computes the dot (or scalar or inner) product of two vectors using SIMD parallelism. Achieve this parallelism using AVX intrinsic functions.

Declare LOCAL variables vec1 and vec2 of type vector<float> to hold the vectors to be multiplied. Generate random values for each in the range [-4, 4) using a minstd_rand generator (with seed ONE) and uniform_real_distribution specialized for float. Do NOT generate values in parallel for this assignment, but you should be able to use your SERIAL fillRandom from a previous assignment.

Write function dotProductSimd to compute the scalar product of the two vectors. You may assume the vectors are the same size, and the size is positive. Ensure that EIGHT FP multiplies and adds are done in parallel with AVX instructions. Specifically, achieve this by using the AVX Fused Multiply-Add instruction vfmaddxxxps. I will let you determine the intrinsic that will generate this instruction. Since there are two FMA units, unroll your loop, using my unroll function in the code below, by a factor of at least TWO so TWO fmadd-s can be issued in parallel (each eight lanes wide). In the end, you will have at least two vectors of size EIGHT that you will need to reduce to a single scalar. To do this, research horizontal SIMD instructions.

We will want to compare our result to a “gold” version that does NOT use reassociation, named dotProductLibrary. This will be what the STL produces with std::inner_product. To compare the two results, use my equality checker. Embed the equality checker and unroller in your own driver.

As usual, use our Timer class and time only the computational logic and NO output code.
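
The idea of timing only the computation can be sketched as follows. Since the interface of our Timer class is not reproduced in this handout, the sketch uses std::chrono::steady_clock as a stand-in; the helper name timeMs is hypothetical, and your submission should still use the Timer class.

```cpp
#include <chrono>
#include <utility>

// HYPOTHETICAL helper, not the course Timer class.
// Runs the callable once and returns the elapsed time in milliseconds.
template<typename F>
double
timeMs (F&& work)
{
  auto const start = std::chrono::steady_clock::now ();
  std::forward<F> (work) ();
  auto const stop = std::chrono::steady_clock::now ();
  return std::chrono::duration<double, std::milli> (stop - start).count ();
}
```

A possible use is double ms = timeMs ([&] { result = dotProductLibrary (vec1, vec2); }); with all printing done after the call returns, so that no output code falls inside the timed region.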

Lastly, write a Makefile that will build your executable or clean up. Ensure ALL dependencies are properly listed. When I type make, your project MUST produce a release build with no warnings (or you get a ZERO).

Input Specification

Input the vector size N of type unsigned.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING.

N ==> <UserInput>

Output Specification

Output the two products, the two execution times, the speedup of your SIMD version over dotProductLibrary, and whether the two versions match (“yes” or “NO!”). Limit products to FOUR decimal places and other values to TWO decimal places (and use milliseconds for times).

Use PRECISELY the format shown below with the EXACT same spacing and spelling.

Sample Output

(no spaces before the following line)
N ==> 1024

Library: xx.xxxx
Time:    xx.xx ms

SIMD:    xx.xxxx
Time:    xx.xx ms
Speedup: x.xx

Correct: yes
(no spaces after the preceding line)
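
One way to produce the layout in the sample output above is sketched here with std::fixed and std::setprecision. The function name formatReport and its parameters are hypothetical, and the speedup is computed as library time divided by SIMD time, which is an interpretation of "speedup for your SIMD version over dotProductLibrary". The leading newline produces the blank line after the prompt line, assuming the user's Enter key has already ended that line.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// HYPOTHETICAL helper: builds the report text shown in the sample output.
std::string
formatReport (float libProduct, double libMs,
              float simdProduct, double simdMs, bool match)
{
  std::ostringstream out;
  out << std::fixed;
  out << '\n'
      << "Library: " << std::setprecision (4) << libProduct << '\n'
      << "Time:    " << std::setprecision (2) << libMs << " ms\n"
      << '\n'
      << "SIMD:    " << std::setprecision (4) << simdProduct << '\n'
      << "Time:    " << std::setprecision (2) << simdMs << " ms\n"
      // Precision 2 is still in effect for the speedup.
      << "Speedup: " << libMs / simdMs << '\n'
      << '\n'
      << "Correct: " << (match ? "yes" : "NO!") << '\n';
  return out.str ();
}
```

Building the text first and printing it once also keeps all output code outside the timed region.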

Required Types, Concepts, and Functions

// Declare these in your SIMD function.
constexpr int VEC_WIDTH = 8;
constexpr int UNROLL_FACTOR = 2;

// Generate random values, using std::ranges::generate. 
template<typename T, typename U>
  requires std::is_arithmetic_v<T>
void
fillRandom (std::span<T> seq, U min, U max, unsigned seed)

// Compute scalar product serially using std::inner_product.
float
dotProductLibrary (std::span<float const> a, std::span<float const> b)

// Compute parallel scalar product using SIMD.
float
dotProductSimd (std::span<float const> a, std::span<float const> b)

What to Submit

Submit DotProduct.cc, Timer.hpp, and Makefile. Do NOT rename any of these files.

Hints

Include header file <immintrin.h> for any intrinsic functions you may need.

Also, see Intrinsics.cc.

Comments

What’s the best unroll factor and why?

Did you test your code thoroughly? Hint: “No” is the wrong answer.

Gary M. Zoppetti, Ph.D.