CMSC 476: Assignment 5

Overview

Write a C++ program that computes the dot (or scalar or inner) product of two vectors using SIMD parallelism. Achieve this parallelism via CUDA and Thrust.

Declare LOCAL variables vec1 and vec2 of type thrust::universal_vector<float> to hold the vectors to be multiplied. Generate random values for each in the range [-4, 4) using a minstd_rand generator (with seed ONE) and uniform_real_distribution specialized for float. Do NOT generate values in parallel for this assignment, but you should be able to use your SERIAL fillRandom from a previous assignment. Also, use a universal_vector of size one to hold the result. Using these vectors will ensure we use unified memory that CUDA will manage for us.

Write kernel function dotProductCuda to compute the scalar product of the two vectors. You may assume the vectors are the same size, and the size is positive. Use cooperative groups and partition your blocks into 32-thread warps. Ensure you use an efficient warp-level reduction, and also have warp 0 reduce across the entire block using shared memory. Lastly, each block will have to atomically add its sum to a global sum.
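One possible shape for the kernel is sketched below. This is NOT the required implementation, just an illustration of the reduction pattern described above; it assumes CUDA 11+ (for `cg::reduce`), 256-thread blocks, and that the caller zeroes `*result` before launching.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

__global__ void
dotProductCuda (float* const a, float* const b, float* result, uint n)
{
  __shared__ float warpSums[8];   // 256 threads / 32 lanes = 8 warps per block

  cg::thread_block block = cg::this_thread_block ();
  cg::thread_block_tile<32> warp = cg::tiled_partition<32> (block);

  // One element per thread; pad with zero past the end.
  uint i = blockIdx.x * blockDim.x + threadIdx.x;
  float prod = (i < n) ? a[i] * b[i] : 0.0f;

  // Efficient warp-level reduction (shuffle-based).
  float warpSum = cg::reduce (warp, prod, cg::plus<float> ());

  // Lane 0 of each warp publishes its partial sum to shared memory.
  if (warp.thread_rank () == 0)
    warpSums[warp.meta_group_rank ()] = warpSum;
  block.sync ();

  // Warp 0 reduces across the block, then atomically adds to the global sum.
  if (warp.meta_group_rank () == 0)
  {
    uint numWarps = block.group_dim ().x / 32;
    float blockSum = (warp.thread_rank () < numWarps)
      ? warpSums[warp.thread_rank ()] : 0.0f;
    blockSum = cg::reduce (warp, blockSum, cg::plus<float> ());
    if (warp.thread_rank () == 0)
      atomicAdd (result, blockSum);
  }
}
```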

We will want to compare our result to a “gold” version that does NOT use reassociation, named dotProductLibrary. This will be what the STL produces with std::inner_product. To compare the two results, use my equality checker. Embed the equality checker in your own driver.

As usual, use our Timer class and time only the computational logic and NO output code.

Since we cannot easily use C++23 with CUDA, use the fmt library and fmt::println for output instead of std::println. This will mean you have to link against the fmt library.
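For reference, output with fmt might look like the snippet below. This is a hypothetical fragment: it assumes a recent fmt release that provides `fmt::println`, and the variable names are placeholders.

```cpp
#include <fmt/core.h>

int
main ()
{
  float libraryResult = 3.14159f;   // placeholder value
  double libraryMs = 12.3456;       // placeholder value
  fmt::println ("Library: {:.4f}", libraryResult);   // four decimal places
  fmt::println ("Time:    {:.2f} ms", libraryMs);    // two decimal places
}
```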

Lastly, write a Makefile that will build your executable or clean up. Ensure ALL dependencies are properly listed. When I type make your project MUST produce a release build, with no warnings (or you get a ZERO).
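A Makefile along the following lines is one possibility. The executable name, flags, and fmt link line are assumptions; adjust them for your system, and make sure the result is a warning-free release build.

```makefile
# Hypothetical sketch -- names and flags are assumptions.
NVCC     := nvcc
NVCCFLAGS := -O3 -std=c++20 --extended-lambda -Xcompiler -Wall,-Wextra
LDLIBS   := -lfmt

DotProductCuda : DotProductCuda.cu Timer.hpp
	$(NVCC) $(NVCCFLAGS) DotProductCuda.cu -o $@ $(LDLIBS)

.PHONY : clean
clean :
	$(RM) DotProductCuda
```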

Input Specification

Input the vector size N of type unsigned.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING.

N ==> <UserInput>

Output Specification

Output the two products, the two times, the speedup for your CUDA version over dotProductLibrary, and whether the two versions match (“yes” or “NO!”). Limit products to FOUR decimal places and other values to TWO decimal places (and use milliseconds for times).

Use PRECISELY the format shown below with the EXACT same spacing and spelling.

Sample Output

(no spaces before the following line)
N ==> 1024

Library: xx.xxxx
Time:    xx.xx ms

CUDA:    xx.xxxx
Time:    xx.xx ms
Speedup: x.xx

Correct: yes
(no spaces after the preceding line)

Required Types, Concepts, and Functions

// Generate random values, using std::ranges::generate. 
template<typename T, typename U>
  requires std::is_arithmetic_v<T>
void
fillRandom (std::span<T> seq, U min, U max, unsigned seed)

// Compute scalar product using CUDA.
// Use a block size of 256 threads, and a 1D grid of as many
//   blocks as are necessary to accommodate 'n'.
__global__ //
void
dotProductCuda (float* const a, float* const b, float* result, uint n)

// Compute scalar product using the STL (on the CPU). 
float
dotProductLibrary (std::span<float const> a, std::span<float const> b)

What to Submit

Submit DotProductCuda.cu, Timer.hpp, and Makefile. Do NOT rename any of these files.

Hints

See VectorAdd.cu.

Comments

How does your GPU time compare to your vectorized CPU time?


Gary M. Zoppetti, Ph.D.