CMSC 476: Assignment 5
Overview
Write a C++ program that computes the dot (or scalar or inner) product of two vectors using SIMD parallelism. Achieve this parallelism via CUDA and Thrust.
Declare LOCAL variables vec1 and vec2 of
type thrust::universal_vector<float> to hold the
vectors to be multiplied. Generate random values for each in the range
[-4, 4) using a minstd_rand generator (with seed ONE) and
uniform_real_distribution specialized for
float. Do NOT generate values in parallel for this
assignment, but you should be able to use your SERIAL
fillRandom from a previous assignment. Also, use a
universal_vector of size one to hold the result. Using
these vectors will ensure we use unified memory that CUDA will manage
for us.
Write kernel function dotProductCuda to compute the
scalar product of the two vectors. You may assume the vectors are the
same size, and the size is positive. Use cooperative groups and
partition your blocks into 32-thread warps. Ensure you use an efficient
warp-level reduction, and also have warp 0 reduce across the entire
block using shared memory. Lastly, each block will have to atomically
add its sum to a global sum.
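The reduction order described above can be modeled on the CPU before you write the kernel. The sketch below is NOT CUDA code and is not the required solution. It is a host-side model, with hypothetical names (warpReduceModel, blockReduceModel, blocksFor), of a shuffle-down tree reduction within a warp, followed by warp 0 reducing the per-warp partial sums. It also shows one way to compute the number of 256-thread blocks needed to cover n.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr unsigned WARP_SIZE = 32;
constexpr unsigned BLOCK_SIZE = 256;

// Number of blocks needed to cover n elements (ceiling division).
// Written to avoid the overflow that (n + BLOCK_SIZE - 1) risks for huge n.
constexpr unsigned
blocksFor (unsigned n)
{
  return n / BLOCK_SIZE + (n % BLOCK_SIZE != 0 ? 1u : 0u);
}

// Model of a shuffle-down tree reduction within one warp.
float
warpReduceModel (std::array<float, WARP_SIZE> lanes)
{
  for (unsigned offset = WARP_SIZE / 2; offset > 0; offset /= 2)
  {
    // Snapshot so every lane reads old values, as a shuffle does.
    auto const prev = lanes;
    // Lane i adds the value from lane i + offset. Lanes with no in-range
    // partner are left alone here; they never feed lane 0 again.
    for (unsigned lane = 0; lane + offset < WARP_SIZE; ++lane)
      lanes[lane] = prev[lane] + prev[lane + offset];
  }
  return lanes[0]; // lane 0 holds the warp's sum
}

// Model of one block: each warp reduces its own values, then warp 0
// reduces the per-warp partial sums (staged in shared memory on a GPU).
float
blockReduceModel (std::vector<float> const& threadValues)
{
  assert (threadValues.size () <= BLOCK_SIZE);
  std::array<float, WARP_SIZE> partials{}; // unused slots stay 0
  for (unsigned warp = 0; warp < BLOCK_SIZE / WARP_SIZE; ++warp)
  {
    std::array<float, WARP_SIZE> lanes{}; // idle threads contribute 0
    for (unsigned lane = 0; lane < WARP_SIZE; ++lane)
    {
      unsigned const idx = warp * WARP_SIZE + lane;
      if (idx < threadValues.size ())
        lanes[lane] = threadValues[idx];
    }
    partials[warp] = warpReduceModel (lanes);
  }
  return warpReduceModel (partials);
}
```

Note that the tree adds elements in a different order than a left-to-right loop does. That reordering is exactly the reassociation that can make the CUDA result differ slightly from the library result.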
We want to compare our result to a “gold” version, dotProductLibrary,
that does NOT use reassociation. It should return exactly what the STL
produces with std::inner_product. To compare
the two results, use my equality checker. Embed the equality checker in your own driver.
As usual, use our Timer class and time only the computational logic and NO output code.
Since we cannot easily use C++23 with CUDA, use the fmt
library and fmt::println for output instead of
std::println. This will mean you have to link against the
fmt library.
Lastly, write a Makefile that will build your executable
or clean up. Ensure ALL dependencies are properly listed. When I type
make your project MUST produce a release build, with no
warnings (or you get a ZERO).
Input Specification
Input the vector size N of type
unsigned.
Use PRECISELY the format below with the EXACT same SPACING and SPELLING.
N ==> <UserInput>
Output Specification
Output the two products, two times, the speedup for your CUDA version
over dotProductLibrary, and whether the two versions match
(“yes” or “NO!”). Limit products to FOUR decimal places and other values
to TWO decimal places (and use milliseconds for times).
Use PRECISELY the format shown below with the EXACT same spacing and spelling.
Sample Output
(no spaces before the following line)
N ==> 1024
Library: xx.xxxx
Time: xx.xx ms
CUDA: xx.xxxx
Time: xx.xx ms
Speedup: x.xx
Correct: yes
(no spaces after the preceding line)
Required Types, Concepts, and Functions
// Generate random values, using std::ranges::generate.
template<typename T, typename U>
requires std::is_arithmetic_v<T>
void
fillRandom (std::span<T> seq, U min, U max, unsigned seed)
// Compute scalar product using CUDA.
// Use a block size of 256 threads, and a 1D grid of as many
// blocks as are necessary to accommodate 'n'.
__global__
void
dotProductCuda (float* const a, float* const b, float* result, uint n)
// Compute scalar product using the STL (on the CPU).
float
dotProductLibrary (span<float const> a, span<float const> b)
What to Submit
Submit DotProductCuda.cu, Timer.hpp, and
Makefile. Do NOT rename any of these files.
Hints
See VectorAdd.cu.
Comments
How does your GPU time compare to your vectorized CPU time?
Gary M. Zoppetti, Ph.D.