CMSC 476: Parallel Programming
Assignment 1

Overview

Write a PARALLEL C++ program that computes a general reduction of a vector of integers. To achieve parallelism, spawn 'p' CHILD processes, and have each compute a local reduction on a contiguous block of elements (use the partitioning formula from class to ensure the load is balanced). Have each child send its local result to the parent, which will compute the global result and print it along with the time it took to compute. This is an all-to-one, NON-tree-based reduction similar to Algorithm 1 from class, but the parent process ONLY accumulates partial results — it does NOT compute a local result of its own. To verify correctness, compute the global result serially in the parent process and print it along with the time it took to compute.

Use a non-global variable 'A' of type 'vector<int>' to hold the numbers to be summed. Each process will be able to access it even though each process has a separate address space. This is due to a mechanism called Copy-on-Write.

Write a function that performs a local reduction of a contiguous block of elements, and sends it to the parent process. To do this you will need to use some form of Interprocess Communication (IPC) (e.g., shared memory, pipes, FIFOs, or sockets). Do NOT return the partial result by using a process's exit status, as it is meant to return only an 8-bit status indicating whether the process terminated successfully.

Restrict the 'A' values to be between 0 and 4, inclusive, to make overflow unlikely. You MUST use uniform_int_distribution and a minstd_rand generator. TEST with various seeds, but for the program you submit, use a seed of ZERO. Generate the random integers in the parent process BEFORE spawning any children.

Time the parallel sum and serial sum using this Timer class. Compile with '-O3' to obtain timing values. For the parallel sum, start the timer RIGHT before creating the child processes, and stop it immediately after the global result is computed (BEFORE doing any output). For the serial reduction, start the timer RIGHT before the call to transform_reduce, and stop it RIGHT afterwards. NEVER time any output code.

For the version you submit, use std::identity as your transformer and std::plus as your combiner. This should result in your program computing a simple sum, which should match my results. Experiment with other transformers and combiners too see how serial and parallel runtimes compare.

Use GOOD sofware engineering practices! Have "main" first, and keep it clean and short. Use ALL of the Required Types... listed below. My code has zero "counting for" loops and zero "while" loops. If you want to repeat a segment "n" times use a better construct (like a standard algorithm or range-based "for").

Input Specification

Input the number of child processes 'p' and the vector size 'N'. Assume 1 ≤ p ≤ 16 and N ≥ p.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING.

p ==> <UserInput>

N ==> <UserInput>

Output Specification

Output the hardware concurrency (via std::thread), parallel reduction result, the parallel time, the serial reduction result, and the serial time. For the times use milliseconds and LIMIT the decimal places to TWO.

Use PRECISELY the format below with the EXACT same SPACING and SPELLING. Each deviation in spacing or spelling will result in AT LEAST a five point deduction, likely more.

Sample Output

p ==> 4
(hardware concurrency = 4)

N ==> 100000000

Parallel sum:  199981889
Parallel time: 30.38 ms

Serial sum:    199981889
Serial time:   109.90 ms

Required Types, Concepts, and Functions

std::uniform_int_distribution

std::minstd_rand

// If for some reason you want a C-style array, use this instead!
std::array

// Compute partial reductions (in parallel code) and full serial reduction
std::transform_reduce

// Generate random integers to fill 'A'
// USE a lambda as the last argument. 
std::ranges::generate

// Concept for binary function that takes two T-s and returns a T
template <typename F, typename T>
concept BinaryFunction = std::regular_invocable<F, T, T> and
                         std::same_as<T, std::invoke_result_t<F, T, T>>;

// Concept for unary function that takes a T and returns an object
//   convertible to a T
template <typename F, typename T>
concept UnaryFunction = std::regular_invocable<F, T> and
                        std::convertible_to<std::invoke_result_t<F, T>, T>;

// Compute a parallel reduction on a contiguous subrange
//   of elements of 'v' using process-level parallelism.
// Uses 'numProcesses' processes. 
int
transformReducePar (unsigned numProcesses,
                    std::span<int const> v,
                    int init,
                    BinaryFunction<int> auto combiner,
                    UnaryFunction<int> auto transformer);

// Compute a reduction on a contiguous subrange of elements of 'v'.
// 'id': ID of the process, in the range [0, 'numProcesses').
// Use our partitioning logic so multiple cores run this function 
//   in parallel. 
int
transformReduceOnProc (unsigned id,
                       unsigned numProcesses,
                       std::span<int const> v,
                       int init,
                       BinaryFunction<int> auto combiner,
                       UnaryFunction<int> auto transformer);

What to Submit

Submit your C++ driver file AND "Timer.hpp". Do NOT rename "Timer.hpp". Ensure you are MATCHING my results (your times will likely be different).

Hints

You will need to include at least the following header files: <algorithm>, <numeric>, <ranges>, <unistd.h>, and <sys/wait.h>. Do NOT rely on TRANSITIVE includes — IWYU!

See the Forking example. For IPC consider half-duplex pipes. You will also find Beej's Guide to Unix IPC helpful.

Comments

There's a good discussion on SO regarding threads and processes in Linux.


Gary M. Zoppetti, Ph.D.