Implement a kernel that performs reduction of a 1D list. The reduction
should give the sum of the list. You should implement the improved
kernel discussed in week 4. Your kernel should be able to handle input
lists of arbitrary length. However, for simplicity, you can assume that
the input list will be at most 2048 x 65535 elements so
that it can be handled by only one kernel launch. The boundary condition
can be handled by filling the identity value (0 for sum) into the shared
memory of the last block when the length is not a multiple of the thread
block size. Further assume that the partial sums produced by the
individual blocks will be summed up by the CPU.
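The description above can be sketched roughly as follows. This is only a minimal illustration, not the lab's reference solution: it assumes that the "improved" kernel is the variant whose stride starts at the block size and halves at each step, that the elements are `float`, and that `BLOCK_SIZE` is 1024 so each block covers 2048 elements. The names `total` and `reduceOnDevice` are hypothetical, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumption: 1024 threads per block, so one block reduces 2 * 1024 = 2048 elements.
#define BLOCK_SIZE 1024

// Hypothetical kernel name. Each block loads up to 2 * BLOCK_SIZE elements into
// shared memory, pads out-of-range slots with 0 (the identity for sum), and
// writes a single partial sum to output[blockIdx.x].
__global__ void total(const float *input, float *output, unsigned int len) {
  __shared__ float partialSum[2 * BLOCK_SIZE];

  unsigned int t = threadIdx.x;
  unsigned int start = 2 * blockIdx.x * blockDim.x;

  // Every thread loads two elements; slots past the end of the list get 0.
  partialSum[t] = (start + t < len) ? input[start + t] : 0.0f;
  partialSum[blockDim.x + t] =
      (start + blockDim.x + t < len) ? input[start + blockDim.x + t] : 0.0f;

  // The stride halves each step, so the active threads stay contiguous.
  for (unsigned int stride = blockDim.x; stride > 0; stride >>= 1) {
    __syncthreads();  // all writes from the previous step must be visible
    if (t < stride) {
      partialSum[t] += partialSum[t + stride];
    }
  }

  if (t == 0) {
    output[blockIdx.x] = partialSum[0];
  }
}

// Hypothetical host helper: launches the kernel once and sums the per-block
// partial sums on the CPU.
float reduceOnDevice(const std::vector<float> &host) {
  unsigned int len = static_cast<unsigned int>(host.size());
  if (len == 0) {
    return 0.0f;
  }
  // One block per 2 * BLOCK_SIZE input elements, rounded up.
  unsigned int numBlocks = (len + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

  float *deviceInput = nullptr;
  float *deviceOutput = nullptr;
  cudaMalloc(&deviceInput, len * sizeof(float));
  cudaMalloc(&deviceOutput, numBlocks * sizeof(float));
  cudaMemcpy(deviceInput, host.data(), len * sizeof(float),
             cudaMemcpyHostToDevice);

  total<<<numBlocks, BLOCK_SIZE>>>(deviceInput, deviceOutput, len);

  // This blocking copy also waits for the kernel to finish.
  std::vector<float> partial(numBlocks);
  cudaMemcpy(partial.data(), deviceOutput, numBlocks * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(deviceInput);
  cudaFree(deviceOutput);

  float sum = 0.0f;
  for (float p : partial) {
    sum += p;
  }
  return sum;
}
```

With these assumptions, a one-dimensional grid of at most 65535 blocks covers 2048 x 65535 elements, which is where the stated size limit comes from. Padding with 0 in shared memory means the reduction loop needs no special case for the last block.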
Prerequisites
Before starting this lab, make sure that:
Edit the code in the code tab to perform the following:
Instructions about where to place each part of the code are demarcated
by the //@@ comment lines.
The most recent version of the source code for this lab, along with the build scripts, can be found on the Bitbucket repository. A description of how to use the CMake tool, along with how to build the labs for local development, can be found in the README document in the root of the repository.
The executable generated as a result of compiling the lab can be run using the following command:
./Reduction_Template -e <expected.raw> \
-i <input.raw> -o <output.raw> -t integral_vector
where <expected.raw> is the expected output, <input.raw> is the input
dataset, and <output.raw> is an optional path to store the results. The
datasets can be generated using the dataset generator built as part of
the compilation process.