List Scan

GPU Teaching Kit – Accelerated Computing

Objective

Implement a kernel to perform an inclusive parallel scan on a 1D list. The scan operator will be the addition (plus) operator. You should implement the work efficient kernel in Lecture 4.6. Your kernel should be able to handle input lists of arbitrary length. To simplify the lab, the student can assume that the input list will be at most of length \(2048 \times 65,535\) elements. This means that the computation can be performed using only one kernel launch.

The boundary condition can be handled by filling “identity value (0 for sum)” into the shared memory of the last block when the length is not a multiple of the thread block size.

Prerequisites

Before starting this lab, make sure that:

Instructions

Edit the code in the code tab to perform the following:

Instructions about where to place each part of the code is demarcated by the //@@ comment lines.

Local Setup Instructions

The most recent version of source code for this lab along with the build-scripts can be found on the Bitbucket repository. A description on how to use the CMake tool in along with how to build the labs for local development found in the README document in the root of the repository.

The executable generated as a result of compiling the lab can be run using the following command:

./ListScan_Template -e <expected.raw> \
  -i <input.raw> -o <output.raw> -t vector

where <expected.raw> is the expected output, <input.raw> is the input dataset, and <output.raw> is an optional path to store the results. The datasets can be generated using the dataset generator built as part of the compilation process.