In C++ (and CUDA), we use headers to include predefined functions and classes. The most common header is <iostream>, which provides input/output functionality. For example:

#include <iostream>

C++ provides several fundamental data types:

- int: represents whole numbers (e.g., 5, -10, 0).
- short: short integers with a smaller range.
- long: long integers with a larger range.
- float: single-precision floating-point (e.g., 3.14, -0.001).
- double: double-precision floating-point (more accurate, but uses more memory).
- char: represents individual characters (e.g., 'A', 'b', '$').
- bool: represents true or false values.
- Use const to define constants (values that don't change during program execution).
- Example:

const float PI = 3.14159;
- A for loop repeats a block of code a specified number of times.
- Example:

for (int i = 0; i < 10; ++i) {
    // Code to execute
}
- A while loop repeats a block of code while a condition is true.
- Example:

int count = 0;
while (count < 5) {
    // Code to execute
    ++count;
}
- Every C++ program starts with the main function.
- It's the entry point of your program, where execution begins.
- The int before main indicates that the function returns an integer (usually 0 for successful execution).
- A kernel is a function that runs on the GPU.
- It's the heart of CUDA programming.
- Kernels are defined using the __global__ keyword.
Example of a simple kernel:

__global__ void vectorAdd(float* A, float* B, float* C, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (tid < size) {
        C[tid] = A[tid] + B[tid];
    }
}

- In CUDA, we work with threads.
- Each thread has a unique index.
- We calculate the thread index using blockIdx.x, blockDim.x, and threadIdx.x.
- We allocate memory for data on both the CPU (host) and GPU (device).
- cudaMalloc allocates memory on the device.
- cudaMemcpy transfers data between host and device.
- We launch the kernel using the <<<numBlocks, threadsPerBlock>>> syntax.
- numBlocks and threadsPerBlock determine the grid and block dimensions.
- After using GPU memory, we free it with cudaFree.
- Also, delete any host memory allocated with new (use delete[] for arrays).
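Putting the steps above together, here is a sketch of a complete host program for the vectorAdd kernel (the sizes and names are illustrative, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <iostream>

__global__ void vectorAdd(float* A, float* B, float* C, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) C[tid] = A[tid] + B[tid];
}

int main() {
    int size = 1000;
    size_t bytes = size * sizeof(float);

    // Host (CPU) memory, allocated with new.
    float* h_A = new float[size];
    float* h_B = new float[size];
    float* h_C = new float[size];
    for (int i = 0; i < size; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Device (GPU) memory, allocated with cudaMalloc.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Copy inputs host -> device.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch with enough blocks to cover all elements.
    int threadsPerBlock = 256;
    int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, size);

    // Copy the result device -> host.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    std::cout << h_C[0] << "\n"; // each element should be 1 + 2 = 3

    // Free device memory, then host memory.
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}
```

This requires an NVIDIA GPU and the nvcc compiler to run.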