Heterogeneous Data Parallel Computing
Environment note: Examples are validated in containers using CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04). The advanced build system automatically detects your GPU vendor and optimizes accordingly. Using Docker is recommended for a consistent setup.
After completing this module, you will be able to:
- Understand the fundamental differences between CPU and GPU architectures
- Set up CUDA and HIP development environments
- Write, compile, and execute basic GPU kernels
- Manage GPU memory allocation and data transfers
- Debug common GPU programming issues
- Apply basic optimization techniques for parallel execution
Welcome to the world of GPU programming! This module introduces the fundamentals of programming GPUs using NVIDIA's CUDA and AMD's HIP within the ROCm framework. We'll use the NVIDIA H100 and AMD MI300X as our reference platforms - representing the cutting edge of parallel computing in 2025.
Traditional CPUs excel at sequential processing with complex instruction sets and large caches, while GPUs are designed for massively parallel workloads. Consider these key differences:
CPU Architecture:
- Few cores (8-64 typically)
- Complex out-of-order execution
- Large caches (MB per core)
- Optimized for single-threaded performance
GPU Architecture:
- Thousands of cores (10,000+ on H100)
- Simple in-order execution
- Small caches shared among many cores
- Optimized for throughput, not latency
This makes GPUs ideal for:
- Machine learning training and inference
- Scientific simulations (weather, molecular dynamics)
- Image and video processing
- Cryptocurrency mining
- Real-time graphics rendering
NVIDIA H100:
- 141GB HBM3 memory
- 3,000+ TFLOPS of performance
- 14,592 CUDA cores
- Memory bandwidth: 3.35 TB/s
AMD MI300X:
- 192GB HBM3e memory
- 5.2 TB/s memory bandwidth
- 304 Compute Units
- Advanced packaging with CPU and GPU on same die
GPUs use the SIMT execution model, where groups of threads execute the same instruction on different data elements. This is different from SIMD (Single Instruction, Multiple Data) because threads can diverge and follow different execution paths.
Key Concepts:
- Warp (NVIDIA) or Wavefront (AMD): Group of 32 threads that execute together
- Streaming Multiprocessor (SM): Contains multiple warp schedulers and execution units
- Thread Hierarchy: Individual threads → Warps → Thread Blocks → Grid
Understanding GPU memory is crucial for performance:
┌─────────────────────────────────────┐
│ Host (CPU) Memory │ ← Slowest, largest
├─────────────────────────────────────┤
│ Global GPU Memory │ ← Large, moderate speed
├─────────────────────────────────────┤
│ Shared Memory │ ← Fast, small, per-block
├─────────────────────────────────────┤
│ Registers │ ← Fastest, smallest, per-thread
└─────────────────────────────────────┘
Memory Types:
- Global Memory: Large but slow, accessible by all threads
- Shared Memory: Fast, on-chip memory shared within a thread block
- Registers: Fastest memory, private to each thread
- Constant Memory: Read-only, cached, good for frequently accessed data
- Texture Memory: Specialized for spatial locality access patterns
CUDA (Compute Unified Device Architecture)
- NVIDIA's proprietary platform
- Mature ecosystem with extensive libraries (cuBLAS, cuDNN, cuFFT)
- C/C++ extensions with device-specific optimizations
- Comprehensive debugging and profiling tools
HIP (Heterogeneous-Compute Interface for Portability)
- AMD's open-source alternative
- Designed for portability between AMD and NVIDIA GPUs
- Can compile to both ROCm (AMD) and CUDA (NVIDIA) backends
- Growing ecosystem with ROCm libraries
- A compatible GPU (NVIDIA or AMD)
- Linux, Windows, or macOS (Linux recommended for production)
- C/C++ compiler (gcc, clang, or MSVC)
- Basic command-line knowledge
Step 1: Install NVIDIA Drivers
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-550
# Check installation
nvidia-smiStep 2: Install CUDA Toolkit
# Download from developer.nvidia.com
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6
# Add to PATH
echo 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrcStep 3: Verify Installation
nvcc --version
nvidia-smiStep 1: Install ROCm
# Ubuntu 22.04/24.04
wget https://repo.radeon.com/amdgpu-install/7.0/ubuntu/jammy/amdgpu-install_7.0.60000-1_all.deb
sudo apt install ./amdgpu-install_7.0.60000-1_all.deb
sudo amdgpu-install --usecase=hiplibsdk,rocm
# Add user to video group
sudo usermod -a -G video $USER
sudo usermod -a -G render $USERStep 2: Verify Installation
hipcc --version
rocminfo
/opt/rocm/bin/rocm-smiRecommended IDEs:
- Visual Studio Code with CUDA/HIP extensions
- CLion with GPU debugging support
- Nsight Eclipse Edition (NVIDIA)
Essential Tools:
nvprof/ncu(NVIDIA profiling)rocprof(AMD profiling)cuda-gdb/rocgdb(debugging)
Create this test script to verify your setup:
#!/bin/bash
echo "=== GPU Programming Environment Check ==="
# Check for NVIDIA
if command -v nvidia-smi &> /dev/null; then
echo "✓ NVIDIA GPU detected"
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
fi
# Check for AMD
if command -v rocm-smi &> /dev/null; then
echo "✓ AMD GPU detected"
rocm-smi --showproductname --showmeminfo vram
fi
# Check compilers
if command -v nvcc &> /dev/null; then
echo "✓ NVCC available: $(nvcc --version | grep release)"
fi
if command -v hipcc &> /dev/null; then
echo "✓ HIPCC available: $(hipcc --version | head -1)"
fiA CUDA program consists of:
- Host code: Runs on CPU, manages GPU
- Device code: Runs on GPU, performs parallel computation
- Memory management: Allocate/transfer data between CPU and GPU
- Kernel launch: Execute parallel function on GPU
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
// CUDA kernel - runs on GPU
__global__ void addVectors(float *a, float *b, float *c, int n) {
// Calculate global thread index
int i = threadIdx.x + blockIdx.x * blockDim.x;
// Boundary check to prevent out-of-bounds access
if (i < n) {
c[i] = a[i] + b[i];
}
}
// Error checking macro
#define CUDA_CHECK(call) \
do { \
cudaError_t error = call; \
if (error != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d - %s\n", __FILE__, __LINE__, \
cudaGetErrorString(error)); \
exit(EXIT_FAILURE); \
} \
} while(0)
int main() {
const int N = 1024; // Vector size
const int bytes = N * sizeof(float);
// Host vectors
float *h_a, *h_b, *h_c;
// Device vectors
float *d_a, *d_b, *d_c;
// Allocate host memory
h_a = (float*)malloc(bytes);
h_b = (float*)malloc(bytes);
h_c = (float*)malloc(bytes);
// Initialize input vectors
for (int i = 0; i < N; i++) {
h_a[i] = sin(i) * sin(i);
h_b[i] = cos(i) * cos(i);
}
// Allocate device memory
CUDA_CHECK(cudaMalloc(&d_a, bytes));
CUDA_CHECK(cudaMalloc(&d_b, bytes));
CUDA_CHECK(cudaMalloc(&d_c, bytes));
// Copy data to device
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));
// Launch kernel: 4 blocks of 256 threads each
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize; // Ceiling division
addVectors<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
// Check for kernel launch errors
CUDA_CHECK(cudaGetLastError());
// Wait for GPU to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Copy result back to host
CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));
// Verify result (first few elements)
printf("Verification (first 5 elements):\n");
for (int i = 0; i < 5; i++) {
printf("%.3f + %.3f = %.3f\n", h_a[i], h_b[i], h_c[i]);
}
// Clean up memory
free(h_a); free(h_b); free(h_c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
printf("Vector addition completed successfully!\n");
return 0;
}1. Kernel Function (__global__)
- Executed on GPU by many threads in parallel
- Cannot return values (use void)
- Accessed from host code
2. Thread Indexing
int i = threadIdx.x + blockIdx.x * blockDim.x;threadIdx.x: Thread ID within block (0-255 for blockSize=256)blockIdx.x: Block ID within gridblockDim.x: Number of threads per block
3. Memory Management
cudaMalloc(): Allocate GPU memorycudaMemcpy(): Transfer data between CPU and GPUcudaFree(): Free GPU memory
4. Kernel Launch Syntax
kernelName<<<gridSize, blockSize>>>(parameters);# Compile
nvcc -o 01_vector_addition_cuda 01_vector_addition_cuda.cu
# Run
./01_vector_addition_cudaExpected output:
Verification (first 5 elements):
0.000 + 1.000 = 1.000
0.708 + 0.293 = 1.000
0.841 + 0.159 = 1.000
0.990 + 0.010 = 1.000
0.654 + 0.346 = 1.000
Vector addition completed successfully!
HIP provides a single-source C++ API that works on both AMD and NVIDIA GPUs. You can write HIP code that compiles for:
- AMD GPUs using ROCm backend
- NVIDIA GPUs using CUDA backend
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
// HIP kernel - runs on GPU (AMD or NVIDIA)
__global__ void addVectors(float *a, float *b, float *c, int n) {
// HIP provides both hipThreadIdx_x and threadIdx.x syntax
int i = hipThreadIdx_x + hipBlockIdx_x * hipBlockDim_x;
// Boundary check
if (i < n) {
c[i] = a[i] + b[i];
}
}
// HIP error checking macro
#define HIP_CHECK(call) \
do { \
hipError_t error = call; \
if (error != hipSuccess) { \
fprintf(stderr, "HIP error at %s:%d - %s\n", __FILE__, __LINE__, \
hipGetErrorString(error)); \
exit(EXIT_FAILURE); \
} \
} while(0)
int main() {
const int N = 1024;
const int bytes = N * sizeof(float);
// Host vectors
float *h_a, *h_b, *h_c;
// Device vectors
float *d_a, *d_b, *d_c;
// Get device information
int device;
hipDeviceProp_t props;
HIP_CHECK(hipGetDevice(&device));
HIP_CHECK(hipGetDeviceProperties(&props, device));
printf("Running on: %s\n", props.name);
// Allocate host memory
h_a = (float*)malloc(bytes);
h_b = (float*)malloc(bytes);
h_c = (float*)malloc(bytes);
// Initialize input vectors
for (int i = 0; i < N; i++) {
h_a[i] = sin(i) * sin(i);
h_b[i] = cos(i) * cos(i);
}
// Allocate device memory
HIP_CHECK(hipMalloc(&d_a, bytes));
HIP_CHECK(hipMalloc(&d_b, bytes));
HIP_CHECK(hipMalloc(&d_c, bytes));
// Copy data to device
HIP_CHECK(hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice));
// Launch configuration
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;
// Method 1: Modern HIP kernel launch (recommended)
addVectors<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
// Alternative Method 2: Legacy HIP launch syntax
// hipLaunchKernelGGL(addVectors, dim3(gridSize), dim3(blockSize), 0, 0, d_a, d_b, d_c, N);
// Check for kernel launch errors
HIP_CHECK(hipGetLastError());
// Wait for GPU to finish
HIP_CHECK(hipDeviceSynchronize());
// Copy result back to host
HIP_CHECK(hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost));
// Verify result
printf("Verification (first 5 elements):\n");
for (int i = 0; i < 5; i++) {
printf("%.3f + %.3f = %.3f\n", h_a[i], h_b[i], h_c[i]);
}
// Clean up memory
free(h_a); free(h_b); free(h_c);
hipFree(d_a); hipFree(d_b); hipFree(d_c);
printf("HIP vector addition completed successfully!\n");
return 0;
}1. Platform Abstraction
- Same source code works on AMD and NVIDIA
- Runtime determines which backend to use
- Automatic optimization for target architecture
2. API Compatibility
// HIP provides both syntaxes:
hipThreadIdx_x // HIP-specific
threadIdx.x // CUDA-compatible3. Memory Management
hipMalloc(): Allocate GPU memoryhipMemcpy(): Transfer datahipFree(): Free GPU memory- Same semantics as CUDA equivalents
For AMD GPUs (ROCm):
hipcc -o 01_vector_addition_hip 01_vector_addition_hip.cpp
./01_vector_addition_hipFor NVIDIA GPUs (CUDA backend):
hipcc --amdgpu-target=none -o 01_vector_addition_hip 01_vector_addition_hip.cpp
./01_vector_addition_hipHIP provides automated conversion tools:
# Convert CUDA file to HIP
hipify-perl 01_vector_addition_cuda.cu > 01_vector_addition_hip.cpp
# Or use the Python version
hipify-python 01_vector_addition_cuda.cu --output 01_vector_addition_hip.cpp#include <hip/hip_runtime.h>
#include <chrono>
class Timer {
private:
std::chrono::high_resolution_clock::time_point start, end;
public:
void startTimer() { start = std::chrono::high_resolution_clock::now(); }
void stopTimer() { end = std::chrono::high_resolution_clock::now(); }
double getElapsedMs() {
return std::chrono::duration<double, std::milli>(end - start).count();
}
};
// Add this to your main function:
Timer timer;
timer.startTimer();
addVectors<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
HIP_CHECK(hipDeviceSynchronize());
timer.stopTimer();
printf("Kernel execution time: %.3f ms\n", timer.getElapsedMs());GPU threads are organized in a three-level hierarchy:
Grid
├── Block 0
│ ├── Thread 0
│ ├── Thread 1
│ ├── ...
│ └── Thread N-1
├── Block 1
│ ├── Thread 0
│ ├── ...
└── Block M-1
Key Concepts:
- Thread: Smallest execution unit, runs one kernel instance
- Block: Group of threads that can cooperate and share memory
- Grid: Collection of blocks executing the same kernel
Understanding thread indexing is crucial for parallel algorithms:
// 1D indexing (most common)
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// 2D indexing (for matrix operations)
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int tid = row * width + col;
// 3D indexing (for volume processing)
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;#include <cuda_runtime.h>
#include <stdio.h>
__global__ void matrixAdd(float *A, float *B, float *C, int width, int height) {
// 2D thread indexing
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
// Boundary check
if (row < height && col < width) {
int index = row * width + col;
C[index] = A[index] + B[index];
}
}
int main() {
const int width = 1024;
const int height = 1024;
const int size = width * height * sizeof(float);
// Allocate host memory
float *h_A = (float*)malloc(size);
float *h_B = (float*)malloc(size);
float *h_C = (float*)malloc(size);
// Initialize matrices
for (int i = 0; i < width * height; i++) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
}
// Allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
// Copy to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Define block and grid sizes
dim3 blockSize(16, 16); // 256 threads per block
dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
(height + blockSize.y - 1) / blockSize.y);
printf("Grid size: (%d, %d)\n", gridSize.x, gridSize.y);
printf("Block size: (%d, %d)\n", blockSize.x, blockSize.y);
printf("Total threads: %d\n", gridSize.x * gridSize.y * blockSize.x * blockSize.y);
// Launch kernel
matrixAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, width, height);
cudaDeviceSynchronize();
// Copy result back
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verify result
bool success = true;
for (int i = 0; i < width * height; i++) {
if (h_C[i] != 3.0f) {
success = false;
break;
}
}
printf("Matrix addition %s\n", success ? "PASSED" : "FAILED");
// Cleanup
free(h_A); free(h_B); free(h_C);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
return 0;
}Block size significantly affects performance. Consider these factors:
Occupancy: How well you utilize GPU resources
// Check occupancy for your kernel
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, matrixAdd, 0, 0);
printf("Suggested block size: %d\n", blockSize);Common Block Sizes:
- 32, 64, 128, 256, 512: Power of 2 sizes work well
- 16x16, 32x32: Good for 2D problems
- Avoid: Very small (<32) or very large (>1024) blocks
Create a kernel to demonstrate thread divergence:
__global__ void divergenceExample(int *data, int n) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < n) {
// This causes thread divergence within a warp
if (tid % 2 == 0) {
// Even threads do this
data[tid] = tid * 2;
} else {
// Odd threads do this
data[tid] = tid * 3;
}
}
}Performance Impact:
- Threads in the same warp that take different paths reduce efficiency
- GPU must execute both paths serially
- Avoid divergence when possible by restructuring algorithms
Efficient memory access is crucial for performance:
// Coalesced access (efficient)
__global__ void coalescedAccess(float *data) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
data[tid] = tid; // Each thread accesses consecutive memory
}
// Strided access (inefficient)
__global__ void stridedAccess(float *data, int stride) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
data[tid * stride] = tid; // Threads access memory with gaps
}- Modify vector addition to use different block sizes (64, 128, 512)
- Create a 2D matrix multiplication kernel
- Implement element-wise operations (sin, cos, exp) on large arrays
- Compare performance with CPU implementations
Error: nvcc not found
# Fix: Add CUDA to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATHError: cuda.h: No such file
# Fix: Install CUDA toolkit properly
sudo apt install nvidia-cuda-toolkit
# Or specify include path
nvcc -I/usr/local/cuda/include program.cuError: undefined reference to cudaMalloc
# Fix: Link CUDA runtime library
nvcc -lcudart program.cu1. "CUDA error: invalid device"
// Check device availability
int deviceCount;
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0) {
printf("No CUDA-capable devices found\n");
return -1;
}
// Set specific device
cudaSetDevice(0);2. "CUDA error: out of memory"
// Check available memory
size_t free_mem, total_mem;
cudaMemGetInfo(&free_mem, &total_mem);
printf("GPU memory: %zu MB free of %zu MB total\n",
free_mem/1024/1024, total_mem/1024/1024);
// Use smaller data sizes or process in chunks
const int MAX_SIZE = free_mem / sizeof(float) / 4; // Reserve 25% buffer3. "CUDA error: invalid configuration"
// Check device limits
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
printf("Max threads per block: %d\n", props.maxThreadsPerBlock);
printf("Max block dimensions: (%d, %d, %d)\n",
props.maxThreadsDim[0], props.maxThreadsDim[1], props.maxThreadsDim[2]);
printf("Max grid dimensions: (%d, %d, %d)\n",
props.maxGridSize[0], props.maxGridSize[1], props.maxGridSize[2]);1. Error Checking Macro
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d - %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Use consistently
CUDA_CHECK(cudaMalloc(&d_ptr, size));
CUDA_CHECK(cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice));2. Kernel Launch Error Checking
kernel<<<grid, block>>>(args);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}
CUDA_CHECK(cudaDeviceSynchronize());3. Debug Prints in Kernels
__global__ void debugKernel(float *data, int n) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Only print from a few threads to avoid spam
if (tid < 5) {
printf("Thread %d: processing data[%d] = %f\n", tid, tid, data[tid]);
}
}1. Memory Bandwidth Optimization
// Use appropriate data types
float4 vec = make_float4(1.0f, 2.0f, 3.0f, 4.0f); // Vectorized load
// Coalesced memory access
__global__ void efficientAccess(float *data, int stride) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Good: consecutive access
data[tid] = tid;
// Bad: strided access
// data[tid * stride] = tid;
}2. Occupancy Optimization
// Check theoretical occupancy
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, kernel, blockSize, 0);
// Get suggested block size
int minGridSize, optimalBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &optimalBlockSize, kernel, 0, 0);
printf("Optimal block size: %d\n", optimalBlockSize);3. Shared Memory Usage
__global__ void sharedMemoryExample(float *data, int n) {
// Allocate shared memory
__shared__ float shared_data[256];
int tid = threadIdx.x;
int gid = threadIdx.x + blockIdx.x * blockDim.x;
// Load data to shared memory
if (gid < n) {
shared_data[tid] = data[gid];
}
__syncthreads(); // Synchronize within block
// Process data in shared memory (faster)
if (gid < n) {
data[gid] = shared_data[tid] * 2.0f;
}
}NVIDIA Tools:
# Legacy profiler (deprecated but still useful)
nvprof ./your_program
# Modern profiler
ncu --metrics gpu__time_duration.avg ./your_program
# System-wide profiling
nsys profile --trace=cuda,nvtx ./your_programAMD ROCm Tools:
# Profile HIP applications
rocprof --hip-trace ./your_program
# Detailed metrics
rocprof --stats ./your_program
# System profiling
roctx-trace ./your_program#include <chrono>
#include <iostream>
class GpuTimer {
cudaEvent_t start, stop;
float elapsedTime;
public:
GpuTimer() {
cudaEventCreate(&start);
cudaEventCreate(&stop);
}
void startTimer() {
cudaEventRecord(start, 0);
}
void stopTimer() {
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
}
float getElapsedMs() { return elapsedTime; }
~GpuTimer() {
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
};
// Usage example
GpuTimer timer;
timer.startTimer();
kernel<<<grid, block>>>(args);
timer.stopTimer();
printf("Kernel execution time: %.3f ms\n", timer.getElapsedMs());✅ Memory Access:
- Use coalesced memory access patterns
- Minimize CPU-GPU data transfers
- Consider using pinned memory for large transfers
✅ Thread Organization:
- Choose appropriate block sizes (multiples of 32)
- Maximize occupancy
- Minimize thread divergence
✅ Algorithm Design:
- Use shared memory for frequently accessed data
- Consider memory bank conflicts
- Implement efficient reduction patterns
✅ Platform-Specific:
- H100: Utilize Tensor Cores for AI workloads
- MI300X: Leverage unified memory architecture
- Use vendor-optimized libraries (cuBLAS, rocBLAS)
- Debug a broken kernel with out-of-bounds access
- Optimize memory bandwidth using vectorized loads
- Profile a kernel and identify bottlenecks
- Implement shared memory optimization for matrix multiplication
- Compare performance across different block sizes
This module covered the fundamentals of GPU programming:
- Architecture: Understanding SIMT execution and memory hierarchy
- Programming Models: CUDA for NVIDIA, HIP for cross-platform development
- Thread Management: Organizing parallel work efficiently
- Memory Management: Allocating and transferring data between CPU and GPU
- Optimization: Basic techniques for improving performance
- Debugging: Tools and techniques for troubleshooting GPU programs
Next Steps:
- Practice with the provided examples
- Experiment with different problem sizes and block configurations
- Learn about advanced memory patterns in Module 2
- Explore GPU-accelerated libraries for your domain