Module 6: Fundamental Parallel Algorithms - Comprehensive Guide

Environment note: The examples and benchmarks in this module are tested in Docker with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) to ensure reproducibility. Recent algorithm fixes improve performance.

Introduction

Fundamental parallel algorithms form the core building blocks of high-performance GPU computing. These algorithms—convolution, stencil computations, histogram operations, reduction patterns, and prefix sum operations—appear across virtually all domains of parallel computing, from scientific simulation to machine learning, image processing to data analytics.

This module provides comprehensive coverage of these essential algorithms, emphasizing both theoretical understanding and practical, optimized implementations. Each algorithm is presented with multiple variants, from simple educational implementations to production-optimized versions that achieve near-peak performance on modern GPU architectures.

Theoretical Foundations

Parallel Algorithm Design Principles

Work and Depth Complexity:

Work (W): Total number of operations performed
Depth (D): Length of the longest dependency chain
Parallelism: P = W/D represents the maximum theoretical speedup

Key Design Considerations:

Data Locality: Minimize memory traffic through cache-friendly access patterns
Load Balance: Distribute work evenly across processing elements
Communication: Minimize inter-thread communication and synchronization
Scalability: Ensure algorithms scale efficiently with problem size and core count

GPU-Specific Considerations:

Memory Hierarchy: Leverage shared memory, registers, and cache effectively
Thread Divergence: Minimize branching within warps/wavefronts
Occupancy: Balance resource usage to maximize concurrent threads
Memory Coalescing: Ensure efficient memory access patterns

1. Convolution Algorithms

Mathematical Foundation

Convolution is a fundamental operation in signal processing and computer vision:

1D Discrete Convolution:

(f * g)[n] = Σ f[m] * g[n - m]
             m

2D Discrete Convolution:

(f * g)[x,y] = ΣΣ f[i,j] * g[x - i, y - j]
               i j

Algorithm Variants

Direct Convolution

Complexity: O(N * M) for 1D, O(N² * M²) for 2D

Simple implementation with nested loops
Good for small kernels (M < 16)
Memory-bound for large inputs

Separable Convolution

Complexity: O(N² * M) for 2D separable kernels

Decomposes 2D kernel into 1D operations
Reduces complexity from O(N² * M²) to O(N² * M)
Applicable when kernel is separable: K(x,y) = K₁(x) * K₂(y)

FFT-based Convolution

Complexity: O(N log N)

Uses Fast Fourier Transform: F⁻¹(F(f) * F(g))
Efficient for large kernels (M > 64)
Requires complex arithmetic and additional memory

Implementation Strategies

Memory Access Optimization:

Shared Memory Tiling: Load data blocks into shared memory
Halo Region Loading: Handle boundary conditions efficiently
Coalesced Access: Ensure consecutive threads access consecutive memory
Register Reuse: Maximize data reuse within thread registers

Boundary Condition Handling:

Zero Padding: Extend input with zeros
Reflection: Mirror boundary values
Periodic: Wrap around boundaries
Constant: Use fixed boundary value

Performance Analysis

Memory Bandwidth Requirements:

Input bandwidth: N bytes/operation
Output bandwidth: N bytes/operation
Kernel bandwidth: M bytes (reused across operations)
Total: 2N + M bytes per output element

Arithmetic Intensity:

Operations per output: M multiply-add operations
Bytes per output: 2N + M (assuming no reuse)
AI = M / (2N + M) multiply-adds per byte

Optimization Goals:

Maximize shared memory utilization
Minimize global memory transactions
Achieve high arithmetic intensity through data reuse

2. Stencil Computations

Mathematical Foundation

Stencil operations apply a fixed pattern of computation to each point in a grid:

General Form:

u'[i] = Σ c[j] * u[i + j]
        j∈S

Where S is the stencil pattern and c[j] are coefficients.

Common Stencil Patterns:

3-point 1D: [-1, 0, 1] (first derivative approximation)
5-point 2D: Cross pattern for Laplacian operator
7-point 3D: Six neighbors plus center point
27-point 3D: Full neighborhood stencil

Algorithm Design

Memory Access Patterns

Ghost Cell Management:

Extend computational domain with boundary values
Load ghost cells into shared memory for efficient access
Handle different boundary condition types

Temporal Blocking:

Compute multiple time steps within shared memory
Reduces global memory traffic
Requires careful dependency analysis

Optimization Techniques

Thread Coarsening:

// Each thread processes multiple points
for (int i = 0; i < COARSENING_FACTOR; i++) {
    int idx = blockIdx.x * blockDim.x * COARSENING_FACTOR + 
              threadIdx.x + i * blockDim.x;
    if (idx < n) {
        output[idx] = stencil_operation(input, idx);
    }
}

Register Blocking:

Keep frequently accessed values in registers
Shift register values for sliding window operations
Reduces shared memory bank conflicts

Performance Considerations

Memory Bandwidth:

Each point requires S memory accesses
Output requires 1 write operation
Ghost cells reduce effective bandwidth utilization

Arithmetic Intensity:

S multiply-add operations per point
(S + 1) memory operations per point
AI = S / (S + 1) operations per byte

Scalability:

Performance scales with problem size until memory bandwidth limit
Multi-GPU scaling requires inter-GPU communication for ghost cells
Communication-computation overlap critical for large problems

3. Histogram Operations

Problem Formulation

Histogram construction counts occurrences of values within specified bins:

Basic Histogram:

for each input value v:
    bin = compute_bin(v)
    histogram[bin]++

Challenges in Parallel Implementation:

Race conditions in bin updates
Atomic operation contention
Load imbalance across bins
Memory access patterns

Atomic Operations Approach

Direct Atomic Increment:

__global__ void atomic_histogram(int *input, int *histogram, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = compute_bin(input[idx]);
        atomicAdd(&histogram[bin], 1);
    }
}

Performance Characteristics:

Simple implementation
High contention for popular bins
Performance degrades with skewed distributions
Memory bandwidth underutilized due to atomic serialization

Privatization Strategy

Per-Block Private Histograms:

__global__ void privatized_histogram(int *input, int *histogram, int n) {
    extern __shared__ int private_hist[];
    
    // Initialize private histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        private_hist[i] = 0;
    }
    __syncthreads();
    
    // Build private histogram
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = compute_bin(input[idx]);
        atomicAdd(&private_hist[bin], 1);
    }
    __syncthreads();
    
    // Merge to global histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        if (private_hist[i] > 0) {
            atomicAdd(&histogram[i], private_hist[i]);
        }
    }
}

Advantages:

Reduces global memory contention
Better performance for uniform distributions
Scales with number of blocks

Disadvantages:

Requires additional memory (NUM_BINS per block)
Final merge step still requires global atomics
May not fit in shared memory for large bin counts

Coarsening Techniques

Thread Coarsening:

Each thread processes multiple input elements
Builds local count before atomic update
Reduces atomic operation frequency

Warp-Level Aggregation:

Use warp shuffle operations to aggregate within warp
Single atomic operation per warp per bin
Significant reduction in atomic contention

4. Reduction Patterns

Algorithm Variants

Tree Reduction (Hillis-Steele)

Characteristics:

Work: O(n log n)
Depth: O(log n)
Simple implementation
Not work-efficient for large inputs

Implementation Pattern:

__global__ void tree_reduction(float *input, float *output, int n) {
    extern __shared__ float sdata[];
    
    // Load data to shared memory
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();
    
    // Tree reduction in shared memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }
    
    // Write result
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

Optimized Tree Reduction (Brent-Kung)

Characteristics:

Work: O(n)
Depth: O(log n)
Work-efficient implementation
More complex but better scaling

Key Optimizations:

Avoid Thread Divergence: Use contiguous thread access
Bank Conflict Avoidance: Careful shared memory indexing
Loop Unrolling: Reduce loop overhead for small strides

Modern Warp-Level Reduction

Using Cooperative Groups:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ float warp_reduce(float val) {
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cg::this_thread_block());
    
    // Warp-level reduction using shuffle
    for (int offset = tile32.size() / 2; offset > 0; offset /= 2) {
        val += tile32.shfl_down(val, offset);
    }
    return val;
}

Multi-level Reduction

Large Array Handling:

Block-level Reduction: Reduce within each block
Inter-block Reduction: Reduce block results
Recursive Application: For very large arrays

Kernel Launch Strategy:

void multi_level_reduction(float *input, float *output, int n) {
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    float *temp_output;
    
    while (n > 1) {
        cudaMalloc(&temp_output, blocks * sizeof(float));
        reduction_kernel<<<blocks, THREADS_PER_BLOCK>>>(input, temp_output, n);
        
        input = temp_output;
        n = blocks;
        blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    }
    
    cudaMemcpy(output, input, sizeof(float), cudaMemcpyDeviceToHost);
}

5. Prefix Sum Algorithms

Algorithm Theory

Prefix Sum Definition: Given array [a₀, a₁, ..., aₙ₋₁], compute:

Inclusive: [a₀, a₀+a₁, a₀+a₁+a₂, ...]
Exclusive: [0, a₀, a₀+a₁, a₀+a₁+a₂, ...]

Hillis-Steele Algorithm

Characteristics:

Work: O(n log n)
Depth: O(log n)
Step-efficient (minimum depth)
High memory bandwidth requirements

Implementation:

__global__ void hillis_steele_scan(float *input, float *output, int n) {
    extern __shared__ float temp[];
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Load input data
    temp[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();
    
    // Up-sweep phase
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float val = 0.0f;
        if (tid >= stride) {
            val = temp[tid - stride];
        }
        __syncthreads();
        
        if (tid >= stride) {
            temp[tid] += val;
        }
        __syncthreads();
    }
    
    // Write output
    if (idx < n) {
        output[idx] = temp[tid];
    }
}

Blelloch Algorithm (Work-Efficient)

Characteristics:

Work: O(n)
Depth: O(log n)
Work-efficient
Two-phase algorithm (up-sweep + down-sweep)

Up-sweep Phase:

// Build tree bottom-up
for (int stride = 1; stride < n; stride *= 2) {
    for (int i = stride - 1; i < n; i += 2 * stride) {
        temp[i + stride] += temp[i];
    }
}

Down-sweep Phase:

// Set root to identity
temp[n - 1] = 0;

// Propagate down
for (int stride = n / 2; stride > 0; stride /= 2) {
    for (int i = stride - 1; i < n; i += 2 * stride) {
        float t = temp[i];
        temp[i] = temp[i + stride];
        temp[i + stride] += t;
    }
}

Segmented Scan

Problem: Perform independent scans on multiple segments within array

Approach: Use flag array to identify segment boundaries

if (flags[i] == 1) {
    // Start of new segment - don't propagate from previous
    result[i] = input[i];
} else {
    // Continue current segment
    result[i] = input[i] + result[i-1];
}

Applications:

Sparse matrix operations
Variable-length sequence processing
Parallel parsing algorithms

Large Array Scanning

Multi-phase Approach:

Phase 1: Scan within blocks, save block sums
Phase 2: Scan block sums to get per-block offsets
Phase 3: Add block offsets to within-block results

Memory Access Optimization:

Minimize global memory transactions
Maximize shared memory utilization
Overlap computation with memory access

Cross-Platform Optimization Strategies

NVIDIA CUDA Optimizations

Warp-Level Primitives:

__shfl_down() for intra-warp communication
__ballot_sync() for warp voting operations
__activemask() for active thread detection

Cooperative Groups:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

Memory Hierarchy:

L1 cache configuration (prefer shared vs. L1)
Texture memory for read-only data
Constant memory for small, uniform data

AMD ROCm Optimizations

Wavefront-Level Operations:

64-thread wavefronts vs. 32-thread warps
Different shuffle patterns and voting operations
LDS (Local Data Store) optimization

Memory Subsystem:

Different cache hierarchy
Memory channel interleaving
Infinity Fabric for multi-GPU

Platform-Specific Intrinsics:

// AMD-specific wavefront operations
#ifdef __HIP_PLATFORM_AMD__
    result = __shfl_down(value, offset, 64);  // 64-thread wavefront
#else
    result = __shfl_down_sync(0xffffffff, value, offset, 32);  // 32-thread warp
#endif

HIP Portability

Single Source Code:

#include <hip/hip_runtime.h>

// HIP kernel works on both NVIDIA and AMD
__global__ void portable_kernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Portable implementation
        data[idx] = process_data(data[idx]);
    }
}

Runtime Architecture Detection:

hipDeviceProp_t prop;
hipGetDeviceProperties(&prop, 0);

if (prop.major >= 7) {  // Turing or later
    // Use Tensor Cores if available
} else {
    // Fallback implementation
}

Performance Analysis and Optimization

Profiling Methodology

NVIDIA Profiling:

# Compute throughput analysis
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./algorithm

# Memory bandwidth analysis  
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed ./algorithm

# Occupancy analysis
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./algorithm

AMD Profiling:

# Basic performance metrics
rocprof --stats ./algorithm

# Detailed memory analysis
rocprof --hip-trace --stats ./algorithm

# Hardware performance counters
rocprof --hsa-trace --stats ./algorithm

Optimization Guidelines

Memory Optimization:

Coalescing: Ensure consecutive threads access consecutive memory
Bank Conflicts: Avoid shared memory bank conflicts through padding
Cache Utilization: Design access patterns for optimal cache usage
Bandwidth Utilization: Achieve high percentage of peak bandwidth

Compute Optimization:

Occupancy: Balance register usage, shared memory, and thread count
Divergence: Minimize branching within warps/wavefronts
Instruction Mix: Balance arithmetic operations with memory operations
Loop Optimization: Unroll loops to reduce overhead

Algorithm-Specific Optimization:

Data Reuse: Maximize temporal and spatial locality
Work Distribution: Balance load across processing elements
Communication: Minimize inter-thread synchronization
Scalability: Design for multiple GPU scaling

Performance Expectations

Typical Performance Gains:

Convolution: 10-100x over CPU, depending on kernel size
Stencil: 5-50x over CPU, limited by memory bandwidth
Histogram: 2-20x over CPU, varies with distribution
Reduction: 50-500x over CPU for large arrays
Prefix Sum: 10-100x over CPU, algorithm-dependent

Scalability Characteristics:

Problem Size: Performance increases with array size until memory limit
Multi-GPU: Near-linear scaling for embarrassingly parallel algorithms
Architecture: Performance scales with memory bandwidth and compute units

Applications and Use Cases

Scientific Computing

Computational Fluid Dynamics:

Stencil operations for finite difference methods
Reduction operations for convergence checking
Prefix sum for particle sorting and load balancing

Image Processing:

Convolution for filters and feature detection
Histogram for intensity analysis and equalization
Reduction for global statistics computation

Molecular Dynamics:

Reduction for energy and force calculations
Prefix sum for particle neighbor list construction
Histogram for radial distribution functions

Machine Learning

Convolutional Neural Networks:

Optimized convolution for forward and backward passes
Reduction for batch normalization and loss computation
Prefix sum for attention mechanisms

Data Preprocessing:

Histogram for data distribution analysis
Prefix sum for cumulative probability computation
Reduction for normalization statistics

Training Optimization:

All-reduce for gradient aggregation in distributed training
Scan operations for dynamic batching
Histogram for gradient distribution analysis

Data Analytics

Database Operations:

Prefix sum for parallel select and join operations
Reduction for aggregation queries (SUM, COUNT, AVG)
Histogram for data profiling and statistics

Stream Processing:

Sliding window reductions for real-time analytics
Prefix sum for cumulative metrics computation
Histogram for anomaly detection

Graph Analytics:

Reduction for vertex property aggregation
Prefix sum for edge list processing
Histogram for degree distribution analysis

Summary

Module 6 provides comprehensive coverage of fundamental parallel algorithms that form the building blocks of high-performance GPU computing:

Key Algorithmic Concepts:

Work vs. Depth Complexity: Understanding theoretical performance limits
Memory Access Patterns: Designing cache-friendly algorithms
Load Balancing: Distributing work efficiently across GPU cores
Cross-Platform Optimization: Creating portable high-performance code

Practical Skills Developed:

Algorithm Implementation: From naive to highly optimized versions
Performance Analysis: Using profiling tools to identify bottlenecks
Optimization Techniques: Applying GPU-specific optimization strategies
Cross-Platform Development: Creating code that works efficiently on multiple architectures

Real-World Applications:

Scientific Simulation: CFD, molecular dynamics, image processing
Machine Learning: CNN operations, data preprocessing, training optimization
Data Analytics: Database operations, stream processing, graph analytics

These fundamental algorithms appear repeatedly in more complex applications and form the foundation for the advanced topics covered in subsequent modules. Mastering these patterns is essential for developing efficient GPU-accelerated software across all domains of parallel computing.

The combination of theoretical understanding, practical implementation experience, and cross-platform optimization skills developed in this module provides the foundation for tackling increasingly complex parallel computing challenges in scientific, industrial, and research applications.

Uh oh!

FilesExpand file tree

content.md

Latest commit

History

content.md

File metadata and controls

Module 6: Fundamental Parallel Algorithms - Comprehensive Guide

Introduction

Theoretical Foundations

Parallel Algorithm Design Principles

1. Convolution Algorithms

Mathematical Foundation

Algorithm Variants

Direct Convolution

Separable Convolution

FFT-based Convolution

Implementation Strategies

Performance Analysis

2. Stencil Computations

Mathematical Foundation

Algorithm Design

Memory Access Patterns

Optimization Techniques

Performance Considerations

3. Histogram Operations

Problem Formulation

Atomic Operations Approach

Privatization Strategy

Coarsening Techniques

4. Reduction Patterns

Algorithm Variants

Tree Reduction (Hillis-Steele)

Optimized Tree Reduction (Brent-Kung)

Modern Warp-Level Reduction

Multi-level Reduction

5. Prefix Sum Algorithms

Algorithm Theory

Hillis-Steele Algorithm

Blelloch Algorithm (Work-Efficient)

Segmented Scan

Large Array Scanning

Cross-Platform Optimization Strategies

NVIDIA CUDA Optimizations

AMD ROCm Optimizations

HIP Portability

Performance Analysis and Optimization

Profiling Methodology

Optimization Guidelines

Performance Expectations

Applications and Use Cases

Scientific Computing

Machine Learning

Data Analytics

Summary