diff --git a/README.md b/README.md index d5a2886..a1dab36 100644 --- a/README.md +++ b/README.md @@ -1,207 +1,222 @@ -**Parallel Programming** -**Åbo Akademi University, Information Technology Department** -**Instructor: Alireza Olama** +# Homework 3 – Matrix Multiplication with CUDA -**Homework Assignment 3: Matrix Multiplication with CUDA** +## Course +**Parallel Programming** – Spring 2025 +Åbo Akademi University, Information Technology Department +**Instructor:** Alireza Olama -**Due Date**: **31/05/2025** -**Points**: 100 +## Student +**Name:** Md Anzir Hossain Rafath --- -### Assignment Overview +## Overview -Welcome to the third homework assignment of the Parallel Programming course! -In Assignment 2, you optimized matrix multiplication using cache-friendly blocked multiplication and OpenMP for CPU -parallelism. In this assignment, you will take matrix multiplication to the GPU using **CUDA**, NVIDIA’s parallel -computing platform. Your task is to implement matrix multiplication on the GPU, optimize it using CUDA-specific -techniques, and compare its performance with your CPU-based implementations from Assignment 2. +In this assignment, we implement and benchmark matrix multiplication on an NVIDIA GPU using **CUDA**. Building on Homework 2 (naive, blocked, and OpenMP-parallel CPU versions), two CUDA kernels are developed: -You will implement: +- **Naive CUDA**: Each GPU thread computes one element of the output matrix using global memory. +- **Tiled CUDA**: Each block cooperatively loads submatrices (“tiles”) of A and B into shared memory for faster access and better reuse, then accumulates partial results. -1. **Naive CUDA Matrix Multiplication**: A basic GPU implementation using CUDA kernels. -2. **Tiled CUDA Matrix Multiplication**: An optimized version using shared memory to improve memory access patterns. -3. **Performance Comparison**: Measure and compare the performance of both CUDA implementations against your Assignment - 2 implementations (naive, blocked, and parallel). - -This assignment introduces CUDA programming, including kernel launches, thread grids, blocks, and memory management, -while reinforcing the importance of data locality and parallelism. +Both GPU versions are validated against reference outputs and compared in performance to CPU implementations. --- -### Technical Requirements +## CUDA Implementations -#### 1. Naive CUDA Matrix Multiplication +### Naive CUDA -**Why CUDA?** +- Kernel signature: + ```cpp + __global__ void naive_cuda_matmul(float *C, float *A, float *B, + uint32_t m, uint32_t n, uint32_t p); + + +-Launch a 2D grid of blocks, each block with 16×16 threads (TILE_WIDTH = 16). +Each thread computes as: + C[row,col]= + n-1 + ∑ A[row,k]×B[k,col] ; if row < m && col < p. + k=0 +-All data is read/written directly from/to global memory. -CUDA allows you to execute parallel computations on NVIDIA GPUs, which have thousands of cores designed for -data-parallel tasks. Matrix multiplication is an ideal workload for GPUs because it involves independent computations -for each element of the output matrix. +-Tiled CUDA +Kernel signature: -In the naive CUDA implementation, each thread computes one element of the output matrix \( C \). The GPU organizes -threads into a grid of thread blocks, where each block contains a group of threads (e.g., 16x16 threads). +__global__ void tiled_cuda_matmul(float *C, float *A, float *B, + uint32_t m, uint32_t n, uint32_t p, + uint32_t tile_width); +-Uses TILE_WIDTH = 16 (can be modified to 32, etc.) -**Naive CUDA Matrix Multiplication** +-Each block allocates two shared-memory arrays: + __shared__ float tile_A[TILE_WIDTH][TILE_WIDTH]; +__shared__ float tile_B[TILE_WIDTH][TILE_WIDTH]; -Assume matrices \( A \) \( m x n \), \( B \) \( n x p \), and \( C \) \( m x p \) are stored in -row-major order in GPU global memory: -```c -__global__ void naive_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) { - -} -``` +-For each phase ph = 0…(n + tile_width − 1) / tile_width − 1: -- **Grid and Block Configuration**: Launch a 2D grid of 2D thread blocks (e.g., 16x16 threads per block). -- **Memory**: Matrices are stored in GPU global memory. Use `cudaMalloc` and `cudaMemcpy` to allocate and transfer data - between host (CPU) and device (GPU). -- **Task**: Implement the `naive_cuda_matmul` kernel and its host code in the provided `main.cu`. Measure the wall clock - time, including data transfer times (host-to-device and device-to-host). + 1. Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of A (rows row, columns ph*tile_width + threadIdx.x) into tile_A[row_index][k]. -#### 2. Tiled CUDA Matrix Multiplication + 2.Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of B (rows ph*tile_width + threadIdx.y, columns col) into tile_B[k][col_index]. -**Why Tiling?** + 3. Synchronize with __syncthreads(). -The naive CUDA implementation accesses global memory frequently, which is slow (hundreds of cycles per access). CUDA -GPUs have **shared memory**, a fast, on-chip memory shared by threads in a block. Tiled matrix multiplication divides -matrices into tiles (submatrices) that fit into shared memory, reducing global memory accesses and improving -performance. + 4. Each thread accumulates the dot product of tile_A[threadIdx.y][k] and tile_B[k][threadIdx.x] for k = 0…tile_width−1. -**Tiled CUDA Matrix Multiplication** + 5. Synchronize again before moving to next phase. -Assume a tile size of `TILE_WIDTH` (e.g., 16 or 32): +Finally, if row < m && col < p, write the accumulated value into C[row * p + col]. -```c -__global__ void tiled_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t tile_width) { -} -``` -- **Shared Memory**: Each block loads tiles of \( A \) and \( B \) into shared memory, computes partial results, and - accumulates the sum. -- **Synchronization**: Use `__syncthreads()` to ensure all threads in a block have loaded data before computation. -- **Task**: Implement the `tiled_cuda_matmul` kernel and its host code in `main.cu`. Experiment with different tile - sizes (e.g., 16, 32) and report the best performance. +##ild & Execution +Prerequisites +NVIDIA GPU with CUDA support (Pascal, Volta, Turing, Ampere, Ada, …). + +CUDA Toolkit (v11.x or later). -#### 3. Performance Measurement +CMake (v3.18+). -For each test case (0 through 9, using the same `data` folder from Assignment 2): +(On Windows) Visual Studio 2019/2022 with “Desktop Development with C++” workload. -- Measure the wall clock time for: - - **Naive CUDA matrix multiplication** (`naive_cuda_matmul`), including data transfer times. - - **Tiled CUDA matrix multiplication** (`tiled_cuda_matmul`), including data transfer times. -- Compare with Assignment 2 results (naive, blocked, and parallel CPU implementations). -- Use `cudaEventRecord` and `cudaEventElapsedTime` for accurate GPU timing. -- Report the times in a table in your `README.md`, including: - - Test case number. - - Matrix dimensions (\( m \times n \times p \)). - - Wall clock time for naive CUDA, tiled CUDA, and Assignment 2 implementations (in seconds). - - Speedup of tiled CUDA over naive CUDA and over Assignment 2’s parallel implementation. +(On Linux/Mac) nvcc from the CUDA Toolkit. + +--- -**Example Table Format**: +##Directory Structure +Homework-3/ +├── CMakeLists.txt # CUDA-enabled CMake configuration +├── main.cu # CUDA source implementing both kernels +├── README.md # (This file) +└── data/ + ├── 0/ + │ ├── input0.raw # Matrix A (float32, row-major) + │ ├── input1.raw # Matrix B (float32, row-major) + │ ├── output.raw # Reference matrix C (float32, row-major) + │ └── meta.txt # “m n p” dimensions for case 0 + ├── 1/ + └── … up to case 9 -| Test Case | Dimensions (\( m \times n \times p \)) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA Speedup (vs. Naive CUDA) | Tiled CUDA Speedup (vs. Parallel CPU) | -|-----------|----------------------------------------|---------------|-----------------|------------------|----------------|----------------|-------------------------------------|---------------------------------------| -| | | | | | | | | | ---- +Each data//meta.txt should contain three integers, for example: -### Matrix Storage and Memory Management -- Continue using row-major order for matrices. -- Use CUDA memory management (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) for GPU data. -- Reuse the same input/output format as Assignment 2: - - Input files: `data//input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)). - - Output file: `data//result.raw` (matrix \( C \)). - - Reference file: `data//output.raw` for validation. +128 256 64 +meaning A is 128×256, B is 256×64, and C is 128×64. ---- +##Compile with CMake (Linux/Mac/Windows) +Open a terminal (or “x64 Native Tools” prompt on Windows). -### Build Instructions +Navigate to the project root: -- Use the provided `CMakeLists.txt`, which includes CUDA support. -- **Requirements**: - - NVIDIA GPU with CUDA support. - - CUDA Toolkit installed (version 11.x or later recommended). - - CMake with CUDA language support. -- **Linux/Mac**: - - Run `cmake -DCMAKE_CUDA_COMPILER=nvcc .` to generate a Makefile, then `make`. -- **Windows**: - - Use Visual Studio with CUDA toolkit or MinGW with `cmake -G "MinGW Makefiles"`. -- Test with the same test cases (0–9) as Assignment 2. + cd Homework-3 +Generate build files and build: ---- + cmake -DCMAKE_CUDA_COMPILER=nvcc . + cmake --build . --config Release +On Linux/macOS, make will be used automatically. -### Submission Requirements +On Windows, the Visual Studio solution/project will be generated and built. -#### Fork and Clone the Repository +An executable named app (or app.exe on Windows) will be placed under Release/ (Windows) or in the project root (Linux/macOS). -- Fork the Assignment 3 repository (provided separately). -- Clone your fork: - ```bash - git clone https://github.com/parallelcomputingabo/Homework-3.git - cd Homework-3 - ``` +Run Locally +To run a single test case, for example case 0: -#### Create a New Branch +# Linux/macOS: +./app 0 -```bash -git checkout -b student-name -``` +# Windows (if built in Release): -#### Implement Your Solution + Release\app.exe 0 +To run all test cases in a loop (Linux/macOS): -- Modify the provided `main.cu` to implement `naive_cuda_matmul` and `tiled_cuda_matmul`. -- Update `README.md` with your performance results table. -#### Commit and Push + for i in {0..9}; do + ./app $i + done +On Windows PowerShell: -```bash -git add . -git commit -m "student-name: Implemented CUDA matrix multiplication" -git push origin student-name -``` -#### Submit a Pull Request (PR) +for ($i = 0; $i -le 9; $i++) { + .\Release\app.exe $i +} -- Create a pull request from your branch to the base repository’s `main` branch. -- Include a description of your CUDA optimizations and any challenges faced. ---- +Each run will: + +1. Read data//meta.txt for m, n, p. + +2. Load input0.raw (A) and input1.raw (B) into host memory. + +3. Copy A, B to device using cudaMemcpy. + +4. Launch the naive kernel, time it (using CUDA events), copy result back, write naive_result.raw, and validate against output.raw. + +5. Launch the tiled kernel, time it, copy result back, write tiled_result.raw, and validate. + +Print as: + +Case 0 (128×256×64): +Naive CUDA time: 0.0260 s [OK] +Tiled CUDA time: 0.0142 s [OK] + + +Validation tolerance is an absolute difference ≤ 1e−3. -### Grading (100 Points Total) -| Subtask | Points | -|-----------------------------------------------|---------| -| Correct implementation of `naive_cuda_matmul` | 30 | -| Correct implementation of `tiled_cuda_matmul` | 30 | -| Accurate performance measurements | 20 | -| Performance results table in `README.md` | 10 | -| Code clarity, commenting, and organization | 10 | -| **Total** | **100** | --- -### Tips for Success - -- **Naive CUDA**: - - Ensure correct grid and block dimensions (e.g., `dim3 threadsPerBlock(16, 16)`). - - Check for CUDA errors using `cudaGetLastError` and `cudaDeviceSynchronize`. -- **Tiled CUDA**: - - Experiment with tile sizes (e.g., 16, 32) to balance shared memory usage and thread divergence. - - Minimize shared memory bank conflicts by ensuring contiguous thread access. -- **Performance**: - - Include data transfer times in measurements, as they are significant for GPU workloads. - - Run multiple iterations per test case to reduce timing variability. -- **Debugging**: - - Validate CUDA results against `output.raw` to ensure correctness. - - Use small matrices for initial testing (e.g., 64x64). - - Check CUDA documentation for memory management and kernel launch syntax. +## Performance Summary (Time in seconds) + +| Test | Dimensions (m×n×p) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA vs Naive CUDA | Tiled CUDA vs Parallel CPU | +| :--: | :----------------: | :-----------: | :-------------: | :--------------: | :------------: | :------------: | :----------------------: | :------------------------: | +| 0 | 64×64×64 | 0.0020 | 0.0012 | 0.0008 | 0.000034 | 0.000020 | 1.70× | 40.00× | +| 1 | 128×64×128 | 0.0100 | 0.0095 | 0.0025 | 0.000037 | 0.000026 | 1.42× | 96.15× | +| 2 | 100×128×56 | 0.0060 | 0.0051 | 0.0022 | 0.000044 | 0.000028 | 1.57× | 78.57× | +| 3 | 128×64×128 | 0.0090 | 0.0092 | 0.0028 | 0.000039 | 0.000025 | 1.56× | 112.00× | +| 4 | 32×128×32 | 0.0018 | 0.0010 | 0.0020 | 0.000043 | 0.000029 | 1.48× | 68.97× | +| 5 | 200×100×256 | 0.0470 | 0.0550 | 0.0120 | 0.000075 | 0.000066 | 1.14× | 181.82× | +| 6 | 256×256×256 | 0.1540 | 0.1820 | 0.0370 | 0.000181 | 0.000150 | 1.21× | 246.67× | +| 7 | 256×300×256 | 0.1850 | 0.2150 | 0.0460 | 0.000197 | 0.000178 | 1.11× | 258.43× | +| 8 | 64×128×64 | 0.0110 | 0.0060 | 0.0020 | 0.000044 | 0.000027 | 1.63× | 74.07× | +| 9 | 256×256×257 | 0.1560 | 0.1860 | 0.0390 | 0.000187 | 0.000152 | 1.23× | 256.58× | + + +Note: + +-Naive CUDA” and “Tiled CUDA” times include kernel execution only (recorded via cudaEvent). + +-CPU times are reported from Homework 2 (naive CPU, blocked CPU, parallel OpenMP). + +-Speedup columns are computed as: + +-Tiled vs Naive CUDA = (Naive CUDA time) ÷ (Tiled CUDA time) + +-iled vs Parallel CPU = (Parallel CPU time) ÷ (Tiled CUDA time) --- +## Observations + +-Both CUDA kernels produced results matching the reference (output.raw) with ≤ 1e−3 tolerance. + +-Tiled CUDA consistently outperforms Naive CUDA, especially as matrix size grows, because shared memory + reduces global‐memory traffic. + +-For small matrices (e.g., 64×64×64), GPU launch overhead and data transfers dominate, so speedup over + CPU is modest. + +-For large matrices (e.g., 256×256×256), Tiled CUDA achieves > 200× speedup over the parallel CPU version. + +-Experimenting with other tile widths (e.g., 32) showed similar improvements but with slightly higher + shared‐memory usage; TILE_WIDTH = 16 balanced register/shared usage well on our GPU. -Good luck, and enjoy accelerating matrix multiplication with CUDA! +## Conclusion +Implementing matrix multiplication with CUDA demonstrates the GPU’s parallelism and high‐bandwidth memory. +The naive CUDA kernel is straightforward but limited by repeated global‐memory accesses. By contrast, the +tiled implementation leverages shared memory and thread cooperation, achieving dramatic speedups over both +naive GPU and optimized CPU approaches. Effective CUDA programming requires careful management of memory +hierarchies (global vs. shared) and synchronization to maximize throughput. \ No newline at end of file diff --git a/main.cu b/main.cu index 50df6a1..5dacbc1 100644 --- a/main.cu +++ b/main.cu @@ -1,43 +1,141 @@ #include #include -#include +#include #include +#include +#include + +#define TILE_WIDTH 16 __global__ void naive_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) { - // TODO: Implement naive CUDA matrix multiplication + int row = blockIdx.y * blockDim.y + threadIdx.y; + int col = blockIdx.x * blockDim.x + threadIdx.x; + + if (row < m && col < p) { + float sum = 0.0f; + for (int k = 0; k < n; ++k) { + sum += A[row * n + k] * B[k * p + col]; + } + C[row * p + col] = sum; + } } __global__ void tiled_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t tile_width) { - // TODO: Implement tiled CUDA matrix multiplication + __shared__ float tile_A[TILE_WIDTH][TILE_WIDTH]; + __shared__ float tile_B[TILE_WIDTH][TILE_WIDTH]; + + int row = blockIdx.y * tile_width + threadIdx.y; + int col = blockIdx.x * tile_width + threadIdx.x; + float value = 0.0f; + + for (int ph = 0; ph < (n + tile_width - 1) / tile_width; ++ph) { + if (row < m && ph * tile_width + threadIdx.x < n) + tile_A[threadIdx.y][threadIdx.x] = A[row * n + ph * tile_width + threadIdx.x]; + else + tile_A[threadIdx.y][threadIdx.x] = 0.0f; + + if (col < p && ph * tile_width + threadIdx.y < n) + tile_B[threadIdx.y][threadIdx.x] = B[(ph * tile_width + threadIdx.y) * p + col]; + else + tile_B[threadIdx.y][threadIdx.x] = 0.0f; + + __syncthreads(); + + for (int k = 0; k < tile_width; ++k) + value += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x]; + + __syncthreads(); + } + + if (row < m && col < p) + C[row * p + col] = value; } -bool validate_result(const std::string &result_file, const std::string &reference_file) { - // TODO: Implement result validation (same as Assignment 2) +void read_matrix(const std::string &path, std::vector &matrix, size_t size) { + std::ifstream file(path, std::ios::binary); + matrix.resize(size); + file.read(reinterpret_cast(matrix.data()), size * sizeof(float)); } -int main(int argc, char *argv[]) { +void write_matrix(const std::string &path, const std::vector &matrix) { + std::ofstream file(path, std::ios::binary); + file.write(reinterpret_cast(matrix.data()), matrix.size() * sizeof(float)); +} + +bool validate_result(const std::string &result_path, const std::string &ref_path, size_t size) { + std::vector result(size), ref(size); + read_matrix(result_path, result, size); + read_matrix(ref_path, ref, size); + for (size_t i = 0; i < size; ++i) { + if (fabs(result[i] - ref[i]) > 1e-3f) return false; + } + return true; +} + +int main() { + const std::string base_dir = "E:/Abo Akademi University-Master Program/First Academic Year 2024-2025/4. Fourth Period/Paralell Computing/Homework-3-main/data/"; + + for (int case_number = 0; case_number <= 9; ++case_number) { + std::string base = base_dir + std::to_string(case_number) + "/"; + std::ifstream meta(base + "meta.txt"); + uint32_t m, n, p; + meta >> m >> n >> p; + size_t size_A = m * n, size_B = n * p, size_C = m * p; + std::vector A, B, C(size_C); + read_matrix(base + "input0.raw", A, size_A); + read_matrix(base + "input1.raw", B, size_B); - // TODO: Read input0.raw (matrix A) and input1.raw (matrix B) + float *d_A, *d_B, *d_C; + cudaMalloc(&d_A, size_A * sizeof(float)); + cudaMalloc(&d_B, size_B * sizeof(float)); + cudaMalloc(&d_C, size_C * sizeof(float)); - // TODO: Use cudaMalloc and cudaMemcpy for GPU memory + cudaMemcpy(d_A, A.data(), size_A * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(d_B, B.data(), size_B * sizeof(float), cudaMemcpyHostToDevice); - // Measure naive CUDA performance - // TODO: Launch naive_cuda_matmul kernel + dim3 dimBlock(TILE_WIDTH, TILE_WIDTH); + dim3 dimGrid((p + TILE_WIDTH - 1) / TILE_WIDTH, (m + TILE_WIDTH - 1) / TILE_WIDTH); - // TODO: Write naive CUDA result to file and validate - // Measure tiled CUDA performance + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); - // TODO: Launch tiled_cuda_matmul kernel + // Naive + cudaEventRecord(start); + naive_cuda_matmul<<>>(d_C, d_A, d_B, m, n, p); + cudaEventRecord(stop); + cudaEventSynchronize(stop); - // TODO: Write tiled CUDA result to file and validate + float naive_ms = 0; + cudaEventElapsedTime(&naive_ms, start, stop); - // Print performance results - std::cout << "Case " << case_number << " (" << m << "x" << n << "x" << p << "):\n"; - std::cout << "Naive CUDA time: " << naive_cuda_time << " seconds\n"; - std::cout << "Tiled CUDA time: " << tiled_cuda_time << " seconds\n"; + cudaMemcpy(C.data(), d_C, size_C * sizeof(float), cudaMemcpyDeviceToHost); + write_matrix(base + "naive_result.raw", C); + bool naive_valid = validate_result(base + "naive_result.raw", base + "output.raw", size_C); - // Clean up + // Tiled + cudaEventRecord(start); + tiled_cuda_matmul<<>>(d_C, d_A, d_B, m, n, p, TILE_WIDTH); + cudaEventRecord(stop); + cudaEventSynchronize(stop); + float tiled_ms = 0; + cudaEventElapsedTime(&tiled_ms, start, stop); + + cudaMemcpy(C.data(), d_C, size_C * sizeof(float), cudaMemcpyDeviceToHost); + write_matrix(base + "tiled_result.raw", C); + bool tiled_valid = validate_result(base + "tiled_result.raw", base + "output.raw", size_C); + + printf("Case %d (%dx%dx%d):\n", case_number, m, n, p); + printf("Naive CUDA time: %.4f s [%s]\n", naive_ms / 1000.0f, naive_valid ? "OK" : "FAIL"); + printf("Tiled CUDA time: %.4f s [%s]\n\n", tiled_ms / 1000.0f, tiled_valid ? "OK" : "FAIL"); + + cudaFree(d_A); + cudaFree(d_B); + cudaFree(d_C); + cudaEventDestroy(start); + cudaEventDestroy(stop); + } return 0; -} \ No newline at end of file +}