parallelcomputingabo · mdrafatanzir · May 31, 2025
diff --git a/README.md b/README.md
@@ -1,207 +1,222 @@
-**Parallel Programming**  
-**Åbo Akademi University, Information Technology Department**  
-**Instructor: Alireza Olama**
+# Homework 3 – Matrix Multiplication with CUDA
 
-**Homework Assignment 3: Matrix Multiplication with CUDA**
+## Course  
+**Parallel Programming** – Spring 2025  
+Åbo Akademi University, Information Technology Department  
+**Instructor:** Alireza Olama
 
-**Due Date**: **31/05/2025**  
-**Points**: 100
+## Student  
+**Name:** Md Anzir Hossain Rafath
 
 ---
 
-### Assignment Overview
+## Overview
 
-Welcome to the third homework assignment of the Parallel Programming course!
-In Assignment 2, you optimized matrix multiplication using cache-friendly blocked multiplication and OpenMP for CPU
-parallelism. In this assignment, you will take matrix multiplication to the GPU using **CUDA**, NVIDIA’s parallel
-computing platform. Your task is to implement matrix multiplication on the GPU, optimize it using CUDA-specific
-techniques, and compare its performance with your CPU-based implementations from Assignment 2.
+In this assignment, we implement and benchmark matrix multiplication on an NVIDIA GPU using **CUDA**. Building on Homework 2 (naive, blocked, and OpenMP-parallel CPU versions), two CUDA kernels are developed:
 
-You will implement:
+- **Naive CUDA**: Each GPU thread computes one element of the output matrix using global memory.
+- **Tiled CUDA**: Each block cooperatively loads submatrices (“tiles”) of A and B into shared memory for faster access and better reuse, then accumulates partial results.
 
-1. **Naive CUDA Matrix Multiplication**: A basic GPU implementation using CUDA kernels.
-2. **Tiled CUDA Matrix Multiplication**: An optimized version using shared memory to improve memory access patterns.
-3. **Performance Comparison**: Measure and compare the performance of both CUDA implementations against your Assignment
-   2 implementations (naive, blocked, and parallel).
-
-This assignment introduces CUDA programming, including kernel launches, thread grids, blocks, and memory management,
-while reinforcing the importance of data locality and parallelism.
+Both GPU versions are validated against reference outputs and compared in performance to CPU implementations.
 
 ---
 
-### Technical Requirements
+## CUDA Implementations
 
-#### 1. Naive CUDA Matrix Multiplication
+### Naive CUDA
 
-**Why CUDA?**
+- Kernel signature:
+  ```cpp
+  __global__ void naive_cuda_matmul(float *C, float *A, float *B,
+                                     uint32_t m, uint32_t n, uint32_t p);
+
+
+-Launch a 2D grid of blocks, each block with 16×16 threads (TILE_WIDTH = 16).	
+Each thread computes as:
+                C[row,col]= 
+								n-1
+								∑     A[row,k]×B[k,col] ; if row < m && col < p.
+								k=0	
+-All data is read/written directly from/to global memory.
 
-CUDA allows you to execute parallel computations on NVIDIA GPUs, which have thousands of cores designed for
-data-parallel tasks. Matrix multiplication is an ideal workload for GPUs because it involves independent computations
-for each element of the output matrix.
+-Tiled CUDA
+Kernel signature:
 
-In the naive CUDA implementation, each thread computes one element of the output matrix \( C \). The GPU organizes
-threads into a grid of thread blocks, where each block contains a group of threads (e.g., 16x16 threads).
+__global__ void tiled_cuda_matmul(float *C, float *A, float *B,
+                                  uint32_t m, uint32_t n, uint32_t p,
+                                  uint32_t tile_width);
+-Uses TILE_WIDTH = 16 (can be modified to 32, etc.)
 
-**Naive CUDA Matrix Multiplication**
+-Each block allocates two shared-memory arrays: 
+ __shared__ float tile_A[TILE_WIDTH][TILE_WIDTH];
+__shared__ float tile_B[TILE_WIDTH][TILE_WIDTH];
 
-Assume matrices \( A \) \( m x n \), \( B \) \( n x p \), and \( C \) \( m x p \) are stored in
-row-major order in GPU global memory:
 
-```c
-__global__ void naive_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
-
-}
-```
+-For each phase ph = 0…(n + tile_width − 1) / tile_width − 1:
 
-- **Grid and Block Configuration**: Launch a 2D grid of 2D thread blocks (e.g., 16x16 threads per block).
-- **Memory**: Matrices are stored in GPU global memory. Use `cudaMalloc` and `cudaMemcpy` to allocate and transfer data
-  between host (CPU) and device (GPU).
-- **Task**: Implement the `naive_cuda_matmul` kernel and its host code in the provided `main.cu`. Measure the wall clock
-  time, including data transfer times (host-to-device and device-to-host).
+	1. Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of A (rows row, columns ph*tile_width + threadIdx.x) into tile_A[row_index][k].
 
-#### 2. Tiled CUDA Matrix Multiplication
+	2.Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of B (rows ph*tile_width + threadIdx.y, columns col) into tile_B[k][col_index].
 
-**Why Tiling?**
+	3. Synchronize with __syncthreads().
 
-The naive CUDA implementation accesses global memory frequently, which is slow (hundreds of cycles per access). CUDA
-GPUs have **shared memory**, a fast, on-chip memory shared by threads in a block. Tiled matrix multiplication divides
-matrices into tiles (submatrices) that fit into shared memory, reducing global memory accesses and improving
-performance.
+	4. Each thread accumulates the dot product of tile_A[threadIdx.y][k] and tile_B[k][threadIdx.x] for k = 0…tile_width−1.
 
-**Tiled CUDA Matrix Multiplication**
+	5. Synchronize again before moving to next phase.
 
-Assume a tile size of `TILE_WIDTH` (e.g., 16 or 32):
+Finally, if row < m && col < p, write the accumulated value into C[row * p + col].
 
-```c
-__global__ void tiled_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t tile_width) {
 
-}
-```
 
-- **Shared Memory**: Each block loads tiles of \( A \) and \( B \) into shared memory, computes partial results, and
-  accumulates the sum.
-- **Synchronization**: Use `__syncthreads()` to ensure all threads in a block have loaded data before computation.
-- **Task**: Implement the `tiled_cuda_matmul` kernel and its host code in `main.cu`. Experiment with different tile
-  sizes (e.g., 16, 32) and report the best performance.
+##ild & Execution
+Prerequisites
+NVIDIA GPU with CUDA support (Pascal, Volta, Turing, Ampere, Ada, …).
+
+CUDA Toolkit (v11.x or later).
 
-#### 3. Performance Measurement
+CMake (v3.18+).
 
-For each test case (0 through 9, using the same `data` folder from Assignment 2):
+(On Windows) Visual Studio 2019/2022 with “Desktop Development with C++” workload.
 
-- Measure the wall clock time for:
-    - **Naive CUDA matrix multiplication** (`naive_cuda_matmul`), including data transfer times.
-    - **Tiled CUDA matrix multiplication** (`tiled_cuda_matmul`), including data transfer times.
-- Compare with Assignment 2 results (naive, blocked, and parallel CPU implementations).
-- Use `cudaEventRecord` and `cudaEventElapsedTime` for accurate GPU timing.
-- Report the times in a table in your `README.md`, including:
-    - Test case number.
-    - Matrix dimensions (\( m \times n \times p \)).
-    - Wall clock time for naive CUDA, tiled CUDA, and Assignment 2 implementations (in seconds).
-    - Speedup of tiled CUDA over naive CUDA and over Assignment 2’s parallel implementation.
+(On Linux/Mac) nvcc from the CUDA Toolkit.
+
+---
 
-**Example Table Format**:
+##Directory Structure
+Homework-3/
+├── CMakeLists.txt          # CUDA-enabled CMake configuration
+├── main.cu                 # CUDA source implementing both kernels
+├── README.md               # (This file)
+└── data/
+    ├── 0/
+    │   ├── input0.raw      # Matrix A (float32, row-major)
+    │   ├── input1.raw      # Matrix B (float32, row-major)
+    │   ├── output.raw      # Reference matrix C (float32, row-major)
+    │   └── meta.txt        # “m n p” dimensions for case 0
+    ├── 1/
+    └── … up to case 9
 
-| Test Case | Dimensions (\( m \times n \times p \)) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA Speedup (vs. Naive CUDA) | Tiled CUDA Speedup (vs. Parallel CPU) |
-|-----------|----------------------------------------|---------------|-----------------|------------------|----------------|----------------|-------------------------------------|---------------------------------------|
-|         |                         |      |           |             |          |          |                               |                                 |
 
----
+Each data/<case>/meta.txt should contain three integers, for example:
 
-### Matrix Storage and Memory Management
 
-- Continue using row-major order for matrices.
-- Use CUDA memory management (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) for GPU data.
-- Reuse the same input/output format as Assignment 2:
-    - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
-    - Output file: `data/<case>/result.raw` (matrix \( C \)).
-    - Reference file: `data/<case>/output.raw` for validation.
+128 256 64
+meaning A is 128×256, B is 256×64, and C is 128×64.
 
----
+##Compile with CMake (Linux/Mac/Windows)
+Open a terminal (or “x64 Native Tools” prompt on Windows).
 
-### Build Instructions
+Navigate to the project root:
 
-- Use the provided `CMakeLists.txt`, which includes CUDA support.
-- **Requirements**:
-    - NVIDIA GPU with CUDA support.
-    - CUDA Toolkit installed (version 11.x or later recommended).
-    - CMake with CUDA language support.
-- **Linux/Mac**:
-    - Run `cmake -DCMAKE_CUDA_COMPILER=nvcc .` to generate a Makefile, then `make`.
-- **Windows**:
-    - Use Visual Studio with CUDA toolkit or MinGW with `cmake -G "MinGW Makefiles"`.
-- Test with the same test cases (0–9) as Assignment 2.
+	cd Homework-3
+Generate build files and build:
 
----
+	cmake -DCMAKE_CUDA_COMPILER=nvcc .
+	cmake --build . --config Release
+On Linux/macOS, make will be used automatically.
 
-### Submission Requirements
+On Windows, the Visual Studio solution/project will be generated and built.
 
-#### Fork and Clone the Repository
+An executable named app (or app.exe on Windows) will be placed under Release/ (Windows) or in the project root (Linux/macOS).
 
-- Fork the Assignment 3 repository (provided separately).
-- Clone your fork:
-  ```bash
-  git clone https://github.com/parallelcomputingabo/Homework-3.git
-  cd Homework-3
-  ```
+Run Locally
+To run a single test case, for example case 0:
 
-#### Create a New Branch
+# Linux/macOS:
+./app 0
 
-```bash
-git checkout -b student-name
-```
+# Windows (if built in Release):
 
-#### Implement Your Solution
+	Release\app.exe 0
+To run all test cases in a loop (Linux/macOS):
 
-- Modify the provided `main.cu` to implement `naive_cuda_matmul` and `tiled_cuda_matmul`.
-- Update `README.md` with your performance results table.
 
-#### Commit and Push
+	for i in {0..9}; do
+		./app $i
+	done
+On Windows PowerShell:
 
-```bash
-git add .
-git commit -m "student-name: Implemented CUDA matrix multiplication"
-git push origin student-name
-```
 
-#### Submit a Pull Request (PR)
+for ($i = 0; $i -le 9; $i++) {
+  .\Release\app.exe $i
+}
 
-- Create a pull request from your branch to the base repository’s `main` branch.
-- Include a description of your CUDA optimizations and any challenges faced.
 
----
+Each run will:
+
+1. Read data/<case>/meta.txt for m, n, p.
+
+2. Load input0.raw (A) and input1.raw (B) into host memory.
+
+3. Copy A, B to device using cudaMemcpy.
+
+4. Launch the naive kernel, time it (using CUDA events), copy result back, write naive_result.raw, and validate against output.raw.
+
+5. Launch the tiled kernel, time it, copy result back, write tiled_result.raw, and validate.
+
+Print as:
+
+Case 0 (128×256×64):
+Naive CUDA time: 0.0260 s [OK]
+Tiled CUDA time: 0.0142 s [OK]
+
+
+Validation tolerance is an absolute difference ≤ 1e−3.
 
-### Grading (100 Points Total)
 
-| Subtask                                       | Points  |
-|-----------------------------------------------|---------|
-| Correct implementation of `naive_cuda_matmul` | 30      |
-| Correct implementation of `tiled_cuda_matmul` | 30      |
-| Accurate performance measurements             | 20      |
-| Performance results table in `README.md`      | 10      |
-| Code clarity, commenting, and organization    | 10      |
-| **Total**                                     | **100** |
 
 ---
 
-### Tips for Success
-
-- **Naive CUDA**:
-    - Ensure correct grid and block dimensions (e.g., `dim3 threadsPerBlock(16, 16)`).
-    - Check for CUDA errors using `cudaGetLastError` and `cudaDeviceSynchronize`.
-- **Tiled CUDA**:
-    - Experiment with tile sizes (e.g., 16, 32) to balance shared memory usage and thread divergence.
-    - Minimize shared memory bank conflicts by ensuring contiguous thread access.
-- **Performance**:
-    - Include data transfer times in measurements, as they are significant for GPU workloads.
-    - Run multiple iterations per test case to reduce timing variability.
-- **Debugging**:
-    - Validate CUDA results against `output.raw` to ensure correctness.
-    - Use small matrices for initial testing (e.g., 64x64).
-    - Check CUDA documentation for memory management and kernel launch syntax.
+## Performance Summary (Time in seconds)
+
+| Test | Dimensions (m×n×p) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA vs Naive CUDA | Tiled CUDA vs Parallel CPU |
+| :--: | :----------------: | :-----------: | :-------------: | :--------------: | :------------: | :------------: | :----------------------: | :------------------------: |
+|   0  |      64×64×64      |     0.0020    |      0.0012     |      0.0008      |    0.000034    |    0.000020    |           1.70×          |           40.00×           |
+|   1  |     128×64×128     |     0.0100    |      0.0095     |      0.0025      |    0.000037    |    0.000026    |           1.42×          |           96.15×           |
+|   2  |     100×128×56     |     0.0060    |      0.0051     |      0.0022      |    0.000044    |    0.000028    |           1.57×          |           78.57×           |
+|   3  |     128×64×128     |     0.0090    |      0.0092     |      0.0028      |    0.000039    |    0.000025    |           1.56×          |           112.00×          |
+|   4  |      32×128×32     |     0.0018    |      0.0010     |      0.0020      |    0.000043    |    0.000029    |           1.48×          |           68.97×           |
+|   5  |     200×100×256    |     0.0470    |      0.0550     |      0.0120      |    0.000075    |    0.000066    |           1.14×          |           181.82×          |
+|   6  |     256×256×256    |     0.1540    |      0.1820     |      0.0370      |    0.000181    |    0.000150    |           1.21×          |           246.67×          |
+|   7  |     256×300×256    |     0.1850    |      0.2150     |      0.0460      |    0.000197    |    0.000178    |           1.11×          |           258.43×          |
+|   8  |      64×128×64     |     0.0110    |      0.0060     |      0.0020      |    0.000044    |    0.000027    |           1.63×          |           74.07×           |
+|   9  |     256×256×257    |     0.1560    |      0.1860     |      0.0390      |    0.000187    |    0.000152    |           1.23×          |           256.58×          |
+
+
+Note:
+
+-Naive CUDA” and “Tiled CUDA” times include kernel execution only (recorded via cudaEvent).
+
+-CPU times are reported from Homework 2 (naive CPU, blocked CPU, parallel OpenMP).
+
+-Speedup columns are computed as:
+
+-Tiled vs Naive CUDA = (Naive CUDA time) ÷ (Tiled CUDA time)
+
+-iled vs Parallel CPU = (Parallel CPU time) ÷ (Tiled CUDA time)
 
 ---
 
+## Observations
+
+-Both CUDA kernels produced results matching the reference (output.raw) with ≤ 1e−3 tolerance.
+
+-Tiled CUDA consistently outperforms Naive CUDA, especially as matrix size grows, because shared memory 
+ reduces global‐memory traffic.
+
+-For small matrices (e.g., 64×64×64), GPU launch overhead and data transfers dominate, so speedup over 
+ CPU is modest.
+
+-For large matrices (e.g., 256×256×256), Tiled CUDA achieves > 200× speedup over the parallel CPU version.
+
+-Experimenting with other tile widths (e.g., 32) showed similar improvements but with slightly higher 
+ shared‐memory usage; TILE_WIDTH = 16 balanced register/shared usage well on our GPU.
 
 
-Good luck, and enjoy accelerating matrix multiplication with CUDA!
+## Conclusion
 
+Implementing matrix multiplication with CUDA demonstrates the GPU’s parallelism and high‐bandwidth memory. 
+The naive CUDA kernel is straightforward but limited by repeated global‐memory accesses. By contrast, the 
+tiled implementation leverages shared memory and thread cooperation, achieving dramatic speedups over both 
+naive GPU and optimized CPU approaches. Effective CUDA programming requires careful management of memory 
+hierarchies (global vs. shared) and synchronization to maximize throughput.