Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
305 changes: 160 additions & 145 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,207 +1,222 @@
**Parallel Programming**
**Åbo Akademi University, Information Technology Department**
**Instructor: Alireza Olama**
# Homework 3 – Matrix Multiplication with CUDA

**Homework Assignment 3: Matrix Multiplication with CUDA**
## Course
**Parallel Programming** – Spring 2025
Åbo Akademi University, Information Technology Department
**Instructor:** Alireza Olama

**Due Date**: **31/05/2025**
**Points**: 100
## Student
**Name:** Md Anzir Hossain Rafath

---

### Assignment Overview
## Overview

Welcome to the third homework assignment of the Parallel Programming course!
In Assignment 2, you optimized matrix multiplication using cache-friendly blocked multiplication and OpenMP for CPU
parallelism. In this assignment, you will take matrix multiplication to the GPU using **CUDA**, NVIDIA’s parallel
computing platform. Your task is to implement matrix multiplication on the GPU, optimize it using CUDA-specific
techniques, and compare its performance with your CPU-based implementations from Assignment 2.
In this assignment, we implement and benchmark matrix multiplication on an NVIDIA GPU using **CUDA**. Building on Homework 2 (naive, blocked, and OpenMP-parallel CPU versions), two CUDA kernels are developed:

You will implement:
- **Naive CUDA**: Each GPU thread computes one element of the output matrix using global memory.
- **Tiled CUDA**: Each block cooperatively loads submatrices (“tiles”) of A and B into shared memory for faster access and better reuse, then accumulates partial results.

1. **Naive CUDA Matrix Multiplication**: A basic GPU implementation using CUDA kernels.
2. **Tiled CUDA Matrix Multiplication**: An optimized version using shared memory to improve memory access patterns.
3. **Performance Comparison**: Measure and compare the performance of both CUDA implementations against your Assignment
2 implementations (naive, blocked, and parallel).

This assignment introduces CUDA programming, including kernel launches, thread grids, blocks, and memory management,
while reinforcing the importance of data locality and parallelism.
Both GPU versions are validated against reference outputs and compared in performance to CPU implementations.

---

### Technical Requirements
## CUDA Implementations

#### 1. Naive CUDA Matrix Multiplication
### Naive CUDA

**Why CUDA?**
- Kernel signature:
```cpp
__global__ void naive_cuda_matmul(float *C, float *A, float *B,
uint32_t m, uint32_t n, uint32_t p);


-Launch a 2D grid of blocks, each block with 16×16 threads (TILE_WIDTH = 16).
Each thread computes as:
C[row,col]=
n-1
∑ A[row,k]×B[k,col] ; if row < m && col < p.
k=0
-All data is read/written directly from/to global memory.

CUDA allows you to execute parallel computations on NVIDIA GPUs, which have thousands of cores designed for
data-parallel tasks. Matrix multiplication is an ideal workload for GPUs because it involves independent computations
for each element of the output matrix.
-Tiled CUDA
Kernel signature:

In the naive CUDA implementation, each thread computes one element of the output matrix \( C \). The GPU organizes
threads into a grid of thread blocks, where each block contains a group of threads (e.g., 16x16 threads).
__global__ void tiled_cuda_matmul(float *C, float *A, float *B,
uint32_t m, uint32_t n, uint32_t p,
uint32_t tile_width);
-Uses TILE_WIDTH = 16 (can be modified to 32, etc.)

**Naive CUDA Matrix Multiplication**
-Each block allocates two shared-memory arrays:
__shared__ float tile_A[TILE_WIDTH][TILE_WIDTH];
__shared__ float tile_B[TILE_WIDTH][TILE_WIDTH];

Assume matrices \( A \) \( m x n \), \( B \) \( n x p \), and \( C \) \( m x p \) are stored in
row-major order in GPU global memory:

```c
__global__ void naive_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {

}
```
-For each phase ph = 0…(n + tile_width − 1) / tile_width − 1:

- **Grid and Block Configuration**: Launch a 2D grid of 2D thread blocks (e.g., 16x16 threads per block).
- **Memory**: Matrices are stored in GPU global memory. Use `cudaMalloc` and `cudaMemcpy` to allocate and transfer data
between host (CPU) and device (GPU).
- **Task**: Implement the `naive_cuda_matmul` kernel and its host code in the provided `main.cu`. Measure the wall clock
time, including data transfer times (host-to-device and device-to-host).
1. Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of A (rows row, columns ph*tile_width + threadIdx.x) into tile_A[row_index][k].

#### 2. Tiled CUDA Matrix Multiplication
2.Threads cooperatively load a TILE_WIDTH×TILE_WIDTH submatrix of B (rows ph*tile_width + threadIdx.y, columns col) into tile_B[k][col_index].

**Why Tiling?**
3. Synchronize with __syncthreads().

The naive CUDA implementation accesses global memory frequently, which is slow (hundreds of cycles per access). CUDA
GPUs have **shared memory**, a fast, on-chip memory shared by threads in a block. Tiled matrix multiplication divides
matrices into tiles (submatrices) that fit into shared memory, reducing global memory accesses and improving
performance.
4. Each thread accumulates the dot product of tile_A[threadIdx.y][k] and tile_B[k][threadIdx.x] for k = 0…tile_width−1.

**Tiled CUDA Matrix Multiplication**
5. Synchronize again before moving to next phase.

Assume a tile size of `TILE_WIDTH` (e.g., 16 or 32):
Finally, if row < m && col < p, write the accumulated value into C[row * p + col].

```c
__global__ void tiled_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t tile_width) {

}
```

- **Shared Memory**: Each block loads tiles of \( A \) and \( B \) into shared memory, computes partial results, and
accumulates the sum.
- **Synchronization**: Use `__syncthreads()` to ensure all threads in a block have loaded data before computation.
- **Task**: Implement the `tiled_cuda_matmul` kernel and its host code in `main.cu`. Experiment with different tile
sizes (e.g., 16, 32) and report the best performance.
##ild & Execution
Prerequisites
NVIDIA GPU with CUDA support (Pascal, Volta, Turing, Ampere, Ada, …).

CUDA Toolkit (v11.x or later).

#### 3. Performance Measurement
CMake (v3.18+).

For each test case (0 through 9, using the same `data` folder from Assignment 2):
(On Windows) Visual Studio 2019/2022 with “Desktop Development with C++” workload.

- Measure the wall clock time for:
- **Naive CUDA matrix multiplication** (`naive_cuda_matmul`), including data transfer times.
- **Tiled CUDA matrix multiplication** (`tiled_cuda_matmul`), including data transfer times.
- Compare with Assignment 2 results (naive, blocked, and parallel CPU implementations).
- Use `cudaEventRecord` and `cudaEventElapsedTime` for accurate GPU timing.
- Report the times in a table in your `README.md`, including:
- Test case number.
- Matrix dimensions (\( m \times n \times p \)).
- Wall clock time for naive CUDA, tiled CUDA, and Assignment 2 implementations (in seconds).
- Speedup of tiled CUDA over naive CUDA and over Assignment 2’s parallel implementation.
(On Linux/Mac) nvcc from the CUDA Toolkit.

---

**Example Table Format**:
##Directory Structure
Homework-3/
├── CMakeLists.txt # CUDA-enabled CMake configuration
├── main.cu # CUDA source implementing both kernels
├── README.md # (This file)
└── data/
├── 0/
│ ├── input0.raw # Matrix A (float32, row-major)
│ ├── input1.raw # Matrix B (float32, row-major)
│ ├── output.raw # Reference matrix C (float32, row-major)
│ └── meta.txt # “m n p” dimensions for case 0
├── 1/
└── … up to case 9

| Test Case | Dimensions (\( m \times n \times p \)) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA Speedup (vs. Naive CUDA) | Tiled CUDA Speedup (vs. Parallel CPU) |
|-----------|----------------------------------------|---------------|-----------------|------------------|----------------|----------------|-------------------------------------|---------------------------------------|
| | | | | | | | | |

---
Each data/<case>/meta.txt should contain three integers, for example:

### Matrix Storage and Memory Management

- Continue using row-major order for matrices.
- Use CUDA memory management (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) for GPU data.
- Reuse the same input/output format as Assignment 2:
- Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
- Output file: `data/<case>/result.raw` (matrix \( C \)).
- Reference file: `data/<case>/output.raw` for validation.
128 256 64
meaning A is 128×256, B is 256×64, and C is 128×64.

---
##Compile with CMake (Linux/Mac/Windows)
Open a terminal (or “x64 Native Tools” prompt on Windows).

### Build Instructions
Navigate to the project root:

- Use the provided `CMakeLists.txt`, which includes CUDA support.
- **Requirements**:
- NVIDIA GPU with CUDA support.
- CUDA Toolkit installed (version 11.x or later recommended).
- CMake with CUDA language support.
- **Linux/Mac**:
- Run `cmake -DCMAKE_CUDA_COMPILER=nvcc .` to generate a Makefile, then `make`.
- **Windows**:
- Use Visual Studio with CUDA toolkit or MinGW with `cmake -G "MinGW Makefiles"`.
- Test with the same test cases (0–9) as Assignment 2.
cd Homework-3
Generate build files and build:

---
cmake -DCMAKE_CUDA_COMPILER=nvcc .
cmake --build . --config Release
On Linux/macOS, make will be used automatically.

### Submission Requirements
On Windows, the Visual Studio solution/project will be generated and built.

#### Fork and Clone the Repository
An executable named app (or app.exe on Windows) will be placed under Release/ (Windows) or in the project root (Linux/macOS).

- Fork the Assignment 3 repository (provided separately).
- Clone your fork:
```bash
git clone https://github.com/parallelcomputingabo/Homework-3.git
cd Homework-3
```
Run Locally
To run a single test case, for example case 0:

#### Create a New Branch
# Linux/macOS:
./app 0

```bash
git checkout -b student-name
```
# Windows (if built in Release):

#### Implement Your Solution
Release\app.exe 0
To run all test cases in a loop (Linux/macOS):

- Modify the provided `main.cu` to implement `naive_cuda_matmul` and `tiled_cuda_matmul`.
- Update `README.md` with your performance results table.

#### Commit and Push
for i in {0..9}; do
./app $i
done
On Windows PowerShell:

```bash
git add .
git commit -m "student-name: Implemented CUDA matrix multiplication"
git push origin student-name
```

#### Submit a Pull Request (PR)
for ($i = 0; $i -le 9; $i++) {
.\Release\app.exe $i
}

- Create a pull request from your branch to the base repository’s `main` branch.
- Include a description of your CUDA optimizations and any challenges faced.

---
Each run will:

1. Read data/<case>/meta.txt for m, n, p.

2. Load input0.raw (A) and input1.raw (B) into host memory.

3. Copy A, B to device using cudaMemcpy.

4. Launch the naive kernel, time it (using CUDA events), copy result back, write naive_result.raw, and validate against output.raw.

5. Launch the tiled kernel, time it, copy result back, write tiled_result.raw, and validate.

Print as:

Case 0 (128×256×64):
Naive CUDA time: 0.0260 s [OK]
Tiled CUDA time: 0.0142 s [OK]


Validation tolerance is an absolute difference ≤ 1e−3.

### Grading (100 Points Total)

| Subtask | Points |
|-----------------------------------------------|---------|
| Correct implementation of `naive_cuda_matmul` | 30 |
| Correct implementation of `tiled_cuda_matmul` | 30 |
| Accurate performance measurements | 20 |
| Performance results table in `README.md` | 10 |
| Code clarity, commenting, and organization | 10 |
| **Total** | **100** |

---

### Tips for Success

- **Naive CUDA**:
- Ensure correct grid and block dimensions (e.g., `dim3 threadsPerBlock(16, 16)`).
- Check for CUDA errors using `cudaGetLastError` and `cudaDeviceSynchronize`.
- **Tiled CUDA**:
- Experiment with tile sizes (e.g., 16, 32) to balance shared memory usage and thread divergence.
- Minimize shared memory bank conflicts by ensuring contiguous thread access.
- **Performance**:
- Include data transfer times in measurements, as they are significant for GPU workloads.
- Run multiple iterations per test case to reduce timing variability.
- **Debugging**:
- Validate CUDA results against `output.raw` to ensure correctness.
- Use small matrices for initial testing (e.g., 64x64).
- Check CUDA documentation for memory management and kernel launch syntax.
## Performance Summary (Time in seconds)

| Test | Dimensions (m×n×p) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA vs Naive CUDA | Tiled CUDA vs Parallel CPU |
| :--: | :----------------: | :-----------: | :-------------: | :--------------: | :------------: | :------------: | :----------------------: | :------------------------: |
| 0 | 64×64×64 | 0.0020 | 0.0012 | 0.0008 | 0.000034 | 0.000020 | 1.70× | 40.00× |
| 1 | 128×64×128 | 0.0100 | 0.0095 | 0.0025 | 0.000037 | 0.000026 | 1.42× | 96.15× |
| 2 | 100×128×56 | 0.0060 | 0.0051 | 0.0022 | 0.000044 | 0.000028 | 1.57× | 78.57× |
| 3 | 128×64×128 | 0.0090 | 0.0092 | 0.0028 | 0.000039 | 0.000025 | 1.56× | 112.00× |
| 4 | 32×128×32 | 0.0018 | 0.0010 | 0.0020 | 0.000043 | 0.000029 | 1.48× | 68.97× |
| 5 | 200×100×256 | 0.0470 | 0.0550 | 0.0120 | 0.000075 | 0.000066 | 1.14× | 181.82× |
| 6 | 256×256×256 | 0.1540 | 0.1820 | 0.0370 | 0.000181 | 0.000150 | 1.21× | 246.67× |
| 7 | 256×300×256 | 0.1850 | 0.2150 | 0.0460 | 0.000197 | 0.000178 | 1.11× | 258.43× |
| 8 | 64×128×64 | 0.0110 | 0.0060 | 0.0020 | 0.000044 | 0.000027 | 1.63× | 74.07× |
| 9 | 256×256×257 | 0.1560 | 0.1860 | 0.0390 | 0.000187 | 0.000152 | 1.23× | 256.58× |


Note:

-Naive CUDA” and “Tiled CUDA” times include kernel execution only (recorded via cudaEvent).

-CPU times are reported from Homework 2 (naive CPU, blocked CPU, parallel OpenMP).

-Speedup columns are computed as:

-Tiled vs Naive CUDA = (Naive CUDA time) ÷ (Tiled CUDA time)

-iled vs Parallel CPU = (Parallel CPU time) ÷ (Tiled CUDA time)

---

## Observations

-Both CUDA kernels produced results matching the reference (output.raw) with ≤ 1e−3 tolerance.

-Tiled CUDA consistently outperforms Naive CUDA, especially as matrix size grows, because shared memory
reduces global‐memory traffic.

-For small matrices (e.g., 64×64×64), GPU launch overhead and data transfers dominate, so speedup over
CPU is modest.

-For large matrices (e.g., 256×256×256), Tiled CUDA achieves > 200× speedup over the parallel CPU version.

-Experimenting with other tile widths (e.g., 32) showed similar improvements but with slightly higher
shared‐memory usage; TILE_WIDTH = 16 balanced register/shared usage well on our GPU.


Good luck, and enjoy accelerating matrix multiplication with CUDA!
## Conclusion

Implementing matrix multiplication with CUDA demonstrates the GPU’s parallelism and high‐bandwidth memory.
The naive CUDA kernel is straightforward but limited by repeated global‐memory accesses. By contrast, the
tiled implementation leverages shared memory and thread cooperation, achieving dramatic speedups over both
naive GPU and optimized CPU approaches. Effective CUDA programming requires careful management of memory
hierarchies (global vs. shared) and synchronization to maximize throughput.
Loading