parallelcomputingabo · MiaNguyen10 · May 26, 2025
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,6 @@
+{
+    "files.associations": {
+        "*.ejs": "html",
+        "vector": "cpp"
+    }
+}
diff --git a/README.md b/README.md
@@ -9,199 +9,42 @@
 
 ---
 
-### Assignment Overview
+## Performance Measurement
 
-Welcome to the third homework assignment of the Parallel Programming course!
-In Assignment 2, you optimized matrix multiplication using cache-friendly blocked multiplication and OpenMP for CPU
-parallelism. In this assignment, you will take matrix multiplication to the GPU using **CUDA**, NVIDIA’s parallel
-computing platform. Your task is to implement matrix multiplication on the GPU, optimize it using CUDA-specific
-techniques, and compare its performance with your CPU-based implementations from Assignment 2.
-
-You will implement:
-
-1. **Naive CUDA Matrix Multiplication**: A basic GPU implementation using CUDA kernels.
-2. **Tiled CUDA Matrix Multiplication**: An optimized version using shared memory to improve memory access patterns.
-3. **Performance Comparison**: Measure and compare the performance of both CUDA implementations against your Assignment
-   2 implementations (naive, blocked, and parallel).
-
-This assignment introduces CUDA programming, including kernel launches, thread grids, blocks, and memory management,
-while reinforcing the importance of data locality and parallelism.
-
----
-
-### Technical Requirements
-
-#### 1. Naive CUDA Matrix Multiplication
-
-**Why CUDA?**
-
-CUDA allows you to execute parallel computations on NVIDIA GPUs, which have thousands of cores designed for
-data-parallel tasks. Matrix multiplication is an ideal workload for GPUs because it involves independent computations
-for each element of the output matrix.
-
-In the naive CUDA implementation, each thread computes one element of the output matrix \( C \). The GPU organizes
-threads into a grid of thread blocks, where each block contains a group of threads (e.g., 16x16 threads).
-
-**Naive CUDA Matrix Multiplication**
-
-Assume matrices \( A \) \( m x n \), \( B \) \( n x p \), and \( C \) \( m x p \) are stored in
-row-major order in GPU global memory:
-
-```c
-__global__ void naive_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
-
-}
-```
-
-- **Grid and Block Configuration**: Launch a 2D grid of 2D thread blocks (e.g., 16x16 threads per block).
-- **Memory**: Matrices are stored in GPU global memory. Use `cudaMalloc` and `cudaMemcpy` to allocate and transfer data
-  between host (CPU) and device (GPU).
-- **Task**: Implement the `naive_cuda_matmul` kernel and its host code in the provided `main.cu`. Measure the wall clock
-  time, including data transfer times (host-to-device and device-to-host).
-
-#### 2. Tiled CUDA Matrix Multiplication
-
-**Why Tiling?**
-
-The naive CUDA implementation accesses global memory frequently, which is slow (hundreds of cycles per access). CUDA
-GPUs have **shared memory**, a fast, on-chip memory shared by threads in a block. Tiled matrix multiplication divides
-matrices into tiles (submatrices) that fit into shared memory, reducing global memory accesses and improving
-performance.
-
-**Tiled CUDA Matrix Multiplication**
-
-Assume a tile size of `TILE_WIDTH` (e.g., 16 or 32):
-
-```c
-__global__ void tiled_cuda_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t tile_width) {
-
-}
-```
-
-- **Shared Memory**: Each block loads tiles of \( A \) and \( B \) into shared memory, computes partial results, and
-  accumulates the sum.
-- **Synchronization**: Use `__syncthreads()` to ensure all threads in a block have loaded data before computation.
-- **Task**: Implement the `tiled_cuda_matmul` kernel and its host code in `main.cu`. Experiment with different tile
-  sizes (e.g., 16, 32) and report the best performance.
-
-#### 3. Performance Measurement
-
-For each test case (0 through 9, using the same `data` folder from Assignment 2):
-
-- Measure the wall clock time for:
-    - **Naive CUDA matrix multiplication** (`naive_cuda_matmul`), including data transfer times.
-    - **Tiled CUDA matrix multiplication** (`tiled_cuda_matmul`), including data transfer times.
-- Compare with Assignment 2 results (naive, blocked, and parallel CPU implementations).
-- Use `cudaEventRecord` and `cudaEventElapsedTime` for accurate GPU timing.
-- Report the times in a table in your `README.md`, including:
-    - Test case number.
-    - Matrix dimensions (\( m \times n \times p \)).
-    - Wall clock time for naive CUDA, tiled CUDA, and Assignment 2 implementations (in seconds).
-    - Speedup of tiled CUDA over naive CUDA and over Assignment 2’s parallel implementation.
-
-**Example Table Format**:
+### Block size = 32, OMP_NUM_THREADS = 2, Tile size = 16
 
 | Test Case | Dimensions (\( m \times n \times p \)) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA Speedup (vs. Naive CUDA) | Tiled CUDA Speedup (vs. Parallel CPU) |
 |-----------|----------------------------------------|---------------|-----------------|------------------|----------------|----------------|-------------------------------------|---------------------------------------|
-|         |                         |      |           |             |          |          |                               |                                 |
-
----
-
-### Matrix Storage and Memory Management
-
-- Continue using row-major order for matrices.
-- Use CUDA memory management (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) for GPU data.
-- Reuse the same input/output format as Assignment 2:
-    - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
-    - Output file: `data/<case>/result.raw` (matrix \( C \)).
-    - Reference file: `data/<case>/output.raw` for validation.
+| 0         | 64 x 64 x 64           | 0.00100017     | 0.00100017       | 0.00200009        |  0.0227328      |  0.0161792       |  1.40506x  | 0.12362x |
+| 1         | 128 x 64 x 128         | 0.00200009     | 0.00199986       | 0.00199986        |  0.0280576      |  0.0188416       |  1.48913x  | 0.10614x |
+| 2         | 100 x 128 x 56         | 0.00199986     | 0.00100017       | 0.00100017        |  0.0241664      |  0.0155648       |  1.55263x  | 0.06426x |
+| 3         | 128 x 64 x 128         | 0.00300002     | 0.00300002       | 0.00100017        |  0.0274432      |  0.0198656       |  1.38144x  | 0.05035x |
+| 4         | 32 x 128 x 32          | 0              | 0                | 0.00200009        |  0.0233472      |  0.0149504       |  1.56164x  | 0.13378x |
+| 5         | 200 x 100 x 256        | 0.0109999      | 0.0149999        | 0.00199986        |  0.0589632      |  0.0393984       |  1.49659x  | 0.05076x |
+| 6         | 256 x 256 x 256        | 0.0380001      | 0.04             | 0.00600004        |  0.149901       |  0.0890048       |  1.68419x  | 0.06741x |
+| 7         | 256 x 300 x 256        | 0.0470002      | 0.0500002        | 0.00900006        |  0.17111        |  0.104224        |  1.64176x  | 0.08635x |
+| 8         | 64 x 128 x 64          | 0.00100017     | 0.000999928      | 0.00299978        |  0.0278528      |  0.0182272       |  1.52809x  | 0.16458x |
+| 9         | 256 x 256 x 257        | 0.0350001      | 0.0410001        | 0.00699997        |  0.157043       |  0.0924416       |  1.69884x  | 0.07572x |
 
 ---
 
-### Build Instructions
-
-- Use the provided `CMakeLists.txt`, which includes CUDA support.
-- **Requirements**:
-    - NVIDIA GPU with CUDA support.
-    - CUDA Toolkit installed (version 11.x or later recommended).
-    - CMake with CUDA language support.
-- **Linux/Mac**:
-    - Run `cmake -DCMAKE_CUDA_COMPILER=nvcc .` to generate a Makefile, then `make`.
-- **Windows**:
-    - Use Visual Studio with CUDA toolkit or MinGW with `cmake -G "MinGW Makefiles"`.
-- Test with the same test cases (0–9) as Assignment 2.
-
----
-
-### Submission Requirements
-
-#### Fork and Clone the Repository
-
-- Fork the Assignment 3 repository (provided separately).
-- Clone your fork:
-  ```bash
-  git clone https://github.com/parallelcomputingabo/Homework-3.git
-  cd Homework-3
-  ```
-
-#### Create a New Branch
-
-```bash
-git checkout -b student-name
-```
-
-#### Implement Your Solution
-
-- Modify the provided `main.cu` to implement `naive_cuda_matmul` and `tiled_cuda_matmul`.
-- Update `README.md` with your performance results table.
-
-#### Commit and Push
-
-```bash
-git add .
-git commit -m "student-name: Implemented CUDA matrix multiplication"
-git push origin student-name
-```
-
-#### Submit a Pull Request (PR)
-
-- Create a pull request from your branch to the base repository’s `main` branch.
-- Include a description of your CUDA optimizations and any challenges faced.
-
----
-
-### Grading (100 Points Total)
-
-| Subtask                                       | Points  |
-|-----------------------------------------------|---------|
-| Correct implementation of `naive_cuda_matmul` | 30      |
-| Correct implementation of `tiled_cuda_matmul` | 30      |
-| Accurate performance measurements             | 20      |
-| Performance results table in `README.md`      | 10      |
-| Code clarity, commenting, and organization    | 10      |
-| **Total**                                     | **100** |
-
----
-
-### Tips for Success
-
-- **Naive CUDA**:
-    - Ensure correct grid and block dimensions (e.g., `dim3 threadsPerBlock(16, 16)`).
-    - Check for CUDA errors using `cudaGetLastError` and `cudaDeviceSynchronize`.
-- **Tiled CUDA**:
-    - Experiment with tile sizes (e.g., 16, 32) to balance shared memory usage and thread divergence.
-    - Minimize shared memory bank conflicts by ensuring contiguous thread access.
-- **Performance**:
-    - Include data transfer times in measurements, as they are significant for GPU workloads.
-    - Run multiple iterations per test case to reduce timing variability.
-- **Debugging**:
-    - Validate CUDA results against `output.raw` to ensure correctness.
-    - Use small matrices for initial testing (e.g., 64x64).
-    - Check CUDA documentation for memory management and kernel launch syntax.
-
----
-
-
-
-Good luck, and enjoy accelerating matrix multiplication with CUDA!
+### Block size = 32, OMP_NUM_THREADS = 2, Tile size = 32
 
+| Test Case | Dimensions (\( m \times n \times p \)) | Naive CPU (s) | Blocked CPU (s) | Parallel CPU (s) | Naive CUDA (s) | Tiled CUDA (s) | Tiled CUDA Speedup (vs. Naive CUDA) | Tiled CUDA Speedup (vs. Parallel CPU) |
+|-----------|----------------------------------------|---------------|-----------------|------------------|----------------|----------------|-------------------------------------|---------------------------------------|
+| 0         | 64 x 64 x 64           | 0.00100017     | 0.00100017       | 0.00200009        |  0.024576       |  0.0190464       |  1.29032x  | 0.10501x |
+| 1         | 128 x 64 x 128         | 0.00200009     | 0.00199986       | 0.00199986        |  0.031744       |  0.0202752       |  1.56566x  | 0.09866x |
+| 2         | 100 x 128 x 56         | 0.00199986     | 0.00100017       | 0.00100017        |  0.0311296      |  0.0210944       |  1.47573x  | 0.04741x |
+| 3         | 128 x 64 x 128         | 0.00300002     | 0.00300002       | 0.00100017        |  0.0268288      |  0.0186368       |  1.43956x  | 0.05367x |
+| 4         | 32 x 128 x 32          | 0              | 0                | 0.00200009        |  0.0311296      |  0.0202752       |  1.53535x  | 0.09865x |
+| 5         | 200 x 100 x 256        | 0.0109999      | 0.0149999        | 0.00199986        |  0.0488832      |  0.035296        |  1.38495x  | 0.05666x |
+| 6         | 256 x 256 x 256        | 0.0380001      | 0.04             | 0.00600004        |  0.148512       |  0.0826432       |  1.79703x  | 0.0726x  |
+| 7         | 256 x 300 x 256        | 0.0470002      | 0.0500002        | 0.00900006        |  0.134701       |  0.078944        |  1.70628x  | 0.114x   |
+| 8         | 64 x 128 x 64          | 0.00100017     | 0.000999928      | 0.00299978        |  0.0309248      |  0.0206848       |  1.49505x  | 0.14502x |
+| 9         | 256 x 256 x 257        | 0.0350001      | 0.0410001        | 0.00699997        |  0.173805       |  0.0956608       |  1.81689x  | 0.07317x |
+
+### Conclusion
+
+- Tiled CUDA is faster than Naive CUDA in every test case at both tile sizes, with speedups ranging from about 1.3x to 1.8x.
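+- Both CUDA versions are slower than every CPU implementation in all test cases; Tiled CUDA reaches only about 0.05x to 0.16x of the Parallel CPU speed. Parallel CPU is the fastest CPU version in cases 3, 5, 6, 7, and 9, but the slowest in the small cases 0, 4, and 8.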
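+- Neither tile size wins consistently: tile size 32 gives a higher speedup over Naive CUDA in five of the ten cases (1, 3, 6, 7, 9), including the three largest matrices, while tile size 16 is better in the other five.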
diff --git a/build/.cmake/api/v1/query/client-vscode/query.json b/build/.cmake/api/v1/query/client-vscode/query.json
@@ -0,0 +1 @@
+{"requests":[{"kind":"cache","version":2},{"kind":"codemodel","version":2},{"kind":"toolchains","version":1},{"kind":"cmakeFiles","version":1}]}