pjt222 / bare-metal

Hand-tuned NVIDIA SASS kernels for RTX 3070 Ti (GA104, sm_86): 31,910 GFLOPS HGEMM, 41,721 dense-equiv 2:4 sparse, 11,453 GFLOPS Flash Attention, no cuBLAS / cuDNN / PyTorch. 6-chapter tutorial + Chladni-pattern memory layout study.

Topics: sass, performance-engineering, cuda, nvidia, high-performance-computing, gpu-computing, ampere, machine-code, gemm, cymatics, gpu-programming, tensor-cores, int8-quantization, sparse-gemm, hgemm, flash-attention, kernel-optimization, sm-86, rtx-3070-ti, cuassembler

Updated May 9, 2026 · Cuda