CUDA kernel engineering is one of the scarcest skills in ML infrastructure. This project implements
the core kernels of transformer inference from scratch — tiled attention with online softmax,
fused activations, rotary embeddings, and paged KV-cache — in both CUDA C++ and Triton.
Every kernel is profiled with NVIDIA Nsight Compute and benchmarked against PyTorch eager,
torch.compile, and cuBLAS baselines on a T4 GPU.
Stack: CUDA C++ · Triton · Nsight Compute · PyTorch C++ Extensions · CMake · Docker
Kernel inventory
| Kernel | CUDA C++ | Triton | Key technique |
|---|---|---|---|
| Tiled FlashAttention | ✓ | ✓ | Online softmax, shared memory tiling |
| Fused GeLU + Linear | ✓ | ✓ | Eliminates HBM round-trip |
| RoPE Embedding | ✓ | ✓ | Precomputed sin/cos, fused with attention |
| Paged KV-Cache | ✓ | ✓ | Block-level virtual memory, page table |
| Parallel Reduction | ✓ | ✓ | Warp-level shuffle + shared memory tree |
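To make the table concrete, here is a minimal sketch of the online-softmax update at the heart of the tiled attention kernel: as each key/value tile is processed, the running row maximum, softmax denominator, and output accumulator are rescaled so the full score row never has to be materialized. All names and the scalar loop structure are illustrative, not the project's actual kernel code.

```cuda
// Minimal sketch of the online-softmax update inside a tiled attention loop.
// Names are illustrative; the real kernel works on shared-memory tiles with
// vectorized accumulators.
__device__ void online_softmax_update(
    const float* scores,   // attention scores for one query row vs. one K tile
    const float* v_tile,   // value tile, [tile_k x head_dim], row-major
    int tile_k,            // number of keys in this tile
    int head_dim,
    float& row_max,        // running max of scores seen so far (init -INFINITY)
    float& row_sum,        // running softmax denominator (init 0)
    float* acc)            // running weighted sum of V rows, [head_dim] (init 0)
{
    // 1. New running maximum over this tile.
    float tile_max = row_max;
    for (int j = 0; j < tile_k; ++j)
        tile_max = fmaxf(tile_max, scores[j]);

    // 2. Rescale the previous accumulator and denominator to the new maximum.
    float scale = __expf(row_max - tile_max);
    row_sum *= scale;
    for (int d = 0; d < head_dim; ++d)
        acc[d] *= scale;

    // 3. Accumulate this tile's contribution.
    for (int j = 0; j < tile_k; ++j) {
        float p = __expf(scores[j] - tile_max);
        row_sum += p;
        for (int d = 0; d < head_dim; ++d)
            acc[d] += p * v_tile[j * head_dim + d];
    }
    row_max = tile_max;
    // After the last tile, the caller divides acc[d] by row_sum.
}
```

A similarly simplified sketch of the RoPE row applies precomputed cos/sin tables as a pairwise rotation; the even/odd pairing shown is one common layout, and the project's kernel may use the half-split variant instead.

```cuda
// Apply a rotary embedding to one head vector in place, using cos/sin tables
// precomputed for this token position. Names are illustrative.
__device__ void apply_rope(float* x,            // [head_dim], modified in place
                           const float* cos_t,  // [head_dim / 2]
                           const float* sin_t,  // [head_dim / 2]
                           int head_dim) {
    for (int i = 0; i < head_dim / 2; ++i) {
        float x0 = x[2 * i];
        float x1 = x[2 * i + 1];
        x[2 * i]     = x0 * cos_t[i] - x1 * sin_t[i];  // rotate each (even, odd) pair
        x[2 * i + 1] = x0 * sin_t[i] + x1 * cos_t[i];
    }
}
```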
Technical approach
- Memory hierarchy mastery: Shared memory tiling sized for the T4's 48 KB per-block shared memory limit, bank conflict avoidance, register pressure tuning for SM 7.5.
- Warp-level programming: `__shfl_down_sync` reductions, cooperative groups for cross-warp synchronization (see the sketches after this list).
- Kernel fusion: GeLU activation computed in-register between the matmul and the HBM write, eliminating one full memory round-trip.
- Profiling-driven: Every optimization guided by Nsight Compute metrics — occupancy, memory throughput, warp stall analysis, roofline position.
- End-to-end integration: Kernels plugged into GPT-2 (124M) via PyTorch C++ extensions, measuring real tokens/sec improvement.
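The sketches below illustrate two of the bullets above. First, a block-level sum reduction that combines `__shfl_down_sync` within each warp with a shared-memory tree across warps; the launch details and the final `atomicAdd` across blocks are illustrative assumptions, not the project's exact kernels.

```cuda
// Block-level sum reduction: warp shuffles within each warp, then a
// shared-memory tree across warps. Sketch only.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];              // one partial sum per warp
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (tid < n) ? in[tid] : 0.0f;

    v = warp_reduce_sum(v);                      // intra-warp reduction
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    if (warp == 0) {                             // first warp reduces the partials
        int num_warps = (blockDim.x + 31) >> 5;
        v = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);        // combine across blocks (out init to 0)
    }
}
```

Second, the kernel-fusion idea: a GeLU helper that a matmul epilogue can apply to the in-register accumulator immediately before the single global-memory write. The tanh approximation is an assumption; an erf-based GeLU fuses the same way.

```cuda
// GeLU (tanh approximation) applied in-register, so the matmul result is
// activated before its one and only write to HBM.
__device__ __forceinline__ float gelu_tanh(float x) {
    const float k = 0.7978845608f;               // sqrt(2 / pi)
    return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
}

// Inside the matmul epilogue, with acc held in a register:
//     out[row * ldc + col] = gelu_tanh(acc);    // single HBM write, no round-trip
```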
Benchmarks
All measurements on NVIDIA T4 16 GB, CUDA 12.x, PyTorch 2.x. Averaged over 100 warmup + 1000 timed iterations.
| Kernel (seq=2048) | PyTorch Eager | torch.compile | Triton (ours) | CUDA C++ (ours) |
|---|---|---|---|---|
| FlashAttention | — | — | — | — |
| Fused GeLU+Linear | — | — | — | — |
| RoPE | — | — | — | — |
| Paged KV-Cache | — | — | — | — |
Results will be populated from real benchmark runs; Nsight Compute profiles are committed to the repository.
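The warmup-plus-timed-iterations protocol above can be reproduced with a small CUDA-event harness along these lines; `launch_kernel` is a hypothetical stand-in for whichever kernel configuration is under test, not a function from this repository.

```cuda
#include <cuda_runtime.h>

// Average per-iteration kernel time in milliseconds, following the
// "100 warmup + 1000 timed iterations" protocol used for the benchmarks.
float time_kernel_ms(void (*launch_kernel)(), int warmup = 100, int iters = 1000) {
    for (int i = 0; i < warmup; ++i) launch_kernel();   // warm caches, clocks, JIT
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch_kernel();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;                            // average per iteration
}
```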
Architecture diagram
```mermaid
flowchart TB
    Input["Input tokens"] --> Embed["Token + RoPE embedding<br/>Custom CUDA kernel"]
    Embed --> Attn["Tiled FlashAttention<br/>Shared memory, online softmax"]
    Attn --> KV["Paged KV-Cache<br/>Block-level virtual memory"]
    KV --> Fused["Fused GeLU + Linear<br/>Single kernel, no HBM roundtrip"]
    Fused --> Norm["LayerNorm + Residual"]
    Norm --> Next["Next layer / Output"]
    style Embed fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Attn fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style KV fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Fused fill:#eff6ff,stroke:#2563eb,color:#0f172a
```
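For the paged KV-cache stage in the diagram, the core mechanism is a per-sequence page table that maps logical cache blocks to physical blocks in one preallocated pool, so sequences can grow without contiguous allocation. The struct and lookup below are a simplified sketch under assumed names and an assumed block size, not the project's actual layout.

```cuda
// Simplified page-table lookup for a paged KV-cache. Token positions are split
// into fixed-size blocks; the page table maps each (sequence, logical block)
// to a physical block in one large preallocated pool.
constexpr int BLOCK_TOKENS = 16;  // tokens per cache block (illustrative)

struct PagedKVCache {
    const int*   page_table;  // [num_seqs, max_blocks_per_seq] physical block ids
    const float* k_pool;      // [num_blocks, BLOCK_TOKENS, num_heads, head_dim]
    int max_blocks_per_seq;
    int num_heads;
    int head_dim;
};

__device__ const float* kv_lookup_k(const PagedKVCache& c,
                                    int seq_id, int token_pos, int head) {
    int logical_block  = token_pos / BLOCK_TOKENS;
    int slot           = token_pos % BLOCK_TOKENS;
    int physical_block =
        c.page_table[seq_id * c.max_blocks_per_seq + logical_block];
    // Offset into the pooled K storage for this (block, slot, head).
    size_t offset = ((size_t)physical_block * BLOCK_TOKENS + slot)
                    * c.num_heads * c.head_dim
                    + (size_t)head * c.head_dim;
    return c.k_pool + offset;
}
```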