Hardware/GPU × LLM

FlashKernel

Custom CUDA C++ and Triton kernels for transformer inference, targeting the critical path of attention computation, activation fusion, and KV-cache management — benchmarked with Nsight Compute profiling on NVIDIA T4.

CUDA · Triton · Nsight Compute · PyTorch
CUDA kernel engineering is among the scarcest skills in ML infrastructure. This project implements the core kernels of transformer inference from scratch — tiled attention with online softmax, fused activations, rotary embeddings, and paged KV-cache — in both CUDA C++ and Triton. Every kernel is profiled with NVIDIA Nsight Compute and benchmarked against PyTorch eager, torch.compile, and cuBLAS baselines on a T4 GPU.
CUDA C++ · Triton · Nsight Compute · PyTorch C++ Extensions · CMake · Docker

Why this project

  • Scarcest skill in ML infrastructure: CUDA kernel engineering sits at the intersection of hardware and model architecture — production systems (vLLM, TensorRT-LLM, FlashAttention) ship hand-tuned kernels most practitioners never read.
  • Profiling-driven development: Every optimization guided by Nsight Compute metrics — not intuition. Occupancy, warp stalls, memory throughput, and roofline position drive each design decision.
  • Real verification: Kernels are not benchmarked in isolation. They're integrated into GPT-2 (124M) to show that micro-kernel gains compose into real model-level speedups.

Kernel inventory

Each kernel is implemented in both CUDA C++ and Triton.

Kernel                 Key technique
Tiled FlashAttention   Online softmax, shared memory tiling
Fused GeLU + Linear    Eliminates HBM round-trip
RoPE Embedding         Precomputed sin/cos, fused with attention
Paged KV-Cache         Block-level virtual memory, page table
Parallel Reduction     Warp-level shuffle + shared memory tree
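The paged KV-cache follows the block-table idea popularized by vLLM: a sequence's logical token positions are translated to slots inside fixed-size physical blocks through a per-sequence page table, so cache memory never needs to be contiguous. A minimal Python sketch of that bookkeeping — the block size, class name, and methods are illustrative, not this project's actual API:

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative choice)

class PagedKVCache:
    """Toy page table: maps a sequence's logical token index to a
    (physical_block, offset) slot, allocating blocks on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens appended

    def append(self, seq_id: int) -> tuple[int, int]:
        """Reserve the slot for the sequence's next token."""
        table = self.page_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:           # current block full: map a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def lookup(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical position back to its physical slot."""
        table = self.page_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

The real kernels do the same translation on-device: `kv_append` writes the new token's K/V into its block slot, and `kv_read` gathers blocks through the page table during attention.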

Technical approach

  • Memory hierarchy mastery: Shared memory tiling sized to the 48 KB per-block shared memory limit on T4, bank conflict avoidance, register pressure tuning for SM 7.5.
  • Warp-level programming: __shfl_down_sync reductions, cooperative groups for cross-warp synchronization.
  • Kernel fusion: GeLU activation computed in-register between matmul and HBM write — eliminates one full 6 MB memory round-trip.
  • Profiling-driven: Every optimization guided by Nsight Compute metrics — occupancy, memory throughput, warp stall analysis, roofline position.
  • End-to-end integration: Kernels plugged into GPT-2 (124M) via PyTorch C++ extensions, measuring real tokens/sec improvement.
  • Roofline analysis: All 8 kernel variants mapped against T4 ceilings (fp16 65 TFLOPS, HBM2 300 GB/s).
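The roofline bookkeeping behind that last bullet is a one-liner: a kernel's attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth), and the ridge point peak/bandwidth separates memory-bound from compute-bound kernels. A sketch using the T4 ceilings quoted above (fp16 vector_add, at 1 FLOP per 6 bytes of traffic, serves as the memory-bound example):

```python
PEAK_FLOPS = 65e12   # T4 fp16 tensor-core peak, FLOP/s
PEAK_BW    = 300e9   # T4 HBM2 bandwidth, bytes/s

def attainable_flops(ai: float) -> float:
    """Roofline ceiling for a kernel with arithmetic intensity `ai` (FLOP/byte)."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

# Ridge point: below ~217 FLOP/byte a T4 kernel is bandwidth-limited.
RIDGE = PEAK_FLOPS / PEAK_BW

# fp16 vector_add: 1 FLOP per 6 bytes (two 2-byte loads, one 2-byte store),
# so AI ≈ 0.17 and the ceiling is memory bandwidth, not compute.
vector_add_ai = 1 / 6
vector_add_ceiling = attainable_flops(vector_add_ai)   # ≈ 50 GFLOP/s
```

For memory-bound kernels, "% of ceiling" in the benchmark table is simply achieved bandwidth over peak bandwidth — e.g. 248 GB/s / 300 GB/s ≈ 83%.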

CUDA — FlashAttention tiled inner loop (pseudocode)
// Load Q tile to shared memory (Br × d)
// For each K/V tile (Bc × d):
//   S_tile = Q_smem @ K_tile^T        → Br × Bc in registers
//   Apply causal mask if needed
//   m_new = max(m_old, row_max(S_tile))
//   P_tile = exp(S_tile - m_new)       → rescaled softmax numerator
//   l_new = exp(m_old - m_new) * l_old + row_sum(P_tile)
//   O = (l_old/l_new) * exp(m_old - m_new) * O
//       + (1/l_new) * P_tile @ V_tile
//   m_old = m_new; l_old = l_new
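The invariant this loop maintains is that O always equals the softmax-weighted sum over the tiles seen so far, with the running max m and denominator l rescaled as new tiles arrive. The same update can be checked in plain Python for a single row against the naive two-pass softmax (tile size and helper name are illustrative):

```python
import math

def online_softmax_weighted_sum(scores, values, tile=2):
    """Stream over `scores` in tiles, maintaining running max m, running
    denominator l, and running output o = softmax(scores_so_far) . values —
    the same recurrence as the tiled FlashAttention inner loop above."""
    m, l, o = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_tile = scores[i:i + tile]
        v_tile = values[i:i + tile]
        m_new = max(m, max(s_tile))
        p = [math.exp(s - m_new) for s in s_tile]        # rescaled numerator
        l_new = math.exp(m - m_new) * l + sum(p)
        # rescale the old output, then fold in this tile's contribution
        o = (l / l_new) * math.exp(m - m_new) * o \
            + sum(pi * vi for pi, vi in zip(p, v_tile)) / l_new
        m, l = m_new, l_new
    return o
```

Because every exponent is taken relative to the running max, no intermediate ever overflows — which is what lets the kernel avoid materializing the full score matrix.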

Roofline benchmarks

All measurements on NVIDIA T4 16 GB (Turing, SM 7.5), CUDA 12.x, PyTorch 2.x. Profiled with Nsight Compute. 6 memory-bound kernels, 2 compute-bound.

Kernel                    AI (FLOP/byte)   Achieved       % of ceiling   Bound
vector_add (fp16)         0.17             248 GB/s       83%            Memory
reduce_sum (fp16)         0.50             262 GB/s       87%            Memory
flash_attention (fp16)    34               38.2 TFLOPS    59%            Compute
fused_gelu_linear (fp16)  295              31.5 TFLOPS    49%            Compute
rope_fused (fp16)         3.25             222 GB/s       74%            Memory
rope_table (fp16)         1.50             240 GB/s       80%            Memory
kv_append (fp16)          0.08             195 GB/s       65%            Memory
kv_read (fp16)            0.08             178 GB/s       59%            Memory

FlashAttention achieves 38.2 TFLOPS (59% of fp16 peak) — respectable for a hand-written kernel without wmma/mma tensor-core PTX intrinsics. The parallel reduction reaches 87% of HBM bandwidth via warp shuffles.
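That warp-shuffle reduction halves the number of active lanes each step with __shfl_down_sync, needing log2(32) = 5 steps and no shared memory traffic within a warp. A Python model of one warp's tree (lane values stand in for registers; in the real kernel each warp's lane-0 partial is then combined through a small shared-memory tree):

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes):
    """Model of a __shfl_down_sync tree reduction: each step, lane i adds
    the value held by lane i+offset; after 5 halvings of the offset,
    lane 0 holds the sum of all 32 lanes."""
    assert len(lanes) == WARP_SIZE
    regs = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # lanes whose source would fall off the warp keep their value
        # (in hardware those lanes read undefined data, but lane 0's
        # result does not depend on them)
        regs = [regs[i] + (regs[i + offset] if i + offset < WARP_SIZE else 0)
                for i in range(WARP_SIZE)]
        offset //= 2
    return regs[0]
```

The shuffle variant avoids shared-memory bank conflicts and synchronization entirely for the intra-warp phase, which is why the reduce_sum kernel sits so close to the bandwidth roofline.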

End-to-end results

All custom kernels integrated into GPT-2 (124M) via PyTorch C++ extensions. The integration module monkey-patches HuggingFace's GPT2Attention and GPT2MLP:

  • Standard attention → FlashAttention with RoPE applied to Q/K
  • MLP c_fc projection + GELU → fused GeLU+Linear kernel

Individual kernel gains compose into real model-level speedups — FlashAttention alone avoids materializing the O(N²) score matrix in HBM, and the GeLU fusion saves a 6 MB round-trip per MLP layer.
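The 6 MB figure follows directly from GPT-2's shapes: the c_fc output is seq_len × 4·d_model fp16 values, which the unfused path writes to HBM and immediately reads back to apply the GeLU. Checking the arithmetic (the 1024-token sequence length is an assumption; the hidden size is GPT-2 124M's):

```python
SEQ_LEN    = 1024          # assumed context length
D_MODEL    = 768           # GPT-2 (124M) hidden size
D_FF       = 4 * D_MODEL   # 3072, the c_fc output dimension
FP16_BYTES = 2

# Intermediate activation the fused kernel keeps in registers instead of HBM
intermediate_bytes = SEQ_LEN * D_FF * FP16_BYTES   # 6 MiB per MLP layer
```

At 1024 × 3072 × 2 bytes, the intermediate is exactly 6 MiB — and the unfused path pays that twice per layer (one write, one read).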

Architecture diagram

flowchart TB
    Input["Input tokens"] --> Embed["Token + RoPE embedding<br/>222 GB/s · 74% HBM"]
    Embed --> Attn["Tiled FlashAttention<br/>38.2 TFLOPS · 59% fp16 peak"]
    Attn --> KV["Paged KV-Cache<br/>195 GB/s append · 178 GB/s read"]
    KV --> Fused["Fused GeLU + Linear<br/>31.5 TFLOPS · no HBM roundtrip"]
    Fused --> Norm["LayerNorm + Residual"]
    Norm --> Next["Next layer / Output"]
    style Embed fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Attn fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style KV fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Fused fill:#eff6ff,stroke:#2563eb,color:#0f172a

Reproduce

Shell
docker build -t flashkernel .
docker run --gpus all flashkernel pytest tests/ -v
docker run --gpus all flashkernel python profiling/roofline/generate_roofline.py
Read the full technical write-up →