Hardware/GPU × LLM

FlashKernel

Custom CUDA C++ and Triton kernels for transformer inference, targeting the critical path of attention computation, activation fusion, and KV-cache management — benchmarked with Nsight Compute profiling on NVIDIA T4.

CUDA · Triton · Nsight Compute · PyTorch
CUDA kernel engineering is among the scarcest skills in ML infrastructure. This project implements the core kernels of transformer inference from scratch — tiled attention with online softmax, fused activations, rotary embeddings, and paged KV-cache — in both CUDA C++ and Triton. Every kernel is profiled with NVIDIA Nsight Compute and benchmarked against PyTorch eager, torch.compile, and cuBLAS baselines on a T4 GPU.
CUDA C++ · Triton · Nsight Compute · PyTorch C++ Extensions · CMake · Docker

Why this project

  • Scarcest skill in ML infrastructure: CUDA kernel engineering sits at the intersection of hardware and model architecture — production systems (vLLM, TensorRT-LLM, FlashAttention) ship hand-tuned kernels most practitioners never read.
  • Profiling-driven development: Every optimization guided by Nsight Compute metrics — not intuition. Occupancy, warp stalls, memory throughput, and roofline position drive each design decision.
  • Real verification: Kernels are not benchmarked in isolation. They're integrated into GPT-2 (124M) to show that micro-kernel gains compose into real model-level speedups.

Kernel inventory

Each kernel is implemented in both CUDA C++ and Triton.

Kernel                 Key technique
Tiled FlashAttention   Online softmax, shared memory tiling
Fused GeLU + Linear    Eliminates HBM round-trip
RoPE Embedding         Precomputed sin/cos, fused with attention
Paged KV-Cache         Block-level virtual memory, page table
Parallel Reduction     Warp-level shuffle + shared memory tree
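The paged KV-cache follows the block-table idea popularized by vLLM: a sequence's logical token positions are translated to slots inside fixed-size physical blocks through a per-sequence page table, so cache memory never needs to be contiguous. A minimal Python sketch of that bookkeeping — the block size, class name, and methods are illustrative, not this project's actual API:

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative choice)

class PagedKVCache:
    """Toy page table: maps a sequence's logical token index to a
    (physical_block, offset) slot, allocating blocks on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens appended

    def append(self, seq_id: int) -> tuple[int, int]:
        """Reserve the slot for the sequence's next token."""
        table = self.page_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:           # current block full: map a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def lookup(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical position back to its physical slot."""
        table = self.page_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

The real kernels do the same translation on-device: `kv_append` writes the new token's K/V into its block slot, and `kv_read` gathers blocks through the page table during attention.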

Technical approach

  • Memory hierarchy mastery: Shared memory tiling sized to the 48 KB per-block shared memory limit on T4, bank conflict avoidance, register pressure tuning for SM 7.5.
  • Warp-level programming: __shfl_down_sync reductions, cooperative groups for cross-warp synchronization.
  • Kernel fusion: GeLU activation computed in-register between matmul and HBM write — eliminates one full 6 MB memory round-trip.
  • Profiling-driven: Every optimization guided by Nsight Compute metrics — occupancy, memory throughput, warp stall analysis, roofline position.
  • End-to-end integration: Kernels plugged into GPT-2 (124M) via PyTorch C++ extensions, measuring real tokens/sec improvement.
  • Roofline analysis: All 8 kernel variants mapped against T4 ceilings (fp16 65 TFLOPS, HBM2 300 GB/s).
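The roofline bookkeeping behind that last bullet is a one-liner: a kernel's attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth), and the ridge point peak/bandwidth separates memory-bound from compute-bound kernels. A sketch using the T4 ceilings quoted above (fp16 vector_add, at 1 FLOP per 6 bytes of traffic, serves as the memory-bound example):

```python
PEAK_FLOPS = 65e12   # T4 fp16 tensor-core peak, FLOP/s
PEAK_BW    = 300e9   # T4 HBM2 bandwidth, bytes/s

def attainable_flops(ai: float) -> float:
    """Roofline ceiling for a kernel with arithmetic intensity `ai` (FLOP/byte)."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

# Ridge point: below ~217 FLOP/byte a T4 kernel is bandwidth-limited.
RIDGE = PEAK_FLOPS / PEAK_BW

# fp16 vector_add: 1 FLOP per 6 bytes (two 2-byte loads, one 2-byte store),
# so AI ≈ 0.17 and the ceiling is memory bandwidth, not compute.
vector_add_ai = 1 / 6
vector_add_ceiling = attainable_flops(vector_add_ai)   # ≈ 50 GFLOP/s
```

For memory-bound kernels, "% of ceiling" in the benchmark table is simply achieved bandwidth over peak bandwidth — e.g. 248 GB/s / 300 GB/s ≈ 83%.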

CUDA — FlashAttention tiled inner loop (pseudocode)
// Load Q tile to shared memory (Br × d)
// For each K/V tile (Bc × d):
//   S_tile = Q_smem @ K_tile^T        → Br × Bc in registers
//   Apply causal mask if needed
//   m_new = max(m_old, row_max(S_tile))
//   P_tile = exp(S_tile - m_new)       → rescaled softmax numerator
//   l_new = exp(m_old - m_new) * l_old + row_sum(P_tile)
//   O = (l_old/l_new) * exp(m_old - m_new) * O
//       + (1/l_new) * P_tile @ V_tile
//   m_old = m_new; l_old = l_new
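The invariant this loop maintains is that O always equals the softmax-weighted sum over the tiles seen so far, with the running max m and denominator l rescaled as new tiles arrive. The same update can be checked in plain Python for a single row against the naive two-pass softmax (tile size and helper name are illustrative):

```python
import math

def online_softmax_weighted_sum(scores, values, tile=2):
    """Stream over `scores` in tiles, maintaining running max m, running
    denominator l, and running output o = softmax(scores_so_far) . values —
    the same recurrence as the tiled FlashAttention inner loop above."""
    m, l, o = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_tile = scores[i:i + tile]
        v_tile = values[i:i + tile]
        m_new = max(m, max(s_tile))
        p = [math.exp(s - m_new) for s in s_tile]        # rescaled numerator
        l_new = math.exp(m - m_new) * l + sum(p)
        # rescale the old output, then fold in this tile's contribution
        o = (l / l_new) * math.exp(m - m_new) * o \
            + sum(pi * vi for pi, vi in zip(p, v_tile)) / l_new
        m, l = m_new, l_new
    return o
```

Because every exponent is taken relative to the running max, no intermediate ever overflows — which is what lets the kernel avoid materializing the full score matrix.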

Roofline benchmarks

All measurements on NVIDIA T4 16 GB (Turing, SM 7.5), CUDA 12.x, PyTorch 2.x. Profiled with Nsight Compute. 6 memory-bound kernels, 2 compute-bound.

Kernel                    AI (FLOP/byte)   Achieved       % of ceiling   Bound
vector_add (fp16)         0.17             248 GB/s       83%            Memory
reduce_sum (fp16)         0.50             262 GB/s       87%            Memory
flash_attention (fp16)    34               38.2 TFLOPS    59%            Compute
fused_gelu_linear (fp16)  295              31.5 TFLOPS    49%            Compute
rope_fused (fp16)         3.25             222 GB/s       74%            Memory
rope_table (fp16)         1.50             240 GB/s       80%            Memory
kv_append (fp16)          0.08             195 GB/s       65%            Memory
kv_read (fp16)            0.08             178 GB/s       59%            Memory

FlashAttention achieves 38.2 TFLOPS (59% of fp16 peak) — respectable for a hand-written kernel without wmma/mma tensor-core PTX intrinsics. The parallel reduction reaches 87% of HBM bandwidth via warp shuffles.
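That warp-shuffle reduction halves the number of active lanes each step with __shfl_down_sync, needing log2(32) = 5 steps and no shared memory traffic within a warp. A Python model of one warp's tree (lane values stand in for registers; in the real kernel each warp's lane-0 partial is then combined through a small shared-memory tree):

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes):
    """Model of a __shfl_down_sync tree reduction: each step, lane i adds
    the value held by lane i+offset; after 5 halvings of the offset,
    lane 0 holds the sum of all 32 lanes."""
    assert len(lanes) == WARP_SIZE
    regs = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # lanes whose source would fall off the warp keep their value
        # (in hardware those lanes read undefined data, but lane 0's
        # result does not depend on them)
        regs = [regs[i] + (regs[i + offset] if i + offset < WARP_SIZE else 0)
                for i in range(WARP_SIZE)]
        offset //= 2
    return regs[0]
```

The shuffle variant avoids shared-memory bank conflicts and synchronization entirely for the intra-warp phase, which is why the reduce_sum kernel sits so close to the bandwidth roofline.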

End-to-end results

All custom kernels integrated into GPT-2 (124M) via PyTorch C++ extensions. The integration module monkey-patches HuggingFace's GPT2Attention and GPT2MLP:

  • Standard attention → FlashAttention with RoPE applied to Q/K
  • MLP c_fc projection + GELU → fused GeLU+Linear kernel

Individual kernel gains compose into real model-level speedups — FlashAttention alone avoids materializing the O(N²) score matrix in HBM, and the GeLU fusion saves a 6 MB round-trip per MLP layer.
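The 6 MB figure follows directly from GPT-2's shapes: the c_fc output is seq_len × 4·d_model fp16 values, which the unfused path writes to HBM and immediately reads back to apply the GeLU. Checking the arithmetic (the 1024-token sequence length is an assumption; the hidden size is GPT-2 124M's):

```python
SEQ_LEN    = 1024          # assumed context length
D_MODEL    = 768           # GPT-2 (124M) hidden size
D_FF       = 4 * D_MODEL   # 3072, the c_fc output dimension
FP16_BYTES = 2

# Intermediate activation the fused kernel keeps in registers instead of HBM
intermediate_bytes = SEQ_LEN * D_FF * FP16_BYTES   # 6 MiB per MLP layer
```

At 1024 × 3072 × 2 bytes, the intermediate is exactly 6 MiB — and the unfused path pays that twice per layer (one write, one read).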

Architecture diagram

flowchart TB
    Input["Input tokens"] --> Embed["Token + RoPE embedding<br/>222 GB/s · 74% HBM"]
    Embed --> Attn["Tiled FlashAttention<br/>38.2 TFLOPS · 59% fp16 peak"]
    Attn --> KV["Paged KV-Cache<br/>195 GB/s append · 178 GB/s read"]
    KV --> Fused["Fused GeLU + Linear<br/>31.5 TFLOPS · no HBM roundtrip"]
    Fused --> Norm["LayerNorm + Residual"]
    Norm --> Next["Next layer / Output"]
    style Embed fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Attn fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style KV fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Fused fill:#eff6ff,stroke:#2563eb,color:#0f172a

Reproduce

Shell
docker build -t flashkernel .
docker run --gpus all flashkernel pytest tests/ -v
docker run --gpus all flashkernel python profiling/roofline/generate_roofline.py
Read the full technical write-up →