Hardware/GPU × LLM

FlashKernel

Custom CUDA C++ and Triton kernels for transformer inference, targeting the critical path of attention computation, activation fusion, and KV-cache management, benchmarked with Nsight Compute profiling on an NVIDIA T4.

CUDA · Triton · Nsight Compute · PyTorch
CUDA kernel engineering is among the scarcest skills in ML infrastructure. This project implements the core kernels of transformer inference from scratch, in both CUDA C++ and Triton: tiled attention with online softmax, fused activations, rotary embeddings, and a paged KV-cache. Every kernel is profiled with NVIDIA Nsight Compute and benchmarked against PyTorch eager, torch.compile, and cuBLAS baselines on a T4 GPU.
CUDA C++ · Triton · Nsight Compute · PyTorch C++ Extensions · CMake · Docker

Kernel inventory

| Kernel | CUDA C++ | Triton | Key technique |
|---|---|---|---|
| Tiled FlashAttention | ✓ | ✓ | Online softmax, shared memory tiling |
| Fused GeLU + Linear | ✓ | ✓ | Eliminates HBM round-trip |
| RoPE Embedding | ✓ | ✓ | Precomputed sin/cos, fused with attention |
| Paged KV-Cache | ✓ | ✓ | Block-level virtual memory, page table |
| Parallel Reduction | ✓ | ✓ | Warp-level shuffle + shared memory tree |
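
The online softmax trick is what lets the tiled attention kernel make a single pass over the score matrix. A minimal sketch of the per-element update, assuming a running maximum `m` and a rescaled running sum `l` (the function name is illustrative; the real kernel applies this tile-wise with vector accumulators):

```cuda
// Online softmax recurrence: fold one new score into a running maximum m
// and a running sum l that is kept normalized to exp(-m). Initialize with
// m = -INFINITY, l = 0.0f before the first update.
__device__ void online_softmax_update(float score, float& m, float& l) {
    float m_new = fmaxf(m, score);       // new running maximum
    l = l * __expf(m - m_new)            // rescale old sum to the new max
      + __expf(score - m_new);           // add the current term
    m = m_new;
}
```

Because the sum is re-normalized on the fly, no second pass over HBM is needed to apply the max subtraction, which is exactly what makes the tiled single-pass attention possible.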

Technical approach

  • Memory hierarchy mastery: Shared memory tiling sized to the T4's 48 KB per-block shared memory limit, bank-conflict avoidance, and register pressure tuning for SM 7.5.
  • Warp-level programming: __shfl_down_sync reductions, with cooperative groups for cross-warp synchronization (see the reduction sketch after this list).
  • Kernel fusion: The GeLU activation is computed in-register between the matmul and the HBM write, eliminating one full memory round-trip (sketched below).
  • Profiling-driven: Every optimization is guided by Nsight Compute metrics: occupancy, memory throughput, warp stall analysis, and roofline position.
  • End-to-end integration: Kernels plug into GPT-2 (124M) via PyTorch C++ extensions, measuring real tokens/sec improvement (binding sketch below).
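
A minimal sketch of the shuffle-plus-shared-memory-tree pattern behind the Parallel Reduction kernel (names are hypothetical; grid-stride loads and vectorization omitted, and blockDim.x is assumed to be a multiple of 32):

```cuda
#include <cuda_runtime.h>

// Each lane of a warp contributes one value; after five shuffle steps,
// lane 0 holds the warp's sum with no shared memory traffic.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level reduction: warp shuffles first, then the per-warp partial
// sums are combined through shared memory, then one atomicAdd per block.
__global__ void reduce_sum(const float* __restrict__ in, float* out, int n) {
    __shared__ float warp_sums[32];              // one slot per warp
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;

    float val = (tid < n) ? in[tid] : 0.0f;
    val = warp_reduce_sum(val);
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();

    if (warp == 0) {                             // first warp finishes the tree
        val = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        val = warp_reduce_sum(val);
        if (lane == 0) atomicAdd(out, val);
    }
}
```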
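The kernel-fusion idea, as a sketch: the tanh-approximation GeLU is evaluated in registers as a matmul epilogue, so the pre-activation never round-trips through HBM. The surrounding tiled matmul is omitted and the names are illustrative:

```cuda
// tanh-approximation GeLU, evaluated in registers on the matmul
// accumulator just before the single store to HBM.
__device__ __forceinline__ float gelu(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
}

// Epilogue of a hypothetical tiled matmul: acc lives in a register.
// Unfused code would store acc, then launch a second kernel to read it
// back and apply GeLU; fusion removes that full read/write round-trip.
__device__ __forceinline__ void store_fused(float* __restrict__ out,
                                            int idx, float acc) {
    out[idx] = gelu(acc);
}
```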
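And a minimal sketch of the integration path, assuming a hypothetical `flash_attention_forward` launcher compiled as a PyTorch C++ extension (name and signature are illustrative, not the project's actual API):

```cuda
#include <torch/extension.h>

// Declared here, defined in the .cu file that launches the tiled
// attention kernel.
torch::Tensor flash_attention_forward(torch::Tensor q,
                                      torch::Tensor k,
                                      torch::Tensor v);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("flash_attention_forward", &flash_attention_forward,
          "Tiled FlashAttention forward (CUDA)");
}
```

From Python, `torch.utils.cpp_extension.load` can JIT-compile this and the bound op can be swapped into the model's attention block for end-to-end tokens/sec measurement.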

Benchmarks

All measurements on NVIDIA T4 16 GB, CUDA 12.x, PyTorch 2.x. Averaged over 100 warmup + 1000 timed iterations.

| Kernel (seq=2048) | PyTorch Eager | torch.compile | Triton (ours) | CUDA C++ (ours) |
|---|---|---|---|---|
| FlashAttention | TBD | TBD | TBD | TBD |
| Fused GeLU+Linear | TBD | TBD | TBD | TBD |
| RoPE | TBD | TBD | TBD | TBD |
| Paged KV-Cache | TBD | TBD | TBD | TBD |

Results will be populated from real benchmark runs; Nsight Compute profiles are committed to the repository.
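
A sketch of the timing methodology above (100 warmup + 1000 timed iterations), using CUDA events; the harness name and the `launch` callback are illustrative:

```cuda
#include <cuda_runtime.h>
#include <functional>

// Returns mean milliseconds per launch: warm up first, then bracket the
// timed loop with CUDA events so only device time is measured.
float time_kernel(const std::function<void()>& launch,
                  int warmup = 100, int iters = 1000) {
    for (int i = 0; i < warmup; ++i) launch();   // warm caches, clocks, JIT
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}
```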

Architecture diagram

```mermaid
flowchart TB
    Input["Input tokens"] --> Embed["Token + RoPE embedding<br/>Custom CUDA kernel"]
    Embed --> Attn["Tiled FlashAttention<br/>Shared memory, online softmax"]
    Attn --> KV["Paged KV-Cache<br/>Block-level virtual memory"]
    KV --> Fused["Fused GeLU + Linear<br/>Single kernel, no HBM roundtrip"]
    Fused --> Norm["LayerNorm + Residual"]
    Norm --> Next["Next layer / Output"]
    style Embed fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Attn fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style KV fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Fused fill:#eff6ff,stroke:#2563eb,color:#0f172a
```
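
For the Paged KV-Cache stage in the diagram, a sketch of the page-table indirection, in the style of vLLM's block tables (the struct and names are hypothetical):

```cuda
// Logical token position -> physical address: the cache is a pool of
// fixed-size blocks, and a per-sequence page table maps logical block
// indices to physical block ids, so a sequence can grow without a large
// contiguous allocation.
struct PageTable {
    const int* block_ids;  // logical block index -> physical block id
    int block_size;        // tokens per block, e.g. 16
};

__device__ const float* kv_entry(const float* __restrict__ pool,
                                 PageTable pt, int token, int head_dim) {
    int block  = pt.block_ids[token / pt.block_size];
    int offset = token % pt.block_size;
    return pool + ((size_t)block * pt.block_size + offset) * (size_t)head_dim;
}
```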