FlashKernel
Custom CUDA C++ and Triton kernels for transformer inference — tiled FlashAttention, fused GeLU+Linear, RoPE, paged KV-cache — benchmarked with Nsight Compute on NVIDIA T4.
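To give a flavor of what these kernels look like, here is a minimal Triton sketch of the fused Linear+GeLU idea: a blocked matmul whose epilogue applies the activation in registers before the output tile is written back, so GeLU costs no extra pass over global memory. This is an illustrative sketch, not the kernel shipped in this repo; the block sizes, fp32 accumulation, and the sigmoid-based GeLU approximation are assumptions.

```python
# Sketch of a Linear (matmul) kernel with a fused GeLU epilogue.
# Illustrative only: block sizes, fp32 accumulation, and the sigmoid-based
# GeLU approximation are assumptions, not the tuned kernel in this repo.
import torch
import triton
import triton.language as tl


@triton.jit
def linear_gelu_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # Fused epilogue: GeLU(x) ~= x * sigmoid(1.702 * x), applied to the tile
    # while it is still in registers.
    acc = acc * (1.0 / (1.0 + tl.exp(-1.702 * acc)))
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def linear_gelu(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """GeLU(a @ b) in one kernel launch; a is (M, K), b is (K, N), fp32."""
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    linear_gelu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

The repo's own kernels are presumably autotuned and profiled under Nsight Compute; the sketch only shows where the activation fuses into the GEMM epilogue.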
Each project spans 2–3 domains — LLM, robotics, quantum AI, energy systems, brain-computer interfaces, and GPU compute — with real benchmarks, profiling artifacts, and reproducible code.
Language-grounded robotic manipulation — a VLM planner decomposes natural language instructions into sub-tasks, and RL-trained policies execute each step in MuJoCo simulation.
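The planner/executor split is easiest to see as code. The skeleton below is hypothetical: `query_vlm`, `run_policy`, and the skill names are stand-ins (the real project presumably prompts a vision-language model for a structured plan and rolls out RL policy checkpoints in MuJoCo), but the control flow, decompose the instruction and then execute it sub-task by sub-task, is the idea described above.

```python
# Hypothetical plan-then-execute loop; query_vlm and run_policy are stand-ins.
from dataclasses import dataclass


@dataclass
class SubTask:
    skill: str   # e.g. "reach", "grasp", "place"
    target: str  # object or location referenced in the instruction


def query_vlm(instruction: str) -> list[SubTask]:
    # Stand-in for the VLM planner: a real system would prompt the model to
    # emit a structured plan (e.g. JSON) and parse it into SubTask objects.
    return [SubTask("reach", "red block"),
            SubTask("grasp", "red block"),
            SubTask("place", "blue tray")]


def run_policy(skill: str, target: str) -> bool:
    # Stand-in for rolling out the RL policy for one skill in simulation;
    # returns whether the sub-task succeeded.
    print(f"executing {skill}({target})")
    return True


def execute(instruction: str) -> bool:
    for step in query_vlm(instruction):
        if not run_policy(step.skill, step.target):
            return False  # a real system could trigger replanning here
    return True


if __name__ == "__main__":
    execute("put the red block on the blue tray")
```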
Foundation model for neural signal decoding — pre-train a transformer on large-scale EEG data, fine-tune for motor imagery BCI with a custom frequency-band attention kernel.
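As an illustration of what frequency-band attention could mean, here is a plain PyTorch sketch that splits each EEG channel into the canonical delta-to-gamma bands, embeds per-band power as tokens, and lets multi-head attention mix information across channel-band pairs. The band edges, token construction, and module layout are assumptions; the project's custom kernel presumably fuses this computation rather than expressing it as separate PyTorch ops.

```python
# Reference sketch of frequency-band attention over EEG; band edges and token
# construction are assumptions, not the project's fused CUDA kernel.
import torch
import torch.nn as nn

BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 30), (30, 45)]  # Hz: delta..gamma


class FrequencyBandAttention(nn.Module):
    """Split each channel into frequency bands, embed band power per channel,
    and let multi-head attention mix across (channel, band) tokens."""

    def __init__(self, n_channels: int, d_model: int = 64, n_heads: int = 4,
                 fs: float = 250.0):
        super().__init__()
        self.fs = fs
        self.embed = nn.Linear(1, d_model)  # band-power scalar -> token
        self.pos = nn.Parameter(torch.zeros(n_channels * len(BANDS), d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) raw EEG
        spec = torch.fft.rfft(x, dim=-1)
        freqs = torch.fft.rfftfreq(x.shape[-1], d=1.0 / self.fs).to(x.device)
        powers = []
        for lo, hi in BANDS:
            mask = (freqs >= lo) & (freqs < hi)
            powers.append(spec[..., mask].abs().pow(2).mean(dim=-1))
        p = torch.stack(powers, dim=-1)                    # (batch, ch, bands)
        tokens = self.embed(p.reshape(x.shape[0], -1, 1)) + self.pos
        out, _ = self.attn(tokens, tokens, tokens)
        return out                                         # (batch, ch*bands, d_model)


# usage: FrequencyBandAttention(n_channels=22)(torch.randn(8, 22, 1000))
```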
Quantum-classical hybrid optimization for energy grids — QAOA and VQE circuits applied to unit commitment on real ENTSO-E data, benchmarked against classical MILP solvers.
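To make the QAOA-for-unit-commitment pairing concrete, the sketch below encodes a four-generator toy instance as a diagonal cost Hamiltonian (one on/off qubit per unit, with a quadratic penalty for missing demand) and runs a depth-2 QAOA by direct state-vector simulation in NumPy, then compares the most likely schedule against the brute-force optimum. The generator data, penalty weight, and circuit depth are made up for illustration; the actual project formulates the problem on ENTSO-E data and benchmarks against MILP solvers.

```python
# Toy QAOA for a 4-unit commitment instance via NumPy state-vector simulation.
# All numbers are illustrative, not the project's ENTSO-E formulation.
import numpy as np
from itertools import product
from scipy.optimize import minimize

outputs = np.array([90.0, 60.0, 40.0, 30.0])  # MW if the unit is on
costs   = np.array([30.0, 45.0, 60.0, 75.0])  # cost of switching the unit on
demand  = 130.0
penalty = 0.05                                 # weight of demand-mismatch term
n = len(outputs)


def cost_of(z):
    gen = np.dot(outputs, z)
    return np.dot(costs, z) + penalty * (gen - demand) ** 2


# Diagonal cost Hamiltonian: cost of every bitstring (qubit k = unit k on/off).
bitstrings = np.array(list(product([0, 1], repeat=n)))
diag_cost = np.array([cost_of(z) for z in bitstrings])
phase_cost = diag_cost / diag_cost.max()       # rescaled for the phase layer


def apply_rx(state, beta, qubit):
    """Apply exp(-i*beta*X) to one qubit of the flat state vector."""
    c, s = np.cos(beta), -1j * np.sin(beta)
    state = state.reshape(2 ** qubit, 2, 2 ** (n - qubit - 1))
    a, b = state[:, 0, :].copy(), state[:, 1, :].copy()
    state[:, 0, :] = c * a + s * b
    state[:, 1, :] = s * a + c * b
    return state.reshape(-1)


def qaoa_state(params, p):
    gammas, betas = params[:p], params[p:]
    state = np.full(2 ** n, 1 / np.sqrt(2 ** n), dtype=complex)  # |+...+>
    for gamma, beta in zip(gammas, betas):
        state = state * np.exp(-1j * gamma * phase_cost)         # cost layer
        for q in range(n):                                       # mixer layer
            state = apply_rx(state, beta, q)
    return state


def expected_cost(params, p):
    probs = np.abs(qaoa_state(params, p)) ** 2
    return float(np.dot(probs, diag_cost))


p = 2
rng = np.random.default_rng(0)
res = minimize(expected_cost, rng.uniform(0, 1, 2 * p), args=(p,), method="COBYLA")
probs = np.abs(qaoa_state(res.x, p)) ** 2
best = bitstrings[np.argmax(probs)]
print("QAOA most likely schedule:", best, "cost", cost_of(best))
print("brute-force optimum:      ", bitstrings[np.argmin(diag_cost)],
      "cost", diag_cost.min())
```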