GitHub Monthly Report: 2026-03-01 to 2026-03-31
Engineering Report (2026-03-01 to 2026-03-31)
Executive Summary
March 2026 was a monumental month for the AI engineering ecosystem, marked by massive framework upgrades prioritizing Mixture of Experts (MoE) architectures, native low-precision quantization (FP8, MXFP8, NVFP4), and speculative decoding. Major releases including PyTorch 2.11.0, vLLM v0.18.x/v0.17.x, HuggingFace Transformers v5.4.0, and TorchAO v0.17.0 dominated the landscape.
Maintenance health across the ecosystem is exceptionally high. Heavyweight repositories like pytorch/pytorch (2,127 PRs closed) and openxla/xla (1,617 PRs closed) are sustaining massive throughput. The focus across all stacks is clearly on optimizing inference latency for massive models (DeepSeek V3/R1, Qwen3.5, Llama 4) on both AMD's and NVIDIA's newest hardware architectures.
AMD Related Updates
- ROCm 7.2.1 Released: Introduced support for Ubuntu 24.04.4 and JAX 0.8.2, while deprecating the Offline Installer Creator and Ubuntu 24.04.3. It also delivered significant hipBLASLt performance improvements for MXFP8 and MXFP4 GEMMs.
- Agentic Kernel Optimization Breakthroughs: AMD-AGI's `GEAK-agent` launched both v1.0.0 and v2.0.0 this month. v2.0.0 introduced a Profiler-Analyzer loop using `rocprof-compute` telemetry and an LLM-based evaluator, achieving massive speedups (up to 7.02× on ROCm-bench) for Triton kernel optimizations.
- PyTorch 2.11 ROCm Enhancements: PyTorch implemented "hipify v2", completely removing legacy Caffe2 workarounds from the ROCm hipify preprocessing step. It also enabled scaled grouped mm on gfx950 and grouped GEMM on gfx90a, alongside AOTriton updates to fix sliding-window attention NaNs.
- Third-Party Ecosystem Expansion: `vLLM` added ROCm Sparse MLA CUDA graphs, MXFP4 MoE weight pre-shuffling for gfx950, and fused RoPE+KVCache via AITER. `SGLang` shipped FP8 prefill integration for DeepSeek models on AMD hardware, MHA FP8-KV support, and a fused GemmaRMSNorm `forward_hip` for Qwen3.5. `meta-pytorch/monarch` added ROCm/HIP support for its RDMA stack, and `llm-d` introduced an official `llm-d-rocm` image.
Competitive Analysis
- NVIDIA Blackwell (SM100) Enablement: NVIDIA is rapidly maturing its Blackwell software stack. `TransformerEngine` v2.13 added deterministic FP8 fused attention and new MXFP8/NVFP4 MoE quantization kernels specifically optimized for SM100.
- FlashAttention 4 Ecosystem Adoption: FA4 is officially rolling out. HuggingFace Transformers added FA4 fallback integration, and vLLM integrated it for MLA prefill.
- DeepSeek Hardware Push: DeepSeek's `DeepEP` is actively tuning for NVIDIA's newest massive-scale hardware, patching NVLink domain over-counts specifically for GB200 NVL72 (MNNVL) architectures.
- Deprecation of Legacy Hardware: PyTorch 2.11 removed Volta (SM 7.0) support from its CUDA 12.8+ binaries to support cuDNN 9.15.1, officially forcing legacy V100 users onto older builds or source compilation. Furthermore, PyPI wheels now ship with CUDA 13.0 by default.
Category Updates
AMD Ecosystem (ROCm & AMD-AGI)
AMD-AGI/GEAK-agent
- Key Activity:
- [2026-03-25] RELEASE: v2.0.0 - Introduced `GEAK-OptimAgentv2` (Instruction-to-Triton with multi-offspring evolution and hardware profiling) and `GEAK-OpenEvolve` (Triton-to-Triton MAP-Elites search).
- [2026-03-25] RELEASE: v1.0.0 - Initial release of the Triton Kernel AI Agent architecture (generator, reflector, evaluator, optimizer) and GEAK-eval benchmarks.
- Details:
- [2026-03-31] Multiple doc updates merged, covering README and config fixes.
- Metrics: 0 PRs, 0 Issues (excluding releases/doc updates)
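GEAK-OpenEvolve's Triton-to-Triton search is described as MAP-Elites, a quality-diversity algorithm that keeps the best candidate per behavior-descriptor cell rather than a single global best. The sketch below is purely illustrative and not the GEAK implementation: the genome, fitness function, and descriptor are invented stand-ins for a kernel config and profiler feedback.

```python
import random

def map_elites(evaluate, descriptor, mutate, seeds, iterations=500, rng=None):
    """Minimal MAP-Elites loop: archive the best genome per descriptor cell."""
    rng = rng or random.Random(0)
    archive = {}  # cell -> (fitness, genome)

    def consider(genome):
        fitness, cell = evaluate(genome), descriptor(genome)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, genome)

    for genome in seeds:
        consider(genome)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # uniform over elites
        consider(mutate(parent, rng))
    return archive

# Toy "kernel config" genome (block_size, num_warps); fitness peaks at (64, 8).
evaluate = lambda cfg: -((cfg[0] - 64) ** 2 + (cfg[1] - 8) ** 2)
descriptor = lambda cfg: (cfg[0] // 32, cfg[1] // 4)  # coarse behavior bins
mutate = lambda cfg, rng: (max(1, cfg[0] + rng.choice((-16, 16))),
                           max(1, cfg[1] + rng.choice((-2, 2))))

archive = map_elites(evaluate, descriptor, mutate, seeds=[(32, 4)])
best_fitness, best_cfg = max(archive.values())
```

In the real system the genome would be Triton source, the fitness a measured latency, and the descriptor something like occupancy or memory-traffic bins, so the archive preserves diverse optimization strategies instead of collapsing onto one.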
ROCm/ROCm
- Key Activity:
- [2026-03-25] RELEASE: rocm-7.2.1 - Added Ubuntu 24.04.4 and JAX 0.8.2 support. Included firmware updates for MI355X, MI350X, MI325X, and MI300X. Discontinued the ROCm Offline Installer Creator.
- Details:
- [2026-03-10] Doc Update: Removed mention of ROCR and CLR from the "What is ROCm" page.
- Known Issue: ROCTracer might fail to report kernel operations (ROCTracer is scheduled for End of Support by Q2 2026, to be replaced by ROCprofiler-SDK).
- Known Issue: AMDGPU SMU driver interface version mismatch on R9700.
- Metrics: 57 PRs, 41 Issues (healthy closure rate: 52 PRs / 28 Issues closed)
AMD-AGI/Primus
- Key Activity: Active PR development for new model capabilities.
- Details:
- Highlight PR: `turbo update`
- Highlight PR: `gpt-oss model support sink_sliding_window`
- Metrics: 67 PRs, 0 Issues (62 PRs closed)
AMD-AGI/TraceLens
- Key Activity: Bug fixes and new profiling integrations.
- Details:
- Highlight Issue: `[Bug] TraceDiff fails to capture kernels for traces with shallow trees`
- Highlight PR: `Add origami rooflines to unified performance report`
- Highlight PR: `add support for traces generated by jax 08 suppress warning`
- Metrics: 36 PRs, 29 Issues (34 PRs / 16 Issues closed)
ROCm/MAD
- Key Activity: Docker and benchmarking updates.
- Details:
- [2026-03-19] Doc Update: Enable large EP microbenchmarks and add blueprints section.
- Highlight PR: Updated docker base container and set PYTHONPATH env for sglang_disagg_inf.
- Metrics: 6 PRs, 0 Issues (6 PRs closed)
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2026-03-23] RELEASE: v2.11.0 - Massive release adding Differentiable Collectives, FlexAttention FlashAttention-4 backend (Hopper/Blackwell), MPS Operator Expansion, XPU Graph Support, and RNN/LSTM GPU Export.
- Details:
- PyPI wheels now default to CUDA 13.0 instead of 12.x.
- Fully removed Caffe2 support from ROCm PyTorch's hipify preprocessing ("hipify v2").
- Deprecated the MAGMA backend for linear algebra operations; cuSOLVER is now the default.
- Removed `torch.export.export_for_training` and the PT2E quantization flow (migrated to `torchao`).
- Metrics: 2273 PRs, 593 Issues (exceptional maintenance health: 2127 PRs / 629 Issues closed)
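For readers unfamiliar with hipify: it is a source-to-source step that rewrites CUDA identifiers into their HIP equivalents before compilation. The toy version below uses a tiny invented subset of the mapping; the real tooling relies on large generated tables and handles many more constructs.

```python
import re

# Illustrative subset of CUDA -> HIP identifier mappings (invented sample;
# the real hipify tables are far larger).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaStream_t": "hipStream_t",
    "cudaMemcpy": "hipMemcpy",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
}

# Longest-first alternation so 'cudaMemcpyHostToDevice' wins over 'cudaMemcpy'.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True)))

def hipify(source: str) -> str:
    """Rewrite every known CUDA identifier in `source` to its HIP equivalent."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

cuda_src = '#include <cuda_runtime.h>\nfloat* p; cudaMalloc(&p, 256); cudaFree(p);'
print(hipify(cuda_src))
```

The call prints the same two lines of C with `hip/hip_runtime.h`, `hipMalloc`, and `hipFree` substituted in place of their CUDA counterparts.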
pytorch/ao
- Key Activity:
- [2026-03-30] RELEASE: v0.17.0 - Added CuteDSL MXFP8 MoE kernels and Per-Head FP8 Quantized Low Precision Attention via FA3.
- Details:
- Declared ABI stability for all CUDA kernels (0.17.0+ works with any PyTorch 2.11+).
- ROCm support added for `scaled_grouped_mm` with the gfx942 FP8 data type.
- Metrics: 0 PRs, 35 Issues
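Background for the FP8 work above: scaled FP8 quantization maps a tensor's absolute maximum onto the e4m3 format's dynamic range (max magnitude 448) and rounds to its 3-bit mantissa grid. The plain-Python sketch below is illustrative only; real kernels use hardware FP8 types and handle subnormals, NaNs, and finer-grained (e.g. per-head) scales.

```python
import math

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 e4m3

def round_to_e4m3_grid(x):
    """Round to 4 significant binary digits (1 implicit + 3 mantissa bits).
    Subnormal and NaN handling is omitted for brevity."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)

def fp8_quantize(values):
    """Per-tensor scaling: stretch amax onto the FP8 range, clamp, round."""
    amax = max(abs(v) for v in values)
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    q = [round_to_e4m3_grid(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * scale)))
         for v in values]
    return q, scale

def fp8_dequantize(q, scale):
    return [v / scale for v in q]

weights = [0.5, -1.25, 3.0, -0.01]
q, scale = fp8_quantize(weights)
restored = fp8_dequantize(q, scale)
```

Round-tripping `weights` through `fp8_quantize`/`fp8_dequantize` bounds the relative error by roughly 1/16, the resolution of a 3-bit mantissa.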
pytorch/vision & pytorch/audio
- Key Activity:
- [2026-03-23] RELEASE: vision v0.26.0 & audio v2.11.0 - Vision completely removed deprecated video decoding/encoding utilities (migrated users to TorchCodec). Both releases ensure torch 2.11 compatibility.
- Metrics: 0 PRs, 0 Issues
pytorch/FBGEMM
- Key Activity:
- [2026-03-30] RELEASE: v1.6.0 - TBE performance upgrades, FP16 support for grouped GEMM wgrad/dgrad, and Python 3.14 support.
- Details:
- Optimized ROCm kernels (`group_index_select_or_add_2d_kernel`, sparse operations).
- Metrics: 0 PRs, 0 Issues
meta-pytorch/monarch
- Key Activity:
- [2026-03-26] RELEASE: v0.4.0 - EFA support for RDMA, TCP fallback, and new ROCm/HIP support for AMD GPU deployments. Added a built-in observability dashboard.
- Metrics: 0 PRs, 0 Issues
AI Frameworks & Serving
vllm-project/vllm
- Key Activity:
- [2026-03-20] RELEASE: v0.18.0 - Added gRPC serving support, GPU-less Render Serving, NGram GPU Speculative Decoding, and removed Ray as a default dependency.
- [2026-03-07] RELEASE: v0.17.0 - PyTorch 2.10 upgrade, FlashAttention 4 integration, and full support for the Qwen3.5 model family.
- Details:
- [2026-03-31] Release v0.18.1 (Patch): Changed default SM100 MLA prefill backend back to TRT-LLM.
- Heavy ROCm optimization across 0.17 and 0.18: Sparse MLA CUDA graphs, AITER fused RoPE+KVCache, MXFP4 MoE weight pre-shuffling.
- Metrics: 0 PRs, 0 Issues (tracked via releases)
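On the NGram speculative decoding feature above: n-gram drafting needs no draft model at all. It looks for the most recent earlier occurrence of the current token suffix in the context and proposes the tokens that followed it, which the target model then verifies in one batched step. A minimal sketch of the proposal side (illustrative, not vLLM's implementation):

```python
def ngram_propose(context, n=3, max_draft=4):
    """Propose draft tokens: find the latest earlier occurrence of the current
    (n-1)-token suffix and return the tokens that followed it."""
    if len(context) < n:
        return []
    suffix = tuple(context[-(n - 1):])
    # Scan backwards, skipping the suffix's own position at the end.
    for i in range(len(context) - n, -1, -1):
        if tuple(context[i:i + n - 1]) == suffix:
            start = i + n - 1
            return context[start:start + max_draft]
    return []

# "the cat sat on the" -> the earlier "the" was followed by "cat sat on the",
# so those tokens are drafted cheaply for the target model to verify.
tokens = ["the", "cat", "sat", "on", "the"]
print(ngram_propose(tokens, n=2))  # -> ['cat', 'sat', 'on', 'the']
```

This is why the technique shines on repetitive workloads (code editing, summarization with copied spans): accepted drafts turn several sequential decode steps into one verification pass.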
huggingface/transformers
- Key Activity:
- [2026-03-27] RELEASE: v5.4.0 - Added VidEoMT, UVDoc, Jina Embeddings v3, Mistral4, PI0, and CHMv2. Flash Attention 2 support now requires version 2.3.3+, and FA4 initial support was merged.
- [2026-03-04] RELEASE: v5.3.0 - Added EuroBERT, VibeVoice ASR, TimesFM2.5, and Higgs Audio V2.
- Details:
- Removed the `cache_position` argument from the forward signatures of most major models.
- Deprecated multiple pipeline tasks in a massive V5 cleanup.
- Metrics: 597 PRs, 168 Issues (very high health: 564 PRs / 178 Issues closed)
sgl-project/sglang
- Key Activity:
- [2026-03-28] RELEASE: v0.5.10rc0 - Piecewise CUDA graph enabled by default. Upgraded to Transformers 5.3.0. Added Elastic EP for Partial Failure Tolerance.
- Details:
- Added AMD FP8 prefill integration with radix cache path for DeepSeek models.
- Integrated FlashInfer MXFP8 Kernels for GEMM and MoE operations.
- Metrics: 0 PRs, 0 Issues (tracked via releases)
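On the MXFP8 kernels above: MX (microscaling) formats attach one shared power-of-two scale to each 32-element block of low-precision values. The sketch below shows only the block-scaling step and is illustrative; rounding the elements to the FP8 grid and encoding the scale as an E8M0 exponent are omitted.

```python
import math

BLOCK = 32  # MX formats share one scale per 32-element block

def mx_quantize_block(block, elem_max=448.0):
    """Pick a power-of-two scale so the block's amax fits the element format
    (448.0 is FP8 e4m3's max magnitude), then divide it out."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, list(block)
    scale = 2.0 ** math.ceil(math.log2(amax / elem_max))
    return scale, [v / scale for v in block]

values = [0.5, -1000.0, 3.0, 0.25] + [0.0] * (BLOCK - 4)
scale, elems = mx_quantize_block(values)
```

Because the scale is a power of two, dividing and re-multiplying is exact in binary floating point, so all quantization error comes from the (omitted) element rounding.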
volcengine/verl
- Key Activity:
- [2026-03-16] RELEASE: v0.7.1 - Megatron Model Engine updates supporting MTP training in SFT/RL, new VeOmni training backend, and TorchTitan backend.
- Metrics: 0 PRs, 0 Issues
THUDM/slime
- Key Activity:
- [2026-03-29] RELEASE: v0.2.4 - Consolidated the router stack onto `sgl-router` and added a rollout trace timeline viewer.
- [2026-03-12] RELEASE: v0.2.3 - YAML-based `sglang_config` support; removed FSDP support to focus on active training paths.
- Metrics: 0 PRs, 0 Issues
llm-d/llm-d
- Key Activity:
- [2026-03-05] RELEASE: v0.5.1 - Component updates across inference-scheduler, routing-sidecar, and cuda/xpu/cpu/rocm/hpu image variants.
- Details:
- Added an `llm-d-rocm` image and debug variants for CUDA images.
- Metrics: 0 PRs, 0 Issues
Deep Learning & System Infrastructure
deepspeedai/DeepSpeed
- Key Activity:
- [2026-03-30] RELEASE: v0.18.9 - Added Universal Checkpoint for AutoTP and Muon Optimizer Support for ZeRO Stage 3.
- [2026-03-05] RELEASE: v0.18.7 - EXAONE 4.0 model support for Inference V2.
- Details:
- Fixed ROCm GPU architecture detection by removing an unnecessary `shell=True`.
- Fixed ROCm BF16 conversion intrinsics in Inference V2.
- Metrics: 0 PRs, 0 Issues
NVIDIA/TransformerEngine
- Key Activity:
- [2026-03-31] RELEASE: v2.13 - Major update for low precision training (FP8, MXFP8, NVFP4).
- Details:
- Enabled deterministic FP8 fused attention on Blackwell (SM100) GPUs.
- Introduced `GroupedTensor` for MoE expert weights in PyTorch.
- Metrics: 0 PRs, 0 Issues
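Context for the MoE grouping work: an MoE layer routes each token to a small top-k subset of experts, which is why expert weights batch naturally into grouped structures like the `GroupedTensor` above. A minimal top-k softmax router in plain Python (illustrative; the logits and k below are invented for the example):

```python
import math

def topk_route(router_logits, k=2):
    """Pick the k highest-probability experts for one token and renormalize
    their softmax weights so the selected gate weights sum to 1."""
    shift = max(router_logits)  # numerically stable softmax
    exps = [math.exp(x - shift) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(expert, probs[expert] / norm) for expert in top]

# One token's router logits over 4 experts: experts 2 and 0 score highest,
# so only those two experts' GEMMs run for this token.
assignments = topk_route([1.0, -0.5, 2.0, 0.1], k=2)
```

Running every selected expert as an independent GEMM is wasteful, which is why frameworks fuse them into grouped/batched GEMMs over contiguous per-expert weight buffers.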
deepseek-ai/DeepEP
- Key Activity: Active patches for massive-scale hardware routing.
- Details:
- Highlight Issue/PR: Fixing an NVLink domain over-count on MNNVL systems (GB200 NVL72) in `detect_accessible_ranks`.
- Metrics: 6 PRs, 7 Issues (2 PRs / 1 Issue closed)
NVIDIA/Megatron-LM
- Key Activity:
- [2026-03-20] RELEASE: core_v0.16.1 - Minor patch fixing async GC in persistent checkpoint worker loops and Mamba Uneven PP fixes.
- Metrics: 0 PRs, 0 Issues
JAX & Google Ecosystem
jax-ml/jax
- Key Activity:
- [2026-03-18] RELEASE: jax-v0.9.2 - Made `TypedNdArray` a subclass of `np.ndarray`.
- [2026-03-02] RELEASE: jax-v0.9.1 - Modified `jax.shard_map` in Explicit mode to raise errors instead of implicitly resharding when PartitionSpec inputs do not match `in_specs`.
- Metrics: 0 PRs, 0 Issues
AI-Hypercomputer/maxtext
- Key Activity:
- [2026-03-23] RELEASE: maxtext-v0.2.1 - Added support for DeepSeek-AI features: Conditional Memory via Scalable Lookup (Engram) and Manifold-Constrained Hyper-Connections (mHC).
- [2026-03-06] RELEASE: maxtext-v0.2.0 - Added Qwen3-Next, Muon optimizer, and DeepSeek V3.1 support.
- Details:
- Ironwood TPU co-designed AI stack announced in coordination with MaxText.
- Metrics: 241 PRs, 15 Issues (excellent throughput: 238 PRs / 10 Issues closed)
openxla/xla
- Key Activity: Constant maintenance and backend bug fixes.
- Details:
- Highlight PR: `[XLA:GPU][oneAPI][Bugfix] Fix unused globals for SPIR-V backend`
- Highlight Issue: `HLO verifier missing async pair check for kAllGatherStart/kAllGatherDone`
- Metrics: 1697 PRs, 34 Issues (1617 PRs / 13 Issues closed)
triton-lang/triton & tile-ai/tilelang
- Key Activity: Low-level compiler optimization and AMD-specific fixes.
- Details:
- Triton: Highlight PR `[AMD] Fix CanonicalizePointers to handle ptr mergepoints with different ptr promotability`
- TileLang: Feature requests for atomic ops on shared memory and injecting `Tcgen05Fence` passes.
- Metrics: Triton: 254 PRs, 27 Issues (237 PRs / 14 Issues closed). TileLang: 85 PRs, 32 Issues (76 PRs / 25 Issues closed).