πŸ“… Engineering Report (2026-03-01 - 2026-03-31)

πŸš€ Executive Summary

March 2026 was a landmark month for the AI engineering ecosystem, marked by sweeping framework upgrades prioritizing Mixture of Experts (MoE) architectures, native low-precision quantization (FP8, MXFP8, NVFP4), and speculative decoding. Major releases, including PyTorch 2.11.0, vLLM v0.18.x/v0.17.x, HuggingFace Transformers v5.4.0, and TorchAO v0.17.0, dominated the landscape.

Maintenance health across the ecosystem is exceptionally high. Heavyweight repositories like pytorch/pytorch (2,127 PRs closed) and openxla/xla (1,617 PRs closed) are sustaining enormous throughput. The focus across all stacks is clearly on optimizing inference latency for very large models (DeepSeek V3/R1, Qwen3.5, Llama 4) on both AMD’s and NVIDIA’s newest hardware architectures.

  • ROCm 7.2.1 Released: Introduced support for Ubuntu 24.04.4 and JAX 0.8.2, while deprecating the Offline Installer Creator and Ubuntu 24.04.3. It also delivered significant hipBLASLt performance improvements for MXFP8 and MXFP4 GEMMs.
  • Agentic Kernel Optimization Breakthroughs: AMD-AGI’s GEAK-agent launched both v1.0.0 and v2.0.0 this month. v2.0.0 introduced a Profiler-Analyzer loop using rocprof-compute telemetry and an LLM-based evaluator, achieving speedups of up to 7.02Γ— on ROCm-bench for Triton kernel optimizations.
  • PyTorch 2.11 ROCm Enhancements: PyTorch implemented β€œhipify v2”, completely removing legacy Caffe2 workarounds from the ROCm hipify preprocessing step. It also enabled scaled group mm on gfx950 and group gemm on gfx90a, alongside AOTriton updates to fix sliding window attention NaNs.
  • Third-Party Ecosystem Expansion:
    • vLLM added ROCm Sparse MLA CUDA graphs, MXFP4 MoE weight pre-shuffling for gfx950, and fused RoPE+KVCache via AITER.
    • SGLang shipped FP8 prefill integration for DeepSeek models on AMD hardware, MHA FP8-KV support, and fused GemmaRMSNorm forward_hip for Qwen3.5.
    • meta-pytorch/monarch added ROCm/HIP support for its RDMA stack.
    • llm-d introduced an official llm-d-rocm image.

πŸ“Š Competitive Analysis

  • NVIDIA Blackwell (SM100) Enablement: NVIDIA is rapidly maturing its Blackwell software stack. TransformerEngine v2.13 added deterministic FP8 fused attention and new MXFP8/NVFP4 MoE quantization kernels specifically optimized for SM100.
  • FlashAttention 4 Ecosystem Adoption: FA4 is officially rolling out. HuggingFace Transformers added FA4 fallback integration, and vLLM integrated it for MLA prefill.
  • DeepSeek Hardware Push: DeepSeek’s DeepEP is actively tuning for NVIDIA’s newest massive-scale hardware, patching NVLink domain over-counts specifically for GB200 NVL72 (MNNVL) architectures.
  • Deprecation of Legacy Hardware: PyTorch 2.11 removed Volta (SM 7.0) support from its CUDA 12.8+ binaries in order to adopt cuDNN 9.15.1, officially forcing legacy V100 users onto older builds or source compilation. Furthermore, PyPI wheels now ship with CUDA 13.0 by default.

πŸ“‚ Category Updates

🟒 AMD Ecosystem (ROCm & AMD-AGI)

AMD-AGI/GEAK-agent

  • Key Activity:
    • [2026-03-25] 🚨 RELEASE: v2.0.0 - Introduced GEAK-OptimAgentv2 (Instruction-to-Triton with multi-offspring evolution and hardware profiling) and GEAK-OpenEvolve (Triton-to-Triton Map-Elites search).
    • [2026-03-25] 🚨 RELEASE: v1.0.0 - Initial release of the Triton Kernel AI Agent architecture (generator, reflector, evaluator, optimizer) and GEAK-eval benchmarks.
  • Details:
    • [2026-03-31] Multiple Doc updates merging README and config fixes.
  • Metrics: 0 PRs 0 Issues (Excluding releases/doc updates)
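As background on the search technique named above: GEAK-OpenEvolve is described as a Map-Elites-based Triton-to-Triton search. The following is a minimal, illustrative MAP-Elites sketch only — it is not GEAK's implementation, and the `fitness` and `descriptor` callables are stand-ins for what would really be measured kernel latency and kernel-shape features:

```python
import random

def map_elites(fitness, descriptor, n_bins=10, iters=500, seed=0):
    """Toy MAP-Elites loop: keep the best solution found in each
    descriptor bin, mutating a randomly chosen elite at every step."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, solution)
    for _ in range(iters):
        if archive:
            _, parent = archive[rng.choice(sorted(archive))]
            x = parent + rng.gauss(0, 0.1)   # mutate an existing elite
        else:
            x = rng.uniform(0.0, 1.0)        # bootstrap from a random point
        x = min(max(x, 0.0), 1.0)            # keep solutions in-domain
        b = min(int(descriptor(x) * n_bins), n_bins - 1)
        f = fitness(x)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, x)              # new elite for this niche
    return archive
```

With, say, `fitness = lambda x: -(x - 0.5) ** 2` and `descriptor = lambda x: x`, the archive fills with one elite per tenth of the search space rather than collapsing onto a single optimum — the property that makes the method attractive for exploring diverse kernel variants.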

ROCm/ROCm

  • Key Activity:
    • [2026-03-25] 🚨 RELEASE: rocm-7.2.1 - Added Ubuntu 24.04.4 and JAX 0.8.2 support. Included firmware updates for MI355X, MI350X, MI325X, and MI300X. Discontinued the ROCm Offline Installer Creator.
  • Details:
    • [2026-03-10] Doc Update: Removed mention of ROCR and CLR from β€œWhat is ROCm” page.
    • Known Issue: ROCTracer might fail to report kernel operations (ROCTracer reaches end of support in 2026 Q2 and is being replaced by ROCprofiler-SDK).
    • Known Issue: AMDGPU SMU driver interface version mismatch on R9700.
  • Metrics: 57 PRs 41 Issues (Healthy closure rate: 52 PRs / 28 Issues closed)

AMD-AGI/Primus

  • Key Activity: Active PR development for new model capabilities.
  • Details:
    • Highlight PR: turbo update
    • Highlight PR: gpt-oss model support sink_sliding_window
  • Metrics: 67 PRs 0 Issues (62 PRs closed)

AMD-AGI/TraceLens

  • Key Activity: Bug fixes and new profiling integrations.
  • Details:
    • Highlight Issue: [Bug] TraceDiff fails to capture kernels for traces with shallow trees
    • Highlight PR: Add origami rooflines to unified performance report
    • Highlight PR: add support for traces generated by jax 08 suppress warning
  • Metrics: 36 PRs 29 Issues (34 PRs / 16 Issues closed)

ROCm/MAD

  • Key Activity: Docker and benchmarking updates.
  • Details:
    • [2026-03-19] Doc Update: Enable large EP microbenchmarks and add blueprints section.
    • Highlight PR: Updated docker base container and set PYTHONPATH env for sglang_disagg_inf.
  • Metrics: 6 PRs 0 Issues (6 PRs closed)

πŸ”΄ PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • [2026-03-23] 🚨 RELEASE: v2.11.0 - Massive release adding Differentiable Collectives, FlexAttention FlashAttention-4 backend (Hopper/Blackwell), MPS Operator Expansion, XPU Graph Support, and RNN/LSTM GPU Export.
  • Details:
    • PyPI wheels now default to CUDA 13.0 instead of 12.x.
    • Fully removed Caffe2 support from ROCm PyTorch’s hipify preprocessing (β€œhipify v2”).
    • Deprecated the MAGMA backend for linear algebra operations; cuSOLVER is now the default.
    • Removed torch.export.export_for_training and the PT2E quantization flow (migrated to torchao).
  • Metrics: 2273 PRs 593 Issues (Exceptional maintenance health: 2127 PRs / 629 Issues closed)

pytorch/ao

  • Key Activity:
    • [2026-03-30] 🚨 RELEASE: v0.17.0 - Added CuteDSL MXFP8 MoE kernels and Per-Head FP8 Quantized Low Precision Attention via FA3.
  • Details:
    • Declared ABI stability for all CUDA kernels (0.17.0+ works with any PyTorch 2.11+).
    • Added ROCm support for the FP8 data type in scaled_grouped_mm on gfx942.
  • Metrics: 0 PRs 35 Issues
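As background on the FP8 E4M3 format these low-precision kernels target (4 exponent bits, 3 mantissa bits, largest finite value 448), here is a toy pure-Python round-to-nearest quantizer. It is illustrative only — real kernels operate on packed 8-bit tensors with scaling factors, and this scalar sketch ignores NaN encoding:

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value
    (toy scalar model: saturates at +/-448, ignores NaN encoding)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)           # saturate out-of-range inputs
    # binade exponent, clamped at -6 where subnormals begin
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)               # 3 mantissa bits => 8 steps per binade
    q = round(mag / step) * step
    return sign * min(q, E4M3_MAX)
```

For example, quantize_e4m3(0.3) returns 0.3125 — the coarse spacing that makes per-tensor or per-block scaling essential in FP8 training and inference.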

pytorch/vision & pytorch/audio

  • Key Activity:
    • [2026-03-23] 🚨 RELEASE: vision v0.26.0 & audio v2.11.0 - Vision completely removed deprecated video decoding/encoding utilities (migrated users to TorchCodec). Both releases ensure torch 2.11 compatibility.
  • Metrics: 0 PRs 0 Issues

pytorch/FBGEMM

  • Key Activity:
    • [2026-03-30] 🚨 RELEASE: v1.6.0 - TBE performance upgrades, FP16 support for grouped GEMM wgrad/dgrad, and Python 3.14 support.
  • Details:
    • Optimized ROCm kernels (group_index_select_or_add_2d_kernel, sparse operations).
  • Metrics: 0 PRs 0 Issues

meta-pytorch/monarch

  • Key Activity:
    • [2026-03-26] 🚨 RELEASE: v0.4.0 - EFA support for RDMA, TCP fallback, and new ROCm/HIP support for AMD GPU deployments. Added a built-in observability dashboard.
  • Metrics: 0 PRs 0 Issues

πŸ”΅ AI Frameworks & Serving

vllm-project/vllm

  • Key Activity:
    • [2026-03-20] 🚨 RELEASE: v0.18.0 - Added gRPC serving support, GPU-less Render Serving, NGram GPU Speculative Decoding, and removed Ray as a default dependency.
    • [2026-03-07] 🚨 RELEASE: v0.17.0 - PyTorch 2.10 upgrade, FlashAttention 4 integration, and full support for the Qwen3.5 model family.
  • Details:
    • [2026-03-31] Release v0.18.1 (Patch): Changed default SM100 MLA prefill backend back to TRT-LLM.
    • Heavy ROCm optimization across 0.17 and 0.18: Sparse MLA CUDA graphs, AITER fused RoPE+KVCache, MXFP4 MoE weight pre-shuffling.
  • Metrics: 0 PRs 0 Issues (Tracked via releases)
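On the NGram speculative decoding feature mentioned above: the general n-gram method proposes draft tokens by matching the sequence's trailing n-gram against earlier context, then verifies the drafts with the target model in a single forward pass. A toy CPU sketch of the proposal step (illustrative only; vLLM's version runs on GPU and its matching heuristics may differ):

```python
def ngram_draft(tokens, n=2, k=3):
    """Propose up to k draft tokens by matching the trailing n-gram
    against earlier context (toy CPU version of the proposal step)."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # scan backwards for the most recent earlier occurrence of the tail
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # the tokens that followed that occurrence become the draft
            return tokens[i + n:i + n + k]
    return []
```

Given the history [1, 2, 3, 4, 1, 2], the trailing bigram [1, 2] also occurs at the start, so [3, 4, 1] is proposed; the target model then accepts or rejects those drafts, which pays off on repetitive text.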

huggingface/transformers

  • Key Activity:
    • [2026-03-27] 🚨 RELEASE: v5.4.0 - Added VidEoMT, UVDoc, Jina Embeddings v3, Mistral4, PI0, and CHMv2. Flash Attention 2 support now requires version 2.3.3+, and FA4 initial support was merged.
    • [2026-03-04] 🚨 RELEASE: v5.3.0 - Added EuroBERT, VibeVoice ASR, TimesFM2.5, and Higgs Audio V2.
  • Details:
    • Removed the cache_position argument from the forward signatures of most major models.
    • Deprecated multiple pipeline tasks in a massive V5 cleanup.
  • Metrics: 597 PRs 168 Issues (Very high health: 564 PRs / 178 Issues closed)

sgl-project/sglang

  • Key Activity:
    • [2026-03-28] 🚨 RELEASE: v0.5.10rc0 - Piecewise CUDA graph enabled by default. Upgraded to Transformers 5.3.0. Added Elastic EP for Partial Failure Tolerance.
  • Details:
    • Added AMD FP8 prefill integration with radix cache path for DeepSeek models.
    • Integrated FlashInfer MXFP8 Kernels for GEMM and MoE operations.
  • Metrics: 0 PRs 0 Issues (Tracked via releases)

volcengine/verl

  • Key Activity:
    • [2026-03-16] 🚨 RELEASE: v0.7.1 - Megatron Model Engine updates supporting MTP training in SFT/RL, new VeOmni training backend, and TorchTitan backend.
  • Metrics: 0 PRs 0 Issues

THUDM/slime

  • Key Activity:
    • [2026-03-29] 🚨 RELEASE: v0.2.4 - Consolidated router stack onto sgl-router, added a rollout trace timeline viewer.
    • [2026-03-12] 🚨 RELEASE: v0.2.3 - YAML-based sglang_config support and removed FSDP support to focus on active training paths.
  • Metrics: 0 PRs 0 Issues

llm-d/llm-d

  • Key Activity:
    • [2026-03-05] 🚨 RELEASE: v0.5.1 - Component updates across inference-scheduler, routing-sidecar, and cuda/xpu/cpu/rocm/hpu image variants.
  • Details:
    • Added llm-d-rocm image and debug variants for CUDA images.
  • Metrics: 0 PRs 0 Issues

🟠 DeepLearning & System Infrastructure

deepspeedai/DeepSpeed

  • Key Activity:
    • [2026-03-30] 🚨 RELEASE: v0.18.9 - Added Universal Checkpoint for AutoTP and Muon Optimizer Support for ZeRO Stage 3.
    • [2026-03-05] 🚨 RELEASE: v0.18.7 - EXAONE 4.0 model support for Inference V2.
  • Details:
    • Fixed ROCm GPU architecture detection by removing unnecessary shell=True.
    • Fixed ROCm BF16 conversion intrinsics in inference v2.
  • Metrics: 0 PRs 0 Issues

NVIDIA/TransformerEngine

  • Key Activity:
    • [2026-03-31] 🚨 RELEASE: v2.13 - Major update for low precision training (FP8, MXFP8, NVFP4).
  • Details:
    • Enabled deterministic FP8 fused attention on Blackwell (SM100) GPUs.
    • Introduced GroupedTensor for MoE expert weights in PyTorch.
  • Metrics: 0 PRs 0 Issues

deepseek-ai/DeepEP

  • Key Activity: Active patches for massive-scale hardware routing.
  • Details:
    • Highlight Issue/PR: Fixing NVLink domain over-count on MNNVL systems (GB200 NVL72) in detect_accessible_ranks.
  • Metrics: 6 PRs 7 Issues (2 PRs / 1 Issue closed)
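For context on what DeepEP dispatches: in expert parallelism, each token is routed to the experts selected by a top-k gate before the all-to-all exchange across NVLink/RDMA domains. A toy sketch of such a gate (illustrative only, not DeepEP's kernel):

```python
import math

def topk_gate(logits, k=2):
    """Toy top-k MoE gate: softmax the router logits, keep the k
    highest-probability experts, and renormalize their weights."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    # (expert index, routing weight) pairs; weights sum to 1
    return [(i, probs[i] / norm) for i in top]
```

The resulting (expert, weight) pairs determine which ranks each token must reach, which is why gate output directly shapes the communication pattern DeepEP optimizes.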

NVIDIA/Megatron-LM

  • Key Activity:
    • [2026-03-20] 🚨 RELEASE: core_v0.16.1 - Minor patch fixing async GC in persistent checkpoint worker loops and uneven pipeline-parallel (PP) handling for Mamba.
  • Metrics: 0 PRs 0 Issues

🟑 JAX & Google Ecosystem

jax-ml/jax

  • Key Activity:
    • [2026-03-18] 🚨 RELEASE: jax-v0.9.2 - Made TypedNdArray a subclass of np.ndarray.
    • [2026-03-02] 🚨 RELEASE: jax-v0.9.1 - Modified jax.shard_map in Explicit mode to raise errors instead of implicitly resharding if PartitionSpec inputs do not match in_specs.
  • Metrics: 0 PRs 0 Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2026-03-23] 🚨 RELEASE: maxtext-v0.2.1 - Added support for DeepSeek-AI features: Conditional Memory via Scalable Lookup (Engram) and Manifold-Constrained Hyper-Connections (mHC).
    • [2026-03-06] 🚨 RELEASE: maxtext-v0.2.0 - Added Qwen3-Next, Muon optimizer, and DeepSeek V3.1 support.
  • Details:
    • Announced the Ironwood TPU co-designed AI stack in coordination with MaxText.
  • Metrics: 241 PRs 15 Issues (Excellent throughput: 238 PRs / 10 Issues closed)

openxla/xla

  • Key Activity: Constant maintenance and backend bug fixes.
  • Details:
    • Highlight PR: [XLA:GPU][oneAPI][Bugfix] Fix unused globals for SPIR-V backend
    • Highlight Issue: HLO verifier missing async pair check for kAllGatherStart/kAllGatherDone
  • Metrics: 1697 PRs 34 Issues (1617 PRs / 13 Issues closed)

triton-lang/triton & tile-ai/tilelang

  • Key Activity: Low-level compiler optimization and AMD-specific fixes.
  • Details:
    • Triton: Highlight PR [AMD] Fix CanonicalizePointers to handle ptr mergepoints with different ptr promotability
    • TileLang: Feature requests for atomic ops on shared memory and injecting Tcgen05Fence passes.
  • Metrics: Triton: 254 PRs 27 Issues (237 PRs / 14 Issues closed). TileLang: 85 PRs 32 Issues (76 PRs / 25 Issues closed).