📅 Engineering Report (2025-12-01 - 2025-12-31)

🚀 Executive Summary

December 2025 was a heavy month for AI engineering ecosystems, driven largely by the race to optimize and train the next generation of frontier architectures (DeepSeek-V3/R1 and Qwen3) alongside rapid advances in low-precision quantization (MXFP8, NVFP4). Maintenance health across the board is remarkably strong: major repositories such as PyTorch, XLA, SGLang, and Primus showed excellent PR closure rates (e.g., Primus closed 143/149 PRs, XLA closed 1143/1213), indicating robust CI/CD and active maintainer communities keeping pace with fast innovation cycles.

There was a coordinated push across PyTorch, TransformerEngine, and SGLang to support complex Mixture-of-Experts (MoE) paradigms, fully sync-free operation, and communication/compute overlap mechanisms (Pipeline Parallelism/Expert Parallelism).

  • DeepSeek-V3 and Qwen3 on MI300X/MI355X: AMD-AGI/Primus shipped back-to-back major releases (v0.5.0 & v0.6.0) that directly bring DeepSeek-V3 (16B & 671B) and Qwen3 configurations to MI300X and MI355X hardware.
  • MoE & Memory Optimizations: Primus implemented fully sync-free MoE stage 3 and Megatron's All-to-All/DeepEP overlap in pipelines, substantially reducing communication overhead for massive MoE models.
  • Ecosystem Integrations: torchtitan added an automated ROCm workflow on a cron schedule (main branch). TileLang integrated FlashAttention-2 forward passes specifically for AMD MI300X and updated its CI to ROCm 7.1.
  • New AMD Tools: AMD officially open-sourced the IRLens tool inside Primus to assist with profiling nested sub-computations and intermediate representations.

Competitive Analysis

  • NVIDIA’s Heavy Push for NVFP4/GB200: NVIDIA’s Megatron-LM and TransformerEngine saw synchronized releases focusing heavily on NVFP4 (Zero Padding for MoE, GroupedLinear recipes) and expanding CUDA Graphs to cover quantized weights with Tensor Parallelism.
  • PyTorch AO + Blackwell/Crusoe Benchmarks: The PyTorch/AO v0.15.0 release highlighted MXFP8 MoE training on a 64-node GB200 Crusoe cluster, citing a 1.2x end-to-end training speedup over BF16 with identical convergence. This sets a high performance bar for MXFP8 implementations going forward.
  • SGLang Dominance in Serving: SGLang shipped three major releases, delivering industry-first WASM middleware support, a Unified Inference Gateway (IGW) mode, and zero-day optimization for DeepSeek V3.2. Its rapid adoption of new architectures makes it a formidable benchmark target for AMD's internal serving stacks.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-12-04] 🚨 RELEASE: v0.5.0
    • [2025-12-20] 🚨 RELEASE: v0.6.0 (Building Docker v25.11)
  • Details:
    • Model Enablement: Added DeepSeek-V3 (16B & 671B), Qwen3 (0.6B/1.7B/32B), and Llama 3.3 configs for MI300X and MI355X architectures.
    • MoE & Communications: Integrated Megatron's A2A and DeepEP overlap in pipelines; supported fully sync-free MoE stage 3 (see the overlap sketch after this list).
    • Turbo Integration: The Megatron backend now uses PrimusTurboSpecProvider; added Turbo RMSNorm patches and Turbo FP8 grouped GEMM.
    • Tooling: Open-sourced the IRLens tool. Re-architected the Runner CLI with patch execution systems and dynamic model parameter overrides.
  • Metrics: 149 PRs, 1 issue (very healthy maintenance: 143 PRs closed)
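
For context, the A2A/DeepEP overlap pattern boils down to launching the token dispatch asynchronously and doing independent work while the collective is in flight. A minimal PyTorch sketch of that pattern (not Primus's actual implementation; `shared_expert` is a hypothetical stand-in for compute that does not depend on the dispatched tokens):

```python
# Overlap MoE token dispatch (all-to-all) with independent compute.
# Assumes torch.distributed is already initialized with a process group.
import torch
import torch.distributed as dist

def dispatch_with_overlap(tokens: torch.Tensor, shared_expert):
    out = torch.empty_like(tokens)
    # Launch the expert-parallel all-to-all without blocking the stream.
    work = dist.all_to_all_single(out, tokens, async_op=True)
    # Overlap: run the shared-expert branch while tokens are in flight.
    shared_out = shared_expert(tokens)
    work.wait()  # Routed tokens have now arrived in `out`.
    return out, shared_out
```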

ROCm/ROCm & ROCm/MAD

  • Key Activity:
    • [2025-12-01 to 12-31] Heavy bug squashing and documentation updates for xDiT.
  • Details:
    • Investigating Memory Access Faults on gfx1151 (Strix Halo) and HSA Queue Creation failures on ARM64 RDNA3 GPUs.
    • MAD repo merged PyTorch/Megatron-LM training v25.11 and multi-node support features.
  • Metrics: ROCm: 65 PRs, 30 issues. MAD: 11 PRs, 1 issue.

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-12-23] Added Rocprofv3 profile data support.
  • Details:
    • Enabled GPU-only trace support in perf-report generation; tracking feature requests for direct SharePoint file loading.
  • Metrics: 8 PRs, 2 issues

PyTorch Ecosystem

pytorch/torchtitan

  • Key Activity:
    • [2025-12-26] 🚨 RELEASE: v0.2.1
  • Details:
    • Features: Rewrote parallel_dims using DeviceMesh unflatten (see the sketch after this list). Integrated DeepEP for advanced MoE routing.
    • Models: Added Context Parallelism to Flux model training and enabled GPT-OSS.
    • ROCm: Added an automated ROCm workflow on a cron schedule.
  • Metrics: 76 PRs, 20 issues
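
A minimal sketch of the named-dimension DeviceMesh API that the parallel_dims rewrite builds on (the actual unflatten-based construction in v0.2.1 is more involved; the 2x4 mesh shape and dim names here are illustrative):

```python
# Build one world mesh with named dims, then slice per-parallelism
# submeshes out of it. Assumes 8 ranks launched via torchrun.
import torch
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_mesh = mesh["dp"]            # 1-D submesh for data-parallel collectives
tp_mesh = mesh["tp"]            # 1-D submesh for tensor-parallel collectives
dp_group = dp_mesh.get_group()  # plain ProcessGroup for torch.distributed ops
```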

pytorch/ao

  • Key Activity:
    • [2025-12-22] 🚨 RELEASE: v0.15.0
  • Details:
    • Performance: Highlighted MXFP8 MoE training on GB200 clusters yielding a 1.2x speedup over BF16.
    • Features: Safetensors support for torchao models (integrated with Hugging Face and vLLM). Enabled FQN-specific (fully qualified name) parameter quantization via FqnToConfig for complex MoE models (see the sketch after this list).
    • BC Breaking: Cleaned up and standardized config names (e.g., int4_weight_only -> Int4WeightOnlyConfig).
  • Metrics: 0 PRs (tracked externally), 18 issues
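
A hedged sketch of FQN-scoped quantization based on the release notes; the FqnToConfig import path and mapping semantics are assumptions taken from the notes rather than verified API, and Int4WeightOnlyConfig is the renamed config cited under BC Breaking:

```python
# Quantize only expert weights by fully qualified name, leaving the
# router in high precision. FqnToConfig usage is inferred from the
# v0.15.0 release notes and may differ in detail.
import torch
from torch import nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig, FqnToConfig

class TinyMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(256, 256) for _ in range(2)])
        self.router = nn.Linear(256, 2)

model = TinyMoE().to(device="cuda", dtype=torch.bfloat16)
quantize_(model, FqnToConfig({
    "experts.0": Int4WeightOnlyConfig(),
    "experts.1": Int4WeightOnlyConfig(),
}))
```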

pytorch/pytorch

  • Key Activity:
    • [2025-12-01 to 12-31] Ongoing core maintenance.
  • Details:
    • Merged ROCm fixes for AMDSMI return types. Tracking Inductor AOTAutograd fusion bugs and varlen attention window-size exposure.
  • Metrics: 1811 PRs, 500 issues (healthy: 1732 PRs closed)

meta-pytorch/monarch

  • Key Activity:
    • [2025-12-22] 🚨 RELEASE: v0.2.0
  • Details:
    • Focused on correctness, robustness, and Kubernetes (K8s) readiness: enforced a strict supervision hierarchy, refined HostMesh lifecycle control, and completely removed legacy v0.

NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • [2025-12-17] 🚨 RELEASE: core_v0.15.0
  • Details:
    • Performance: Fused QKV preprocessing with precomputed RoPE caches, yielding a 10-14% end-to-end speedup (see the sketch after this list). Added CPU activation offloading via TE.
    • MoE: DTensor support for Expert Parallelism (EP) and DeepSeek-V3 (DSv3) modules. Implemented NVFP4 zero padding for MoE.
    • FSDP: Enabled joint training of parallel modules.
    • Inference: Added a CUDA Graph runner lookup-table cache (up to 2x E2E speedup).
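
To make the RoPE-cache item concrete, a generic sketch of the precompute-and-reuse idea (not Megatron's fused kernel): frequency tables are built once per sequence length and shared by every layer:

```python
# Precompute RoPE cos/sin tables once; each attention layer reuses them
# instead of recomputing rotary frequencies per call.
import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # [seq, head_dim/2]
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: [seq, n_heads, head_dim]; rotate interleaved pairs (x1, x2).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = cos[:, None, :], sin[:, None, :]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

cos, sin = build_rope_cache(seq_len=128, head_dim=64)
q = torch.randn(128, 8, 64)
q_rot = apply_rope(q, cos, sin)
```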

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-12-11] 🚨 RELEASE: v2.10
  • Details:
    • PyTorch: Added an NVFP4 training recipe for GroupedLinear (see the sketch after this list). CUDA Graphs are now supported for quantized weights with Tensor Parallelism. Added Sliding Window Attention (SWA) with Context Parallelism.
    • JAX: Added support for concurrent Data Parallelism (DP) and Fully-Sharded Data Parallelism (FSDP).
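
A hedged sketch of quantized GroupedLinear usage, shown with the long-standing DelayedScaling FP8 recipe; per the v2.10 notes the new NVFP4 recipe plugs into the same fp8_recipe slot (its exact class name is not shown here). GroupedLinear batches the per-expert GEMMs of an MoE layer, with m_splits giving the token count routed to each expert:

```python
# One GroupedLinear replaces num_experts separate per-expert Linears.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

num_experts, hidden, ffn = 4, 1024, 4096
grouped = te.GroupedLinear(num_experts, hidden, ffn, bias=False,
                           params_dtype=torch.bfloat16).cuda()

tokens = torch.randn(512, hidden, device="cuda", dtype=torch.bfloat16)
m_splits = [128, 128, 128, 128]  # tokens per expert; must sum to 512

recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = grouped(tokens, m_splits)
```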

JAX / XLA Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-12-18] 🚨 RELEASE: jax-v0.8.2
  • Details:
    • Deprecated jax.lax.pvary and all symbols in jax.interpreters.pxla. Tracer objects no longer inherit from jax.Array at runtime (illustrated in the sketch below).
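
A minimal sketch of why the Tracer change is user-visible: inside jit, values are Tracers, and they no longer pass isinstance checks against jax.Array:

```python
# Per the jax-v0.8.2 notes, Tracers no longer subclass jax.Array, so
# isinstance-based dispatch inside jitted code stops matching them.
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # Previously True while tracing; now False under jax-v0.8.2.
    print("traced value is a jax.Array:", isinstance(x, jax.Array))
    return x * 2

f(jnp.ones(3))
```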

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-12-12] 🚨 RELEASE: maxtext-tutorial-v1.4.0
    • [2025-12-30] 🚨 RELEASE: maxtext-tutorial-v1.5.0
  • Details:
    • Added Muon optimizer integration. Merged packing support for Context Parallelism (Ring Attention).
  • Metrics: 131 PRs, 12 issues (healthy)

openxla/xla

  • Key Activity:
    • [2025-12-01 to 12-31] High volume optimization and automated refactoring.
  • Details:
    • Tracking a severe segmentation fault in the HLO HTML dumper caused by stale fusion state.
  • Metrics: 1213 PRs, 16 issues (highly active: 1143 PRs closed)

Serving, Inference & RLHF Ecosystem

sgl-project/sglang

  • Key Activity:
    • [2025-12-03] 🚨 RELEASE: v0.5.6
    • [2025-12-10] 🚨 RELEASE: gateway-v0.2.4
    • [2025-12-24] 🚨 RELEASE: gateway-v0.3.0
  • Details:
    • Gateway v0.3.0: A major architectural shift: released the Unified Inference Gateway (IGW) mode to manage entire fleets from a single gateway, added UUID-based worker resource management, and completely redesigned the six-layer metrics architecture.
    • Gateway v0.2.4: Industry-first WASM middleware support and full OpenTelemetry integration.
    • Engine v0.5.6: Zero-day support for DeepSeek V3.2 / Speciale (see the serving sketch after this list). Integrated JIT kernels. Support for new blockwise diffusion language models (Flux2, Z-Image).
  • Metrics: Very high PR volume (300+ merged across the releases)
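
For reference, a hedged sketch of serving such a model offline through SGLang's Python Engine API; the model path and tp_size are illustrative, and the dict-based sampling params follow the engine's documented style:

```python
# Offline batch generation with the sglang Engine.
import sglang as sgl

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V3.2-Exp", tp_size=8)
outputs = llm.generate(
    ["Explain expert parallelism in one sentence."],
    {"temperature": 0.0, "max_new_tokens": 64},
)
print(outputs[0]["text"])
```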

THUDM/slime

  • Key Activity:
    • [2025-12-01] 🚨 RELEASE: v0.2.0.post1
    • [2025-12-12] 🚨 RELEASE: v0.2.1
  • Details:
    • Added true on-policy training on Qwen3-VL (dense) using VLM + FSDP.
    • Integrated PD-disaggregation support during rollout and DP-attention support in rollout routing replay (R3). Upgraded backend dependency to SGLang v0.5.6.

xdit-project/xDiT

  • Key Activity:
    • [2025-12-01 to 12-31] Model and backend expansion.
  • Details:
    • Added support for HunyuanVideo-1.5 and Z-Image Turbo. Integrated custom attention backend support.
  • Metrics: 7 PRs, 3 issues

Compilers & Low-Level Optimizations

tile-ai/tilelang

  • Key Activity:
    • [2025-12-07] 🚨 RELEASE: v0.1.7
    • [2025-12-24] 🚨 RELEASE: v0.1.7.post1
    • [2025-12-31] 🚨 RELEASE: v0.1.7.post2
  • Details:
    • Hardware Adapters: Enabled FlashAttention-2 forward on AMD MI300X, fixed MLA autotune for ROCm, updated CI to ROCm 7.1, and introduced Huawei Ascend support.
    • Core Enhancements: Added a CuTeDSL backend, integrated Z3 into the TVM Arith Analyzer, added JIT lazy execution, and supported FP8-to-FP32 vectorized casts (see the sketch after this list).
  • Metrics: 173 PRs, 55 issues (healthy: 166 PRs closed)
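
As a conceptual illustration of the FP8-to-FP32 cast item (plain PyTorch semantics, not TileLang's vectorized kernel): the narrowing cast rounds to the nearest representable float8 value, and the widening cast back to FP32 is exact:

```python
# FP8 (e4m3fn) round-trip: narrowing is lossy, widening is exact.
import torch

x = torch.tensor([0.1, 1.5, 448.0])   # 448 is the e4m3fn max normal
x_fp8 = x.to(torch.float8_e4m3fn)     # lossy narrowing cast
x_fp32 = x_fp8.to(torch.float32)      # exact widening cast
print(x_fp32)
```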

triton-lang/triton

  • Key Activity:
    • [2025-12-01 to 12-31] General compiler passes and hardware specific lowering.
  • Details:
    • The AMD WarpPipeliner gained support for single-block execute regions in UpdateAsyncWaitCount. Tracking frontend bugs that produce invalid IR when SSA values appear in constant-range loops.
  • Metrics: 238 PRs, 33 issues (healthy: 232 PRs closed)

deepspeedai/DeepSpeed

  • Key Activity:
    • [2025-12-09] 🚨 RELEASE: v0.18.3
  • Details:
    • Added Muon optimizer support, allowing separate "muon_lr" and "adam_lr" learning rates (see the config sketch below).
    • Relaxed tolerances for FP8 unit tests specifically for ROCm (FP16 and BF16 cases). Added Qwen2.5 to the AutoTP model list.
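
A hedged sketch of what the split learning rates might look like in a DeepSpeed config; the "muon_lr"/"adam_lr" keys come from the v0.18.3 notes, while the surrounding schema is ordinary DeepSpeed config structure (the optimizer type string and exact params are assumptions):

```python
# Hypothetical DeepSpeed config using the new Muon support.
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "Muon",
        "params": {
            "muon_lr": 2e-2,  # Muon path (typically 2-D weight matrices)
            "adam_lr": 3e-4,  # Adam fallback (embeddings, norms, biases)
        },
    },
}
# Passed as usual, e.g. deepspeed.initialize(model=..., config=ds_config)
```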