📅 Engineering Report (2025-12-01 - 2025-12-31)

🚀 Executive Summary

December 2025 was a heavy month for AI engineering ecosystems, driven largely by the race to optimize and train the next generation of frontier architectures (DeepSeek-V3/R1 and Qwen3) alongside rapid advances in low-precision quantization (MXFP8, NVFP4). Maintenance health across the board is remarkably strong: major repositories such as PyTorch, XLA, SGLang, and Primus showed excellent PR closure rates (e.g., Primus closed 143/149 PRs, XLA closed 1143/1213), indicating robust CI/CD and active maintainer communities keeping pace with fast innovation cycles.

There was a coordinated push across PyTorch, TransformerEngine, and SGLang to support complex Mixture-of-Experts (MoE) paradigms, fully sync-free operation, and communication/compute overlap mechanisms (Pipeline Parallelism/Expert Parallelism).

  • DeepSeek-V3 and Qwen3 on MI300X/MI355X: AMD-AGI/Primus shipped back-to-back major releases (v0.5.0 & v0.6.0) that directly bring DeepSeek-V3 (16B & 671B) and Qwen3 configurations to MI300X and MI355X hardware.
  • MoE & Memory Optimizations: Primus implemented fully sync-free MoE stage 3 and Megatron's All-to-All/DeepEP overlap in pipelines, substantially reducing communication overhead for massive MoE models.
  • Ecosystem Integrations: torchtitan added an automated ROCm workflow on a cron schedule (main branch). TileLang integrated FlashAttention-2 forward passes specifically for AMD MI300X and updated its CI to ROCm 7.1.
  • New AMD Tools: AMD officially open-sourced the IRLens tool inside Primus to assist with profiling nested sub-computations and intermediate representations.

Competitive Analysis

  • NVIDIA’s Heavy Push for NVFP4/GB200: NVIDIA’s Megatron-LM and TransformerEngine saw synchronized releases focusing heavily on NVFP4 (Zero Padding for MoE, GroupedLinear recipes) and expanding CUDA Graphs to cover quantized weights with Tensor Parallelism.
  • PyTorch AO + Blackwell/Crusoe Benchmarks: The PyTorch/AO v0.15.0 release highlighted MXFP8 MoE training on a 64-node GB200 Crusoe cluster, citing a 1.2x end-to-end training speedup over BF16 with identical convergence. This sets a high performance bar for MXFP8 implementations going forward.
  • SGLang Dominance in Serving: SGLang shipped three major releases, delivering industry-first WASM middleware support, a Unified Inference Gateway (IGW) mode, and zero-day optimization for DeepSeek V3.2. Its rapid adoption of new architectures makes it a formidable benchmark target for AMD's internal serving stacks.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-12-04] 🚨 RELEASE: v0.5.0
    • [2025-12-20] 🚨 RELEASE: v0.6.0 (Building Docker v25.11)
  • Details:
    • Model Enablement: Added DeepSeek-V3 (16B & 671B), Qwen3 (0.6B/1.7B/32B), and Llama 3.3 configs for MI300X and MI355X architectures.
    • MoE & Communications: Integrated Megatron's A2A and DeepEP overlap in pipelines; supported fully sync-free MoE stage 3 (see the overlap sketch after this list).
    • Turbo Integration: The Megatron backend now uses PrimusTurboSpecProvider; added Turbo RMSNorm patches and Turbo FP8 grouped GEMM.
    • Tooling: Open-sourced the IRLens tool. Re-architected the Runner CLI with patch execution systems and dynamic model parameter overrides.
  • Metrics: 149 PRs, 1 issue (very healthy maintenance: 143 PRs closed)
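
For context, the A2A/DeepEP overlap pattern boils down to launching the token dispatch asynchronously and doing independent work while the collective is in flight. A minimal PyTorch sketch of that pattern (not Primus's actual implementation; `shared_expert` is a hypothetical stand-in for compute that does not depend on the dispatched tokens):

```python
# Overlap MoE token dispatch (all-to-all) with independent compute.
# Assumes torch.distributed is already initialized with a process group.
import torch
import torch.distributed as dist

def dispatch_with_overlap(tokens: torch.Tensor, shared_expert):
    out = torch.empty_like(tokens)
    # Launch the expert-parallel all-to-all without blocking the stream.
    work = dist.all_to_all_single(out, tokens, async_op=True)
    # Overlap: run the shared-expert branch while tokens are in flight.
    shared_out = shared_expert(tokens)
    work.wait()  # Routed tokens have now arrived in `out`.
    return out, shared_out
```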

ROCm/ROCm & ROCm/MAD

  • Key Activity:
    • [2025-12-01 to 12-31] Heavy bug squashing and documentation updates for xDiT.
  • Details:
    • Investigating Memory Access Faults on gfx1151 (Strix Halo) and HSA Queue Creation failures on ARM64 RDNA3 GPUs.
    • MAD repo merged PyTorch/Megatron-LM training v25.11 and multi-node support features.
  • Metrics: ROCm: 65 PRs, 30 issues. MAD: 11 PRs, 1 issue.

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-12-23] Added Rocprofv3 profile data support.
  • Details:
    • Enabled GPU-only trace support in perf-report generation; tracking feature requests for direct SharePoint file loading.
  • Metrics: 8 PRs, 2 issues

PyTorch Ecosystem

pytorch/torchtitan

  • Key Activity:
    • [2025-12-26] 🚨 RELEASE: v0.2.1
  • Details:
    • Features: Rewrote parallel_dims using DeviceMesh unflatten (see the sketch after this list). Integrated DeepEP for advanced MoE routing.
    • Models: Added Context Parallelism to Flux model training and enabled GPT-OSS.
    • ROCm: Added an automated ROCm workflow on a cron schedule.
  • Metrics: 76 PRs, 20 issues
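
A minimal sketch of the named-dimension DeviceMesh API that the parallel_dims rewrite builds on (the actual unflatten-based construction in v0.2.1 is more involved; the 2x4 mesh shape and dim names here are illustrative):

```python
# Build one world mesh with named dims, then slice per-parallelism
# submeshes out of it. Assumes 8 ranks launched via torchrun.
import torch
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_mesh = mesh["dp"]            # 1-D submesh for data-parallel collectives
tp_mesh = mesh["tp"]            # 1-D submesh for tensor-parallel collectives
dp_group = dp_mesh.get_group()  # plain ProcessGroup for torch.distributed ops
```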

pytorch/ao

  • Key Activity:
    • [2025-12-22] 🚨 RELEASE: v0.15.0
  • Details:
    • Performance: Highlighted MXFP8 MoE training on GB200 clusters yielding a 1.2x speedup over BF16.
    • Features: Safetensors support for torchao models (integrated with Hugging Face and vLLM). Enabled FQN-specific (fully qualified name) parameter quantization via FqnToConfig for complex MoE models (see the sketch after this list).
    • BC Breaking: Cleaned up and standardized config names (e.g., int4_weight_only -> Int4WeightOnlyConfig).
  • Metrics: 0 PRs (tracked externally), 18 issues
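
A hedged sketch of FQN-scoped quantization based on the release notes; the FqnToConfig import path and mapping semantics are assumptions taken from the notes rather than verified API, and Int4WeightOnlyConfig is the renamed config cited under BC Breaking:

```python
# Quantize only expert weights by fully qualified name, leaving the
# router in high precision. FqnToConfig usage is inferred from the
# v0.15.0 release notes and may differ in detail.
import torch
from torch import nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig, FqnToConfig

class TinyMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(256, 256) for _ in range(2)])
        self.router = nn.Linear(256, 2)

model = TinyMoE().to(device="cuda", dtype=torch.bfloat16)
quantize_(model, FqnToConfig({
    "experts.0": Int4WeightOnlyConfig(),
    "experts.1": Int4WeightOnlyConfig(),
}))
```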

pytorch/pytorch

  • Key Activity:
    • [2025-12-01 to 12-31] Ongoing core maintenance.
  • Details:
    • Merged ROCm fixes for AMDSMI return types. Tracking Inductor AOTAutograd fusion bugs and varlen attention window-size exposure.
  • Metrics: 1811 PRs, 500 issues (healthy: 1732 PRs closed)

meta-pytorch/monarch

  • Key Activity:
    • [2025-12-22] 🚨 RELEASE: v0.2.0
  • Details:
    • Focused on correctness, robustness, and Kubernetes (K8s) readiness: enforced a strict supervision hierarchy, refined HostMesh lifecycle control, and completely removed legacy v0.

NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • [2025-12-17] 🚨 RELEASE: core_v0.15.0
  • Details:
    • Performance: Fused QKV preprocessing with precomputed RoPE caches, yielding a 10-14% end-to-end speedup (see the sketch after this list). Added CPU activation offloading via TE.
    • MoE: DTensor support for Expert Parallelism (EP) and DeepSeek-V3 (DSv3) modules. Implemented NVFP4 zero padding for MoE.
    • FSDP: Enabled joint training of parallel modules.
    • Inference: Added a CUDA Graph runner lookup-table cache (up to 2x E2E speedup).
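
To make the RoPE-cache item concrete, a generic sketch of the precompute-and-reuse idea (not Megatron's fused kernel): frequency tables are built once per sequence length and shared by every layer:

```python
# Precompute RoPE cos/sin tables once; each attention layer reuses them
# instead of recomputing rotary frequencies per call.
import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # [seq, head_dim/2]
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: [seq, n_heads, head_dim]; rotate interleaved pairs (x1, x2).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = cos[:, None, :], sin[:, None, :]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

cos, sin = build_rope_cache(seq_len=128, head_dim=64)
q = torch.randn(128, 8, 64)
q_rot = apply_rope(q, cos, sin)
```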

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-12-11] 🚨 RELEASE: v2.10
  • Details:
    • PyTorch: Added an NVFP4 training recipe for GroupedLinear (see the sketch after this list). CUDA Graphs are now supported for quantized weights with Tensor Parallelism. Added Sliding Window Attention (SWA) with Context Parallelism.
    • JAX: Added support for concurrent Data Parallelism (DP) and Fully-Sharded Data Parallelism (FSDP).
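
A hedged sketch of quantized GroupedLinear usage, shown with the long-standing DelayedScaling FP8 recipe; per the v2.10 notes the new NVFP4 recipe plugs into the same fp8_recipe slot (its exact class name is not shown here). GroupedLinear batches the per-expert GEMMs of an MoE layer, with m_splits giving the token count routed to each expert:

```python
# One GroupedLinear replaces num_experts separate per-expert Linears.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

num_experts, hidden, ffn = 4, 1024, 4096
grouped = te.GroupedLinear(num_experts, hidden, ffn, bias=False,
                           params_dtype=torch.bfloat16).cuda()

tokens = torch.randn(512, hidden, device="cuda", dtype=torch.bfloat16)
m_splits = [128, 128, 128, 128]  # tokens per expert; must sum to 512

recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = grouped(tokens, m_splits)
```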

JAX / XLA Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-12-18] 🚨 RELEASE: jax-v0.8.2
  • Details:
    • Deprecated jax.lax.pvary and all symbols in jax.interpreters.pxla. Tracer objects no longer inherit from jax.Array at runtime (illustrated in the sketch below).
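
A minimal sketch of why the Tracer change is user-visible: inside jit, values are Tracers, and they no longer pass isinstance checks against jax.Array:

```python
# Per the jax-v0.8.2 notes, Tracers no longer subclass jax.Array, so
# isinstance-based dispatch inside jitted code stops matching them.
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # Previously True while tracing; now False under jax-v0.8.2.
    print("traced value is a jax.Array:", isinstance(x, jax.Array))
    return x * 2

f(jnp.ones(3))
```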

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-12-12] 🚨 RELEASE: maxtext-tutorial-v1.4.0
    • [2025-12-30] 🚨 RELEASE: maxtext-tutorial-v1.5.0
  • Details:
    • Added Muon optimizer integration. Merged packing support for Context Parallelism (Ring Attention).
  • Metrics: 131 PRs, 12 issues (healthy)

openxla/xla

  • Key Activity:
    • [2025-12-01 to 12-31] High volume optimization and automated refactoring.
  • Details:
    • Tracking a severe segmentation fault in the HLO HTML dumper caused by stale fusion state.
  • Metrics: 1213 PRs, 16 issues (highly active: 1143 PRs closed)

Serving, Inference & RLHF Ecosystem

sgl-project/sglang

  • Key Activity:
    • [2025-12-03] 🚨 RELEASE: v0.5.6
    • [2025-12-10] 🚨 RELEASE: gateway-v0.2.4
    • [2025-12-24] 🚨 RELEASE: gateway-v0.3.0
  • Details:
    • Gateway v0.3.0: A major architectural shift: released the Unified Inference Gateway (IGW) mode to manage entire fleets from a single gateway, added UUID-based worker resource management, and completely redesigned the six-layer metrics architecture.
    • Gateway v0.2.4: Industry-first WASM middleware support and full OpenTelemetry integration.
    • Engine v0.5.6: Zero-day support for DeepSeek V3.2 / Speciale (see the serving sketch after this list). Integrated JIT kernels. Support for new blockwise diffusion language models (Flux2, Z-Image).
  • Metrics: Very high PR volume (300+ merged across the releases)
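
For reference, a hedged sketch of serving such a model offline through SGLang's Python Engine API; the model path and tp_size are illustrative, and the dict-based sampling params follow the engine's documented style:

```python
# Offline batch generation with the sglang Engine.
import sglang as sgl

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V3.2-Exp", tp_size=8)
outputs = llm.generate(
    ["Explain expert parallelism in one sentence."],
    {"temperature": 0.0, "max_new_tokens": 64},
)
print(outputs[0]["text"])
```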

THUDM/slime

  • Key Activity:
    • [2025-12-01] 🚨 RELEASE: v0.2.0.post1
    • [2025-12-12] 🚨 RELEASE: v0.2.1
  • Details:
    • Added true on-policy training on Qwen3-VL (dense) using VLM + FSDP.
    • Integrated PD-disaggregation support during rollout and DP-attention support in rollout routing replay (R3). Upgraded backend dependency to SGLang v0.5.6.

xdit-project/xDiT

  • Key Activity:
    • [2025-12-01 to 12-31] Model and backend expansion.
  • Details:
    • Added support for HunyuanVideo-1.5 and Z-Image Turbo. Integrated custom attention backend support.
  • Metrics: 7 PRs, 3 issues

Compilers & Low-Level Optimizations

tile-ai/tilelang

  • Key Activity:
    • [2025-12-07] 🚨 RELEASE: v0.1.7
    • [2025-12-24] 🚨 RELEASE: v0.1.7.post1
    • [2025-12-31] 🚨 RELEASE: v0.1.7.post2
  • Details:
    • Hardware Adapters: Enabled FlashAttention-2 forward on AMD MI300X, fixed MLA autotune for ROCm, updated CI to ROCm 7.1, and introduced Huawei Ascend support.
    • Core Enhancements: Added a CuTeDSL backend, integrated Z3 into the TVM Arith Analyzer, added JIT lazy execution, and supported FP8-to-FP32 vectorized casts (see the sketch after this list).
  • Metrics: 173 PRs, 55 issues (healthy: 166 PRs closed)
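
As a conceptual illustration of the FP8-to-FP32 cast item (plain PyTorch semantics, not TileLang's vectorized kernel): the narrowing cast rounds to the nearest representable float8 value, and the widening cast back to FP32 is exact:

```python
# FP8 (e4m3fn) round-trip: narrowing is lossy, widening is exact.
import torch

x = torch.tensor([0.1, 1.5, 448.0])   # 448 is the e4m3fn max normal
x_fp8 = x.to(torch.float8_e4m3fn)     # lossy narrowing cast
x_fp32 = x_fp8.to(torch.float32)      # exact widening cast
print(x_fp32)
```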

triton-lang/triton

  • Key Activity:
    • [2025-12-01 to 12-31] General compiler passes and hardware specific lowering.
  • Details:
    • The AMD WarpPipeliner gained support for single-block execute regions in UpdateAsyncWaitCount. Tracking frontend bugs that produce invalid IR when SSA values appear in constant-range loops.
  • Metrics: 238 PRs, 33 issues (healthy: 232 PRs closed)

deepspeedai/DeepSpeed

  • Key Activity:
    • [2025-12-09] 🚨 RELEASE: v0.18.3
  • Details:
    • Added Muon optimizer support, allowing separate "muon_lr" and "adam_lr" learning rates (see the config sketch below).
    • Relaxed tolerances for FP8 unit tests specifically for ROCm (FP16 and BF16 cases). Added Qwen2.5 to the AutoTP model list.
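
A hedged sketch of what the split learning rates might look like in a DeepSpeed config; the "muon_lr"/"adam_lr" keys come from the v0.18.3 notes, while the surrounding schema is ordinary DeepSpeed config structure (the optimizer type string and exact params are assumptions):

```python
# Hypothetical DeepSpeed config using the new Muon support.
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "Muon",
        "params": {
            "muon_lr": 2e-2,  # Muon path (typically 2-D weight matrices)
            "adam_lr": 3e-4,  # Adam fallback (embeddings, norms, biases)
        },
    },
}
# Passed as usual, e.g. deepspeed.initialize(model=..., config=ds_config)
```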