revert: remove torch.compile (dynamic batch shapes trigger repeated recompilation, making it slower than running uncompiled)

Sequence packing gives every batch a different sequence length, so the CUDA Graphs have to be recompiled repeatedly.
Flash Attention + FP16 is currently the best combination (94.5 s, score 56.98).
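
To illustrate why packed batches defeat mode="reduce-overhead", here is a minimal sketch that counts how many distinct input shapes a run of batches produces. It assumes that each new shape forces a fresh compile or graph capture. The helper names (round_up, distinct_shapes) and the batch lengths are hypothetical, chosen for illustration only; they are not measurements from this repository. Padding to a fixed multiple is shown purely as a comparison point, not as something this commit does.

```python
def round_up(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple (hypothetical bucketing helper)."""
    return ((length + multiple - 1) // multiple) * multiple


def distinct_shapes(seq_lens, multiple=1):
    """Count distinct padded lengths; each one is a separate input shape."""
    return len({round_up(n, multiple) for n in seq_lens})


# Hypothetical packed-batch lengths, for illustration only (not measured data).
packed_lens = [1873, 2011, 1990, 2048, 1765, 1930, 2002, 1811]

# Unpadded: every batch has its own length, so every batch is a new shape.
print(distinct_shapes(packed_lens))       # 8

# Padded to multiples of 256: the same batches collapse into two shapes.
print(distinct_shapes(packed_lens, 256))  # 2
```

With one new shape per batch, the compile cost is paid on almost every step and never amortized, which is consistent with the slowdown described above.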
2026-06-12 22:02:40 +08:00
parent c5fee2da9b
commit bc6e8307c5
-5
@@ -511,11 +511,6 @@ def load_model(ckpt_path, device='cuda:0'):
     model.to(dev)
     model.eval()
-    # === torch.compile: operator fusion + reduced kernel launch overhead ===
-    model = torch.compile(model, mode="reduce-overhead")
-    print("[INFO] torch.compile applied (mode=reduce-overhead)")
     print(f"[INFO] Model ready. Device: {dev}")
     return model, dev