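revert: remove torch.compile (dynamic batch shapes trigger repeated recompilation, making it slower than running uncompiled)
Sequence packing gives every batch a different sequence length, so the CUDA Graph has to be recompiled over and over. Flash Attention + FP16 is currently the best combination (94.5 s, score 56.98).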
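
The effect described above can be illustrated with a small, framework-free sketch. It is not code from this repository: `pack_sequences` and `count_shape_misses` are hypothetical helpers, the sequence lengths are synthetic, and the sketch assumes that every previously unseen packed length costs one recompilation (or graph capture), which is a simplification of what a shape-specialized compiler actually does.

```python
import random


def pack_sequences(lengths, max_tokens):
    """Greedy packing: fill a batch until the next sequence would overflow it."""
    batches, current, used = [], [], 0
    for n in lengths:
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


def count_shape_misses(batches):
    """Count batches whose total packed length has not been seen before."""
    seen = set()
    misses = 0
    for batch in batches:
        total = sum(batch)
        if total not in seen:
            seen.add(total)
            misses += 1
    return misses


# Synthetic data (assumption): 2000 sequences of 16 to 512 tokens each.
rng = random.Random(0)
lengths = [rng.randint(16, 512) for _ in range(2000)]

batches = pack_sequences(lengths, max_tokens=4096)
misses = count_shape_misses(batches)
print(f"{len(batches)} batches, {misses} distinct packed lengths")
```

If most batches end up with a packed length that has not been seen before, a compiler that specializes on input shape gets little reuse from earlier compilations, which is consistent with the slowdown reported in this commit.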
@@ -511,11 +511,6 @@ def load_model(ckpt_path, device='cuda:0'):
 
     model.to(dev)
     model.eval()
-
-    # === torch.compile: operator fusion + reduced kernel launch overhead ===
-    model = torch.compile(model, mode="reduce-overhead")
-    print("[INFO] torch.compile applied (mode=reduce-overhead)")
-
     print(f"[INFO] Model ready. Device: {dev}")
 
     return model, dev