feat: Flash Attention + torch.compile(第二版优化方案)

- scaled_dot_product 替换为 F.scaled_dot_product_attention(自动启用 Flash Attention)
- load_model 中添加 torch.compile(mode='reduce-overhead')
- build_env.sh: 预热 torch inductor,避免编译耗时计入推理
This commit is contained in:
2026-06-12 21:39:43 +08:00
parent 97c4cc84a0
commit 574399e8ac
2 changed files with 28 additions and 9 deletions
+12 -2
View File
@@ -1,7 +1,17 @@
#!/bin/bash
set -e
# 安装 Python 依赖(评测系统使用阿里云 PyPI 镜像)
pip install -r requirements.txt
# 预热 torch inductor,避免推理时编译
python -c "
import torch
@torch.compile(mode='reduce-overhead')
def _warmup(x):
return x * 2
x = torch.randn(100, 100, device='cuda')
_warmup(x)
print('Inductor cache ready')
"
echo "build env success"