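feat: interface alignment + FP16 quantization (first-pass optimization plan)
- CTRUserDataset → CTRTestSeqDataset, with constructor parameters aligned to the evaluation interface
- load_model signature fixed: ckpt_path is now the first parameter
- FP16 quantization: model.half(), with the Embedding kept in FP32
- move_batch_to_device converts FP32 → FP16 automatically
- Batches are pre-converted to FP16 when cached, reducing inference-loop overhead
- requirements.txt trimmed (nvidia-* packages removed)
- build_env.sh standardized (set -e + pip install)
- CLAUDE.md updated with development commands, code architecture, and key interface notes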
@@ -1,64 +1,133 @@
|
|||||||
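# Baidu Commercial AI Technology Innovation Competition: Inference Performance Optimization for Generative-Recommendation Ad Ranking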
|
# CLAUDE.md
|
||||||
|
|
||||||
## Competition info
|
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||||
|
|
||||||
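- **Full name**: Baidu Commercial AI Technology Innovation Competition (CTI) 2026
|
## Project overview
|
||||||
- **Task**: Inference performance optimization for generative-recommendation ad ranking
|
|
||||||
- **Organizers**: Baidu Commercial / Baidu PaddlePaddle / NVIDIA (technical partner)
|
|
||||||
- **Platform**: [AI Studio](https://aistudio.baidu.com/competition/detail/1461)
|
|
||||||
- **Official site**: http://cti.baidu.com
|
|
||||||
- **Prize pool**: ¥190,000 (including an NV-DGX-Spark)
|
|
||||||
- **Registration deadline**: 2026/06/26 11:59:59
|
|
||||||
- **Summer-camp finals**: July 2026 (4 days, 3 nights; travel, meals, and lodging covered)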
|
|
||||||
|
|
||||||
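## Core task
|
Baidu Commercial AI Technology Innovation Competition (CTI) 2026: **inference performance optimization for generative-recommendation ad ranking**.
|
||||||
|
|
||||||
Given a Transformer-based generative-recommendation ad-ranking model (GRAB), push inference performance as far as possible **without changing the model structure or training on the test set**.
|
Goal: given the GRAB Transformer model, push inference performance as far as possible **without changing the model structure or training on the test set**. Quantization, sparsification, and pruning are explicitly allowed.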
|
||||||
|
|
||||||
### Dual-threshold scoring
|
## Environment and common commands
|
||||||
|
|
||||||
| Dimension | Requirement | Consequence if not met |
|
```powershell
|
||||||
|------|------|------------|
|
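# Activate the virtual environment
|
||||||
| Inference efficiency | Pure inference ≤ 5 min, environment build ≤ 20 min | Total score 0 |
|
.\.venv\Scripts\Activate.ps1
|
||||||
| Strategy effectiveness | AUC ≥ 0.65, PCOC ∈ [0.85, 1.15] | Total score 0 |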
|
|
||||||
|
|
||||||
### Submission format
|
# Run inference locally (requires dataset/ and ckpt.pt)
|
||||||
|
.\.venv\Scripts\python.exe 代码\code\infer.py
|
||||||
|
.\.venv\Scripts\python.exe 代码\code\infer.py --ckpt path/to/ckpt.pt
|
||||||
|
|
||||||
`xxx.zip` contains:
|
# AI Studio SDK (download datasets, submit)
|
||||||
- `infer.py`: inference entry script
|
.\.venv\Scripts\aistudio.exe download --dataset <id> --local_dir ./dataset --token <token>
|
||||||
- `build_env.sh`: environment build script
|
.\.venv\Scripts\aistudio.exe download --model <id> --local_dir . --token <token>
|
||||||
- `requirements.txt`: Python dependencies
|
|
||||||
- Optional: a packaged Python environment, quantized model files, etc.
|
|
||||||
|
|
||||||
**Note**: do not include the dataset folder, and do not modify the model weight parameters
|
# Package the submission
|
||||||
|
cd 代码/code && zip -r ../../submit.zip infer.py requirements.txt build_env.sh
|
||||||
|
```
|
||||||
|
|
||||||
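### Constraints
|
The local environment installs only `numpy` + `tqdm` + `aistudio-sdk` (lightweight); the full PyTorch dependencies are listed in `代码/code/requirements.txt`, and training/inference run on the server.
|
||||||
|
|
||||||
- No strategic changes to the network architecture
|
## Code architecture
|
||||||
- No training on the test set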
|
|
||||||
|
```
|
||||||
|
infer.py (single file, ~730 lines, all logic lives here)
|
||||||
|
├── Data loading layer
|
||||||
|
│ ├── _detect_has_clk(): detect whether the CSV has a clk column
|
||||||
|
│ ├── load_sample_files(): load CSV → item_dict + user_seq
|
||||||
|
│ ├── load_logids_from_file(): quickly extract every logid in a file
|
||||||
|
│ └── CTRUserDataset(Dataset): CTR dataset organized by user
|
||||||
|
│ └── make_collate_fn(): concatenate user samples into a batch (including slot-feature expansion)
|
||||||
|
├── Model layer
|
||||||
|
│ ├── RepEncoder: Slot-wise Embedding → LayerNorm → Linear
|
||||||
|
│ │ └── Embedding(5M vocab, 512d) × 28 slots → segment_reduce(sum) → concat
|
||||||
|
│ ├── TransformerEncoder (8 layers)
|
||||||
|
│ │ ├── QKV Projection → Multi-Head Attention (scaled_dot_product)
|
||||||
|
│ │ ├── SMoE FFN (8 experts, Top-2 gating, independent per layer)
|
||||||
|
│ │ └── Pre-LayerNorm + Residual
|
||||||
|
│ ├── CTRModel: RepEncoder + Transformer → Linear → logit
|
||||||
|
│ │ └── Causal mask: tokens of the same user are causally masked; different users are isolated
|
||||||
|
│ └── load_model(ckpt_path, device): model construction + weight loading entry point
|
||||||
|
├── Inference loop (main)
|
||||||
|
│ ├── Data loading (prefers cached shard_*.pt)
|
||||||
|
│ ├── Per-batch inference + timing (only model(batch) time is counted)
|
||||||
|
│ └── Write predict.txt in test.csv order
|
||||||
|
└── Scoring utility
|
||||||
|
└── _cal_score(): AUC + PCOC + latency → score_all
|
||||||
|
```
|
||||||
|
|
||||||
|
**Model parameter scale**: Embedding 5M×512 + 8-layer Transformer (d_model=512, n_heads=8, dim_ff=1024) × MoE (8 experts), quoted as roughly 6.5M to 11.3M parameters.
|
||||||
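As a rough sanity check on the dimensions quoted above, the sketch below counts weights under stated assumptions: a single shared embedding table of 5M × 512 (the `× 28 slots` in the tree could also mean per-slot tables), four 512×512 attention projections per layer, experts made of two weight matrices each, and no biases, LayerNorm, or router weights. None of these assumptions is confirmed by the source; this is only arithmetic on the quoted numbers.

```python
# Back-of-the-envelope weight count from the dimensions quoted above.
# Assumptions (not confirmed by the source): one shared embedding table,
# Q/K/V/output projections of d_model x d_model, each expert is a
# two-matrix FFN, and biases / LayerNorm / router weights are ignored.

VOCAB = 5_000_000
D_MODEL = 512
DIM_FF = 1024
N_LAYERS = 8
N_EXPERTS = 8

embedding_params = VOCAB * D_MODEL
attn_params_per_layer = 4 * D_MODEL * D_MODEL
expert_params = 2 * D_MODEL * DIM_FF
dense_params = N_LAYERS * (attn_params_per_layer + N_EXPERTS * expert_params)


def gib(n_values: int, bytes_per_value: int) -> float:
    """Size in GiB of n_values stored at bytes_per_value each."""
    return n_values * bytes_per_value / 2**30


print(f"embedding table: {embedding_params:,} values")
print(f"  FP32: {gib(embedding_params, 4):.2f} GiB, FP16: {gib(embedding_params, 2):.2f} GiB")
print(f"transformer weights: {dense_params:,} values")
```

Under these assumptions the embedding table dominates by a wide margin, and both totals land well above the 6.5M to 11.3M range quoted above. That range presumably counts something narrower, or the real dimensions differ, so it is worth checking against the checkpoint before relying on either number.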
|
|
||||||
|
## Key interfaces (evaluation-system calling contract)
|
||||||
|
|
||||||
|
The evaluation system loads the code via `from infer import ...`. The interfaces below **must** match (per `代码/任务提交接口说明.md`):
|
||||||
|
|
||||||
|
| Interface | Signature | Notes |
|
||||||
|
|------|------|------|
|
||||||
|
| `load_sample_files` | `(sample_files_list: List[Path]) -> (item_dict, user_seq)` | Data loading |
|
||||||
|
| `CTRTestSeqDataset` | `(test_logids_ordered, item_dict, user_seq, max_feasign_per_slot, max_ctx_len)` | **Must expose a `max_slot_id` attribute** |
|
||||||
|
| `make_collate_fn` | `(max_slot_id) -> Callable` | `collate_fn` for the DataLoader |
|
||||||
|
| `load_model` | `(ckpt_path: Path) -> (model, device)` | The first argument is a Path |
|
||||||
|
| `move_batch_to_device` | `(batch, device) -> batch` | |
|
||||||
|
| `model(batch)` | `-> (logits, moe_loss)` | `sigmoid(logits)` is the click probability |
|
||||||
|
|
||||||
|
**Fatal mismatches** (currently present in the baseline `infer.py`; must be fixed before submitting):
|
||||||
|
1. Class name `CTRUserDataset` → should be `CTRTestSeqDataset`
|
||||||
|
2. Constructor parameter `pred_logids` → should be `test_logids_ordered`; `max_ctx_len` is missing
|
||||||
|
3. `load_model(device='cuda:0', ckpt_path=None)` → should be `load_model(ckpt_path, device='cuda:0')` (Path as the first argument)
|
||||||
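One way to catch mismatches like these before spending a submission is to compare signatures with `inspect`. The sketch below is a minimal, hypothetical check: the `EXPECTED_PARAMS` table is derived from the interface table above, `check_interface` is an illustrative helper that does not exist in the repo, and the stub definitions only stand in for the real `infer.py` objects.

```python
import inspect

# Expected leading parameter names, taken from the interface table above.
EXPECTED_PARAMS = {
    "load_sample_files": ["sample_files_list"],
    "CTRTestSeqDataset": ["test_logids_ordered", "item_dict", "user_seq",
                          "max_feasign_per_slot", "max_ctx_len"],
    "make_collate_fn": ["max_slot_id"],
    "load_model": ["ckpt_path"],
    "move_batch_to_device": ["batch", "device"],
}


def check_interface(namespace):
    """Return a list of mismatch descriptions (hypothetical helper)."""
    problems = []
    for name, expected in EXPECTED_PARAMS.items():
        obj = namespace.get(name)
        if obj is None:
            problems.append(f"{name}: missing")
            continue
        # For a class, inspect.signature reports __init__ without `self`.
        params = list(inspect.signature(obj).parameters)
        if params[:len(expected)] != expected:
            problems.append(f"{name}: expected {expected}, got {params}")
    return problems


# Stand-ins that mimic the baseline signatures listed as mismatches.
def load_model(device="cuda:0", ckpt_path=None):
    ...


class CTRUserDataset:
    def __init__(self, item_dict, user_seq=None,
                 max_feasign_per_slot=None, pred_logids=None):
        ...


baseline = {"load_model": load_model, "CTRUserDataset": CTRUserDataset}
for line in check_interface(baseline):
    print(line)
```

Against the real module, the same helper could be fed `vars(infer)` after `import infer`, assuming the import has no heavy side effects.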
|
|
||||||
|
## Submission specification
|
||||||
|
|
||||||
|
### Archive structure
|
||||||
|
```
|
||||||
|
submit.zip
|
||||||
|
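├── infer.py # required; implements all interfaces above
|
||||||
|
├── requirements.txt # optional; installed from the Alibaba Cloud PyPI mirror
|
||||||
|
└── build_env.sh # optional; 720 s timeout, a non-zero exit means failure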
|
||||||
|
```
|
||||||
|
|
||||||
|
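### Hard constraints (any violation → total score 0)
|
||||||
|
- Inference time < 300 s (only `model(batch)` is timed, accumulated batch by batch)
|
||||||
|
- AUC ∈ [0.65, 1.0], PCOC ∈ [0.85, 1.15]
|
||||||
|
- The archive must **not** contain `dataset/` or `ckpt.pt`
|
||||||
|
- The archive extension must be `.zip`, `.tar.gz`, or `.tar`, and the files must sit at the root after extraction
|
||||||
- At most 10 submissions per day
|
- At most 10 submissions per day
|
||||||
|
|
||||||
## Technical background
|
### Total score formula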
|
||||||
|
```
|
||||||
|
score_latency = max(0, (300 - latency) / 300)
|
||||||
|
score_model = ((AUC - 0.65) * 1000 + (0.15 - |PCOC - 1|) / 0.15 * 10) / 360
|
||||||
|
score_all = score_latency * 70 + score_model * 30
|
||||||
|
```
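To make the formula concrete, here is a small sketch that evaluates it in Python. `score_all` is an illustrative function name (the repo's own helper is `_cal_score()`, whose exact signature is not shown here), and gating the result on the hard constraints is an assumption based on the "any violation → total score 0" rule.

```python
def score_all(latency_s: float, auc: float, pcoc: float) -> float:
    """Evaluate the total-score formula; return 0 if a hard constraint fails."""
    within_limits = (
        latency_s < 300
        and 0.65 <= auc <= 1.0
        and 0.85 <= pcoc <= 1.15
    )
    if not within_limits:
        return 0.0
    score_latency = max(0.0, (300 - latency_s) / 300)
    score_model = ((auc - 0.65) * 1000 + (0.15 - abs(pcoc - 1)) / 0.15 * 10) / 360
    return score_latency * 70 + score_model * 30


# Baseline figures quoted in this file: 229 s, AUC 0.759, PCOC 1.110.
print(round(score_all(229, 0.759, 1.110), 2))
# Same accuracy at the ~120 s latency the FP16 step is aiming for.
print(round(score_all(120, 0.759, 1.110), 2))
```

With the baseline inputs this evaluates to about 25.87, close to the reported 25.85; the small gap is presumably rounding in the quoted inputs. Holding accuracy fixed, the latency term carries most of the upside, which is consistent with the roadmap putting speed-ups first.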
|
||||||
|
|
||||||
Based on two core papers:
|
## Optimization roadmap (from `推理优化方案.md`)
|
||||||
|
|
||||||
1. **GRAB** (Baidu, 2026): the competition baseline model
|
Baseline figures: inference 229 s, AUC 0.759, PCOC 1.110, score 25.85.
|
||||||
- arXiv: 2602.01865
|
|
||||||
- Core ideas: CamA multi-channel attention + STS two-stage training
|
|
||||||
- Model size: roughly 6.5M to 11.3M parameters
|
|
||||||
|
|
||||||
2. **HSTU** (Meta, 2024): the architectural basis of GRAB
|
1. **Interface alignment** (must come first): confirm the code runs end to end in the evaluation system (score > 0)
|
||||||
- arXiv: 2402.17152 (ICML 2024)
|
2. **FP16 quantization**: `model.half()` with the Embedding kept in FP32; expected 229 s → ~120 s
|
||||||
- Core ideas: Pointwise Aggregated Attention + operator fusion
|
3. **Flash Attention**: replace `scaled_dot_product` with `F.scaled_dot_product_attention`; mathematically equivalent
|
||||||
- 5.3x to 15.2x faster than a FlashAttention2-based Transformer
|
4. **torch.compile**: `mode="reduce-overhead"` → `"max-autotune"`, warmed up in build_env.sh
|
||||||
|
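5. **Data-flow optimization**: pre-convert to FP16 at cache time + pre-move to GPU
|
||||||
|
6. **MoE optimization**: measure expert load, then merge or remove rarely used experts
|
||||||
|
7. **INT8 quantization** (optional): higher accuracy risk; try only if the earlier steps are not enough
|
||||||
|
|
||||||
## Inference optimization directions (by priority)
|
CUDA Graph was evaluated and dropped (batch shapes are not fixed, so it does not apply).
|
||||||
|
|
||||||
1. **Model quantization**: FP16/INT8, Paddle-TensorRT
|
After each step, submit to AI Studio for validation; roll back immediately if AUC/PCOC miss the thresholds.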
|
||||||
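For roadmap step 6, measuring expert load amounts to counting how often each expert lands in a token's top-2. A minimal pure-Python sketch follows; `expert_load` and the toy router logits are made up for illustration, and in the real model the logits would presumably come from each layer's gating network.

```python
from collections import Counter


def expert_load(router_logits, top_k=2):
    """Count how many tokens route to each expert under top-k gating.

    router_logits: one list of per-expert scores for each token.
    """
    counts = Counter()
    for scores in router_logits:
        # Indices of the top_k highest-scoring experts for this token.
        chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
        counts.update(chosen)
    return counts


# Toy example: 4 tokens, 4 experts (the real model uses 8 experts per layer).
logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 0.3, -0.5],
    [0.9, 2.1, -0.2, 0.0],
    [3.0, 0.5, 0.4, 0.1],
]
load = expert_load(logits)
total = sum(load.values())
for expert in range(4):
    print(f"expert {expert}: {load[expert] / total:.0%} of routed slots")
```

Experts whose share stays near zero across representative data would be the candidates for merging or removal, subject to the AUC/PCOC validation required after every step.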
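2. **Flash Attention**: reduce attention memory and compute
|
|
||||||
3. **Operator fusion**: reduce kernel-launch overhead
|
## Key files
|
||||||
4. **Sequence trimming**: compress or truncate redundant history tokens
|
|
||||||
5. **Channel merging**: prune or share CamA channels
|
| Path | Purpose |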
|
||||||
|
|------|------|
|
||||||
|
| `代码/code/infer.py` | Main inference script (the core file of the submission) |
|
||||||
|
| `代码/code/requirements.txt` | Server-side dependencies (torch 2.6.0 + CUDA 12.4) |
|
||||||
|
| `代码/code/build_env.sh` | Environment build script (currently a stub) |
|
||||||
|
| `代码/任务提交接口说明.md` | Official interface specification |
|
||||||
|
| `推理优化方案.md` | Full optimization plan (including compliance review) |
|
||||||
|
| `论文/GRAB_*.pdf` | GRAB paper (baseline model) |
|
||||||
|
| `论文/HSTU_*.pdf` | HSTU paper (architectural basis) |
|
||||||
|
| `.gitignore` | Excludes ckpt.pt, dataset/, *.zip, .venv/ |
|
||||||
|
|
||||||
## Submission log
|
## Submission log
|
||||||
|
|
||||||
|
|||||||
@@ -1,4 +1,7 @@
|
|||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# Install Python dependencies (the evaluation system uses the Alibaba Cloud PyPI mirror)
|
||||||
|
pip install -r requirements.txt
|
||||||
|
|
||||||
echo "build env succeess"
|
echo "build env success"
|
||||||
|
|||||||
+28
-13
@@ -118,15 +118,17 @@ def load_logids_from_file(file_path):
|
|||||||
return logids
|
return logids
|
||||||
|
|
||||||
|
|
||||||
class CTRUserDataset(Dataset):
|
class CTRTestSeqDataset(Dataset):
|
||||||
"""CTR dataset organized by user"""
|
"""CTR test dataset organized by user (aligned with the evaluation interface)"""
|
||||||
|
|
||||||
def __init__(self, item_dict, user_seq=None, max_feasign_per_slot=None, pred_logids=None):
|
def __init__(self, test_logids_ordered, item_dict, user_seq=None,
|
||||||
|
max_feasign_per_slot=None, max_ctx_len=None):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.item_dict = item_dict
|
self.item_dict = item_dict
|
||||||
self.user_seq = user_seq if user_seq else {}
|
self.user_seq = user_seq if user_seq else {}
|
||||||
self.max_feasign_per_slot = max_feasign_per_slot
|
self.max_feasign_per_slot = max_feasign_per_slot
|
||||||
self.pred_logids = pred_logids if pred_logids is not None else set()
|
self.max_ctx_len = max_ctx_len
|
||||||
|
self.pred_logids = set(test_logids_ordered) if test_logids_ordered else set()
|
||||||
|
|
||||||
self.user_items = defaultdict(list)
|
self.user_items = defaultdict(list)
|
||||||
for logid, rec in item_dict.items():
|
for logid, rec in item_dict.items():
|
||||||
@@ -236,7 +238,11 @@ def move_batch_to_device(batch, device):
|
|||||||
elif isinstance(batch, (list, tuple)):
|
elif isinstance(batch, (list, tuple)):
|
||||||
return [move_batch_to_device(x, device) for x in batch]
|
return [move_batch_to_device(x, device) for x in batch]
|
||||||
elif torch.is_tensor(batch):
|
elif torch.is_tensor(batch):
|
||||||
return batch.to(device)
|
x = batch.to(device)
|
||||||
|
# float32 tensors → FP16; integer tensors are left unchanged
|
||||||
|
if x.dtype == torch.float32:
|
||||||
|
x = x.half()
|
||||||
|
return x
|
||||||
else:
|
else:
|
||||||
return batch
|
return batch
|
||||||
|
|
||||||
@@ -443,12 +449,12 @@ class CTRModel(nn.Module):
|
|||||||
# Model loading entry point
|
# Model loading entry point
|
||||||
# ============================================================
|
# ============================================================
|
||||||
|
|
||||||
def load_model(device='cuda:0', ckpt_path=None):
|
def load_model(ckpt_path, device='cuda:0'):
|
||||||
"""Load the model and return it; called by evaluation.py.
|
"""Load the model and return it; called by evaluation.py.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
|
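ckpt_path: path to the checkpoint file (the evaluation system passes a Path object)
|
||||||
device: inference device (default 'cuda:0')
|
device: inference device (default 'cuda:0')
|
||||||
ckpt_path: path to the checkpoint file; defaults to ckpt.pt in the same directory as infer.py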
|
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
(model, device) tuple
|
(model, device) tuple
|
||||||
@@ -490,6 +496,11 @@ def load_model(device='cuda:0', ckpt_path=None):
|
|||||||
ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=False)
|
ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=False)
|
||||||
model.load_state_dict(ckpt['model_state_dict'])
|
model.load_state_dict(ckpt['model_state_dict'])
|
||||||
print(f"[INFO] Loaded checkpoint from {ckpt_path} (epoch={ckpt.get('epoch', '?')})")
|
print(f"[INFO] Loaded checkpoint from {ckpt_path} (epoch={ckpt.get('epoch', '?')})")
|
||||||
|
|
||||||
|
# === FP16 quantization: cast model parameters to half precision, keep the Embedding in FP32 ===
|
||||||
|
model = model.half()
|
||||||
|
model.rep_encoder.emb = model.rep_encoder.emb.to(torch.float32)
|
||||||
|
print("[INFO] Model converted to FP16 (embedding kept in FP32)")
|
||||||
else:
|
else:
|
||||||
print(f"[WARNING] Checkpoint {ckpt_path} not found, using random weights")
|
print(f"[WARNING] Checkpoint {ckpt_path} not found, using random weights")
|
||||||
|
|
||||||
@@ -616,10 +627,11 @@ def main():
|
|||||||
print(f'[INFO] Test pred logids count: {len(test_pred_logids)}')
|
print(f'[INFO] Test pred logids count: {len(test_pred_logids)}')
|
||||||
|
|
||||||
max_feasign_per_slot = {1: 2}
|
max_feasign_per_slot = {1: 2}
|
||||||
test_dataset = CTRUserDataset(
|
test_dataset = CTRTestSeqDataset(
|
||||||
item_dict, user_seq,
|
test_logids_ordered=list(test_pred_logids),
|
||||||
|
item_dict=item_dict,
|
||||||
|
user_seq=user_seq,
|
||||||
max_feasign_per_slot=max_feasign_per_slot,
|
max_feasign_per_slot=max_feasign_per_slot,
|
||||||
pred_logids=test_pred_logids,
|
|
||||||
)
|
)
|
||||||
print(f'[INFO] num_users={test_dataset.num_users}, '
|
print(f'[INFO] num_users={test_dataset.num_users}, '
|
||||||
f'total_samples={test_dataset.total_samples}, '
|
f'total_samples={test_dataset.total_samples}, '
|
||||||
@@ -634,9 +646,12 @@ def main():
|
|||||||
collate_fn=make_collate_fn(test_dataset.max_slot_id),
|
collate_fn=make_collate_fn(test_dataset.max_slot_id),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Collect batches and cache them in shards
|
# Collect batches, pre-convert them to FP16, then cache them in shards
|
||||||
print('[INFO] collecting batches and saving sharded cache...')
|
print('[INFO] collecting batches (pre-converting to FP16) and saving sharded cache...')
|
||||||
all_batches = [batch for batch in test_loader]
|
all_batches = []
|
||||||
|
for batch in test_loader:
|
||||||
|
batch = move_batch_to_device(batch, torch.device('cpu'))
|
||||||
|
all_batches.append(batch)
|
||||||
|
|
||||||
batches_cache_dir.mkdir(parents=True, exist_ok=True)
|
batches_cache_dir.mkdir(parents=True, exist_ok=True)
|
||||||
shard_idx = 0
|
shard_idx = 0
|
||||||
|
|||||||
@@ -1,29 +1,5 @@
|
|||||||
filelock==3.25.2
|
|
||||||
fsspec==2026.2.0
|
|
||||||
Jinja2==3.1.6
|
|
||||||
joblib==1.5.3
|
|
||||||
MarkupSafe==3.0.3
|
|
||||||
mpmath==1.3.0
|
|
||||||
networkx==3.4.2
|
|
||||||
numpy==2.2.6
|
|
||||||
nvidia-cublas-cu12==12.4.5.8
|
|
||||||
nvidia-cuda-cupti-cu12==12.4.127
|
|
||||||
nvidia-cuda-nvrtc-cu12==12.4.127
|
|
||||||
nvidia-cuda-runtime-cu12==12.4.127
|
|
||||||
nvidia-cudnn-cu12==9.1.0.70
|
|
||||||
nvidia-cufft-cu12==11.2.1.3
|
|
||||||
nvidia-curand-cu12==10.3.5.147
|
|
||||||
nvidia-cusolver-cu12==11.6.1.9
|
|
||||||
nvidia-cusparse-cu12==12.3.1.170
|
|
||||||
nvidia-cusparselt-cu12==0.6.2
|
|
||||||
nvidia-nccl-cu12==2.21.5
|
|
||||||
nvidia-nvjitlink-cu12==12.4.127
|
|
||||||
nvidia-nvtx-cu12==12.4.127
|
|
||||||
scikit-learn==1.7.2
|
|
||||||
scipy==1.15.3
|
|
||||||
sympy==1.13.1
|
|
||||||
threadpoolctl==3.6.0
|
|
||||||
torch==2.6.0
|
torch==2.6.0
|
||||||
tqdm==4.67.3
|
|
||||||
triton==3.2.0
|
triton==3.2.0
|
||||||
typing_extensions==4.15.0
|
numpy==2.2.6
|
||||||
|
scikit-learn==1.7.2
|
||||||
|
tqdm==4.67.3
|
||||||
|
|||||||