feat: movedev_rep, compute rep inside move_batch_to_device (untimed, main process, model and data both available); model skips embedding

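collate_rep regressed on the evaluation side (suspected cause: with num_workers>0 the worker subprocesses have no model).
move_batch_to_device is officially stated to be excluded from timing and is called in the main process before model(batch),
so CUDA, _MODEL_REF, and the batch data are all available. This avoids the three major pitfalls: data access, call order, and subprocesses.
rep is bitwise identical. Use bench --no-movedev-rep for a baseline comparison.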

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
OwnerSunshine530
2026-06-16 19:37:34 +08:00
parent e1ad26867e
commit 4ea6d57a07
2 changed files with 17 additions and 2 deletions
@@ -330,6 +330,8 @@ def _parse_args():
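                     help="Use the evaluation path: load_model streaming filter with automatic precompute (verified locally not to OOM)")
     ap.add_argument("--no-collate-rep", action="store_true",
                     help="Disable computing rep inside collate (for baseline comparison)")
+    ap.add_argument("--no-movedev-rep", action="store_true",
+                    help="Disable computing rep inside move_batch_to_device (for baseline comparison)")
     ap.add_argument("--profile", type=int, default=None, metavar="N",
                     help="Profile the first N batches and print an operator table sorted by CUDA time (to locate bottlenecks)")
     ap.add_argument("--rebuild", action="store_true", help="Force a rebuild of the filter cache")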
@@ -373,6 +375,8 @@ if __name__ == "__main__":
         cfg["eval_precompute"] = True
     if a.no_collate_rep:
         cfg["collate_rep"] = False
+    if a.no_movedev_rep:
+        cfg["movedev_rep"] = False
     if a.compile:
         cfg["compile"] = True
     if a.profile is not None:
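
The flag-to-config wiring in the two hunks above can be sketched in isolation as follows. This is a minimal illustration, not the file's actual code: parse_args, apply_overrides, and DEFAULTS are hypothetical names, and the assumption that both features default to True is not shown in the diff.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical stand-in for _parse_args, reduced to the two opt-out flags.
    ap = argparse.ArgumentParser()
    ap.add_argument("--no-collate-rep", action="store_true")
    ap.add_argument("--no-movedev-rep", action="store_true")
    return ap.parse_args(argv)


def apply_overrides(cfg, a):
    # Copy so the defaults stay untouched; flags only ever switch a feature off.
    cfg = dict(cfg)
    if a.no_collate_rep:
        cfg["collate_rep"] = False
    if a.no_movedev_rep:
        cfg["movedev_rep"] = False
    return cfg


# Assumed defaults: both rep paths enabled unless a flag disables them.
DEFAULTS = {"collate_rep": True, "movedev_rep": True}

cfg_off = apply_overrides(DEFAULTS, parse_args(["--no-movedev-rep"]))
cfg_on = apply_overrides(DEFAULTS, parse_args([]))
```

With store_true, argparse exposes --no-movedev-rep as a.no_movedev_rep and defaults it to False, so omitting the flag leaves the config unchanged.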