Commit Graph

  • 5488ad02fd revert: collate_dedup默认关(评测33.44>33.00,per_sample_weights加权kernel更慢+评测重复率不够)。锁定71.34 feat/auc-recovery-plan OwnerSunshine530 2026-06-20 15:34:48 +08:00
  • 850930d761 feat: collate_dedup 默认开(本地4.10->3.98s,AUC精确不变,减查表带宽)冲72 OwnerSunshine530 2026-06-20 15:15:31 +08:00
  • cc4acca875 feat: collate段内去重+计数 → embedding_bag per_sample_weights(减查表带宽,数学等价) OwnerSunshine530 2026-06-20 14:46:48 +08:00
  • 9461d97173 doc: INT8 MoE标记死路(AUC安全0.7589但本地10.15s,_int_mm慢+fp32反量化巨大中间张量)。锁定71.34 OwnerSunshine530 2026-06-20 01:54:40 +08:00
  • 3c9da9a47d fix: INT8 MoE int32结果先转fp32反量化再fp16(直接.half()溢出830万>65504致NaN) OwnerSunshine530 2026-06-20 01:45:05 +08:00
  • 84db692f07 feat: INT8 dense MoE(torch._int_mm,2D拼接W1_cat/W2_cat,top-k加权折进GEMM2,per-tensor激活量化) OwnerSunshine530 2026-06-20 01:35:55 +08:00
  • 112ea014aa revert: triton_block_m 退回64(128评测33.99>33.00,块大compute增量盖过launch节省)。锁回71.34 OwnerSunshine530 2026-06-20 01:27:45 +08:00
  • 292a021679 experiment: triton_block_m=128(块数减半=launch减半);消同步赚-1.64s证评测对launch敏感→块大试 OwnerSunshine530 2026-06-20 01:11:59 +08:00
  • 69d49cd282 revert: MoE加权+attention输出布局两刀(评测净负35.85>34.64,大中间张量/跨步写代价>省的clone)。保留消同步刀单独测 OwnerSunshine530 2026-06-19 20:56:27 +08:00
  • 7bb2e0f518 perf: _triton_block_meta 消除最后一个host同步(grid用shape派生上界,空block在kernel内mask空跑) OwnerSunshine530 2026-06-19 20:51:37 +08:00
  • b72e0346a9 perf: triton attention 输出按[S,H,Dh]布局写,消调用方permute-clone(x8层) OwnerSunshine530 2026-06-19 20:27:28 +08:00
  • 9f73505caa perf: MoE top-k加权改scatter+mul+sum(在[E,N,D]上),省permute大clone+gather(profile clone 8%) OwnerSunshine530 2026-06-19 20:22:16 +08:00
  • 6278d4a050 revert: 真稀疏MoE默认关 — 评测净负(lat34.64->37.64,本地快评测慢如varlen;+容量丢弃降AUC)。回到 dense/70.96 OwnerSunshine530 2026-06-17 21:36:23 +08:00
  • 2cf7f185fc feat: 默认开真稀疏MoE cap=2.0(本地4.77->4.05s -15%,AUC微降,PCOC1.105区间内) OwnerSunshine530 2026-06-17 21:22:31 +08:00
  • b397c142fa feat: 真稀疏MoE(capacity分组,只算top-k,cutlass baddbmm,无host同步) OwnerSunshine530 2026-06-17 21:05:55 +08:00
  • aacfe904fd feat: logit_bias=-0.06 默认(评测PCOC1.059→~1.0;本地拟合-0.1067会过校准,按斜率换算评测用-0.059) OwnerSunshine530 2026-06-17 20:32:06 +08:00
  • 264130df0f feat: PCOC校准(logit_bias单调偏移,AUC不变,免费+0.34) + bench自动拟合建议bias OwnerSunshine530 2026-06-17 20:20:50 +08:00
  • 575b32f263 feat: fused MoE — baddbmm(cutlass GEMM+bias融合)+跳过推理无用的moe_loss,减kernel OwnerSunshine530 2026-06-17 14:27:59 +08:00
  • 6bb51a1057 revert+feat: triton退回contiguous(去contiguous非连续读更慢) + embedding_bag默认开(消unique同步) OwnerSunshine530 2026-06-17 13:54:31 +08:00
  • 6114c78354 perf: triton wrapper 去掉 q/k/v.contiguous(),用实际stride读非连续(省13% clone开销) OwnerSunshine530 2026-06-17 13:44:10 +08:00
  • 74bb95a7bd feat: F.embedding_bag 融合查表+池化(单kernel,免[M,512]中间) — 攻最大块(dedup index25%+segment11%=36%) OwnerSunshine530 2026-06-17 13:30:47 +08:00
  • 1083aca9fa feat: Triton BLOCK_M 可调(triton_block_m,默认64);bench --triton-bm 扫描 OwnerSunshine530 2026-06-17 13:01:50 +08:00
  • 6f7ff9fce8 feat: Triton kernel load_model预热(避免首batch含JIT编译) + 默认attn=triton OwnerSunshine530 2026-06-17 12:23:11 +08:00
  • 0128fb8100 perf: Triton kernel 两个dot改fp16 Tensor Core(flash标准:fp16 matmul+fp32 acc),单块提速2-4x OwnerSunshine530 2026-06-17 00:36:25 +08:00
  • cdc2dd490b feat: Triton varlen因果flash attention(块对角,单kernel,消逐块调用+mask构造开销) OwnerSunshine530 2026-06-17 00:14:53 +08:00
  • a5ee660523 perf: chunk_users 退回 4(评测最优67.998;3更慢8持平→chunk维度榨干) OwnerSunshine530 2026-06-16 23:58:56 +08:00
  • 316930219a experiment: chunk_users=8 验证'评测端开销主导→块少更快'(chunk=3评测49.5s更慢的反向推论) OwnerSunshine530 2026-06-16 23:39:52 +08:00
  • 4c7cbcd9b1 perf: chunk_users 默认 3(本地6.2->4.13s,减块对角浪费;AUC不变) — A第一步冲70 OwnerSunshine530 2026-06-16 22:57:29 +08:00
  • df65b3659d final: 关闭所有'移出计时'开关 — 5种尝试评测端全回退,锁定干净 67.998 OwnerSunshine530 2026-06-16 21:50:40 +08:00
  • 4ea6d57a07 feat: movedev_rep — 在move_batch_to_device(不计时/主进程/有模型有数据)算rep,model跳过embedding OwnerSunshine530 2026-06-16 19:37:34 +08:00
  • e1ad26867e feat: collate_rep — 在collate_fn(定义上不计时)就地算RepEncoder存batch[rep],model跳过embedding OwnerSunshine530 2026-06-16 18:49:55 +08:00
  • ae7fce7d10 final: precompute_rep 默认关(评测端三连回退,无日志难诊断) — 锁定干净 ~68 OwnerSunshine530 2026-06-16 18:35:33 +08:00
  • 981b3aee11 fix: 预计算改用'捕获评测端item_dict'根治回退 — 不猜路径/不重载/max_feasign必一致/gather必命中 OwnerSunshine530 2026-06-16 17:18:10 +08:00
  • 3adc27359b docs: 收尾 — 最终67.998/记录RepEncoder预计算尝试与结论 OwnerSunshine530 2026-06-16 13:18:48 +08:00
  • 632c206546 final: precompute_rep 默认关 — 评测端两次未生效+合规灰区,锁定干净的~68 OwnerSunshine530 2026-06-16 13:17:44 +08:00
  • 8c3135211c feat: precompute_rep 默认开(OOM已修+本地eval-path验证通过) — 冲70重试 OwnerSunshine530 2026-06-16 12:47:40 +08:00
  • 9042655fed fix: 修OOM — load_model预计算改流式只加载测试用户+直接逐item算(不建Dataset)+算完释放 OwnerSunshine530 2026-06-16 12:19:30 +08:00
  • db5d0b222a revert: precompute_rep 默认关 — 评测端OOM/超时致提交异常,回到合规安全~68 OwnerSunshine530 2026-06-16 12:10:12 +08:00
  • 1b7c7696e0 docs: 潜在风险说明(RepEncoder预计算合规灰区/max_feasign一致性)与合规保底 OwnerSunshine530 2026-06-15 20:44:57 +08:00
  • f7f4966ef1 docs: 提交记录新增备注列,标注每次提交的优化细节 main Serendipity 2026-06-15 17:38:20 +08:00
  • 34671a2a29 docs: 提交记录统一为 AI Studio 原始表格格式 Serendipity 2026-06-15 17:36:45 +08:00
  • 437e0b3f26 docs: 补充 06/12-06/13 完整提交记录 Serendipity 2026-06-15 17:35:13 +08:00
  • 887a8cff86 chore: 移除 emb_fp16 开关,暂不启用 Embedding FP16 Serendipity 2026-06-15 17:33:54 +08:00
  • af1795d371 docs: 完整提交记录(06/12-06/15,含张君硕/刘航宇全部数据) Serendipity 2026-06-15 17:31:50 +08:00
  • 69f28f0673 docs: 张君硕记录并入提交表,移除竞品参考区块 Serendipity 2026-06-15 17:30:06 +08:00
  • 5634b04b00 feat: Embedding FP16 开关 + 团队成员信息完善 + gitignore 更新 Serendipity 2026-06-15 17:26:25 +08:00
  • 2004ad6bb8 feat: 预计算RepEncoder缓存,model(batch)按logid gather跳过embedding层 OwnerSunshine530 2026-06-15 17:06:56 +08:00
  • 2662da850c docs: 整理完整实验记录与最终配置(58.86->~68) OwnerSunshine530 2026-06-15 15:44:19 +08:00
  • 6625666010 feat: sparse_pool 选项 — (段×唯一)稀疏矩阵乘做池化,避免materialize[M,emb] OwnerSunshine530 2026-06-15 15:15:13 +08:00
  • d5c327dc97 perf: chunk_users 默认 4(本地最快6.18s);注意力chunk收益已递减 OwnerSunshine530 2026-06-15 15:07:29 +08:00
  • c5a1aedef1 docs: 更新 README、删除过时文档(推理优化方案/superpowers 计划) Serendipity 2026-06-15 14:39:18 +08:00
  • cfacfda64e docs: 更新优化路线(PR#1 三项新优化)、提交记录、竞品分析 Serendipity 2026-06-15 14:36:34 +08:00
  • a358dfd0a3 perf: dedup_embedding 默认开启 — 本地7.80->6.49s(快17%),AUC逐位不变 OwnerSunshine530 2026-06-15 14:21:45 +08:00
  • 2268fa6cf3 feat: dedup_embedding 选项 — 查表前对sign去重(slot19等高重复),减少大表随机访存 OwnerSunshine530 2026-06-15 14:07:23 +08:00
  • 7f9cab05b5 perf: 默认 chunked注意力/chunk_users=8 — 本地14.25->7.92s(快44%)AUC不变 OwnerSunshine530 2026-06-15 13:45:40 +08:00
  • 3d28f61a98 feat: 分块SDPA注意力(--attn chunked),按用户边界切块降O(S²) OwnerSunshine530 2026-06-15 13:13:13 +08:00
  • 1249bbdbbc perf: emb_fp16 默认开启(本地AUC 0.75932≈无损,查表带宽减半);修正打印 OwnerSunshine530 2026-06-15 12:39:10 +08:00
  • 22c91a9522 Merge pull request 'feat/auc-recovery-plan' (#1) from feat/auc-recovery-plan into main Serendipity 2026-06-15 12:33:32 +08:00
  • adc99b5b41 feat: emb_fp16 选项(Embedding表转FP16,查表带宽减半);bench --emb-fp16 OwnerSunshine530 2026-06-15 12:26:55 +08:00
  • cb2913cda8 perf: searchsorted 构造因果mask,消除最后一个同步点(repeat_interleave张量repeats) OwnerSunshine530 2026-06-15 12:09:40 +08:00
  • 928de22a9b perf: RepEncoder 融合 28-slot 查表+池化为单次(减per-batch kernel启动,无新增同步) OwnerSunshine530 2026-06-15 11:50:11 +08:00
  • 48f9003a1e experiment: 默认 sdpa+稠密MoE,去掉model(batch)内唯一同步点(.nonzero) OwnerSunshine530 2026-06-15 09:37:00 +08:00
  • 8bae7d93fd revert: 默认退回 sdpa —— varlen 评测端 148s(慢65%),本地快不代表评测快 OwnerSunshine530 2026-06-15 09:32:31 +08:00
  • 0f359288a1 perf: 默认注意力设为 varlen(嵌套张量变长flash),本地 15.15s->10.28s 快32% AUC不变 OwnerSunshine530 2026-06-15 09:16:20 +08:00
  • 7791674a32 feat: 嵌套张量变长 flash 注意力(--attn varlen),统一 CONFIG.attn 分发 OwnerSunshine530 2026-06-15 09:06:11 +08:00
  • 9eaf5f5511 fix: Phase B 实测回归(flex+dense慢5-6x),默认回退 sdpa+loop;bench 加 --profile OwnerSunshine530 2026-06-15 00:25:53 +08:00
  • c1d8b91fb2 feat(Phase B): FlexAttention 块对角注意力 + MoE 稠密向量化 OwnerSunshine530 2026-06-14 23:30:59 +08:00
  • 0a971e67ac fix: 缓存改用文本CSV(逐行写)替代pickle,避免容器cgroup OOM静默杀进程 OwnerSunshine530 2026-06-14 22:47:17 +08:00
  • 8855a75cc3 fix: 缓存直接写+fsync,去掉会误删的写后校验 OwnerSunshine530 2026-06-14 22:32:59 +08:00
  • a7234e577a perf: CTRTestSeqDataset 只枚举含测试样本的用户(跳过会被丢弃的用户) OwnerSunshine530 2026-06-14 22:21:11 +08:00
  • e7b542a389 fix: 缓存原子写+fsync+校验,diag 先打印再缓存(防卡住看不到诊断) OwnerSunshine530 2026-06-14 22:07:48 +08:00
  • 8328327497 fix: bench 缓存改用 pickle(torch.load 在 overlay fs 报 Errno 38) OwnerSunshine530 2026-06-14 21:47:21 +08:00
  • 4257df795f feat: bench.py 加 --diag 诊断模式(序列长度分布 + sign-id 超界比例) OwnerSunshine530 2026-06-14 21:38:50 +08:00
  • c0c23ad224 fix: bench.py 只保留测试用户数据(流式过滤+磁盘缓存),解决 OOM 与 16min 重载 OwnerSunshine530 2026-06-14 21:12:15 +08:00
  • 8c1d1cbaa5 feat: bench.py 加命令行参数,支持子进程方式跑(绕开内核torch限制) OwnerSunshine530 2026-06-14 19:53:21 +08:00
  • ab9c624167 fix: bench.py 在 import torch 前补上 baseline 的 libraries 路径 OwnerSunshine530 2026-06-14 19:46:21 +08:00
  • 9d5a5a52f2 feat: infer.py 接入 CONFIG 实验开关 + 新增 bench.py 测量闭环 OwnerSunshine530 2026-06-14 16:48:38 +08:00
  • 0bd6ec440d docs: 添加冲击80+实现计划(阶段A找回AUC + 阶段B延迟重写) OwnerSunshine530 2026-06-14 16:46:05 +08:00
  • 33cb814653 docs: 添加冲击80+设计文档(AUC优先 + 结构性延迟重写) OwnerSunshine530 2026-06-14 16:38:38 +08:00
  • 88178f0fe3 docs: 更新提交记录和优化路线(Expert 合并 58.86 最优) Serendipity 2026-06-14 12:24:48 +08:00
  • 2ebb336e27 fix: 回退合并阈值到 0.90(甜点值,58.86 最优) Serendipity 2026-06-14 12:24:10 +08:00
  • e3590e6bda perf: 降低合并阈值 0.85→0.80(继续探底) Serendipity 2026-06-14 12:09:28 +08:00
  • 2dcd74ba8f perf: 降低合并阈值 0.90→0.85(AUC 不变,继续扩大合并范围) Serendipity 2026-06-14 11:45:53 +08:00
  • 1e3b09e4cc fix: 降低 expert 合并阈值 0.97→0.90(过高导致几乎无合并) Serendipity 2026-06-14 11:32:19 +08:00
  • 3e1d5b8e59 feat: Expert 权重相似度合并(余弦相似度>0.97 的 expert 合并,减少冗余计算) Serendipity 2026-06-14 11:16:04 +08:00
  • ac859fe554 docs: 修复论文 OCR markdown 图片路径,添加 33 张提取图片 Serendipity 2026-06-13 21:23:55 +08:00
  • ac4c085c40 docs: 更新最终提交记录和优化路线(14次提交,58.49分最优) Serendipity 2026-06-13 21:22:00 +08:00
  • 531488eb7c docs: 添加 GRAB 和 HSTU 论文 OCR markdown(PaddleOCR 识别) Serendipity 2026-06-13 21:20:13 +08:00
  • f3fe2df610 revert: 移除所有 torch.compile(四战全败),回到稳定版 58.49 Serendipity 2026-06-13 14:45:32 +08:00
  • 7b429cf7fb feat: torch.compile 全模型 + dynamic=True(告知编译器形状可变,避免重编译) Serendipity 2026-06-13 14:37:38 +08:00
  • 480a81a033 fix: torch.compile mode 改为 default(避免 CUDA Graph 因 N 变化重编译) Serendipity 2026-06-13 14:20:14 +08:00
  • a74af49456 feat: torch.compile 单独编译 Expert.forward(fc1→relu→fc2 融合) Serendipity 2026-06-13 14:20:01 +08:00
  • 51ef3f66b2 docs: 补充剪枝细则(非结构化 vs 结构化)、评测细节、人工审核说明 Serendipity 2026-06-13 14:11:11 +08:00
  • 4dbee83097 feat: 2:4 非结构化稀疏仅裁剪 Expert FFN(不碰 attention/gate) Serendipity 2026-06-13 14:09:42 +08:00
  • 788ca96d50 revert: 移除 INT8 量化和 k=1 补偿,回到稳定版 58.49 Serendipity 2026-06-13 14:05:19 +08:00
  • 96462444f6 feat: INT8 动态量化所有 Linear 层(torch.ao.quantization) Serendipity 2026-06-13 13:53:45 +08:00
  • c081620ffd feat: MoE Top-1 路由 + (p1+p2) 权重补偿 Serendipity 2026-06-13 13:32:04 +08:00
  • b991f9e78e docs: 更新提交记录(消除 GPU 同步,58.49 分,88.1s) Serendipity 2026-06-13 13:29:21 +08:00
  • da37245a9b perf: SMoE 消除 GPU 同步 + CTRModel 去冗余 reshape Serendipity 2026-06-13 13:16:01 +08:00
  • 7e0876c671 revert: RepEncoder 批量 embedding 查表(94.3s vs 92.5s,略慢) Serendipity 2026-06-13 13:05:14 +08:00