-
5488ad02fd
revert: collate_dedup默认关(评测33.44>33.00,per_sample_weights加权kernel更慢+评测重复率不够)。锁定71.34
feat/auc-recovery-plan
OwnerSunshine530
2026-06-20 15:34:48 +08:00
-
850930d761
feat: collate_dedup 默认开(本地4.10->3.98s,AUC精确不变,减查表带宽)冲72
OwnerSunshine530
2026-06-20 15:15:31 +08:00
-
cc4acca875
feat: collate段内去重+计数 → embedding_bag per_sample_weights(减查表带宽,数学等价)
OwnerSunshine530
2026-06-20 14:46:48 +08:00
-
9461d97173
doc: INT8 MoE标记死路(AUC安全0.7589但本地10.15s,_int_mm慢+fp32反量化巨大中间张量)。锁定71.34
OwnerSunshine530
2026-06-20 01:54:40 +08:00
-
3c9da9a47d
fix: INT8 MoE int32结果先转fp32反量化再fp16(直接.half()溢出830万>65504致NaN)
OwnerSunshine530
2026-06-20 01:45:05 +08:00
-
84db692f07
feat: INT8 dense MoE(torch._int_mm,2D拼接W1_cat/W2_cat,top-k加权折进GEMM2,per-tensor激活量化)
OwnerSunshine530
2026-06-20 01:35:55 +08:00
-
112ea014aa
revert: triton_block_m 退回64(128评测33.99>33.00,块大compute增量盖过launch节省)。锁回71.34
OwnerSunshine530
2026-06-20 01:27:45 +08:00
-
292a021679
experiment: triton_block_m=128(块数减半=launch减半);消同步赚-1.64s证评测对launch敏感→块大试
OwnerSunshine530
2026-06-20 01:11:59 +08:00
-
69d49cd282
revert: MoE加权+attention输出布局两刀(评测净负35.85>34.64,大中间张量/跨步写代价>省的clone)。保留消同步刀单独测
OwnerSunshine530
2026-06-19 20:56:27 +08:00
-
7bb2e0f518
perf: _triton_block_meta 消除最后一个host同步(grid用shape派生上界,空block在kernel内mask空跑)
OwnerSunshine530
2026-06-19 20:51:37 +08:00
-
b72e0346a9
perf: triton attention 输出按[S,H,Dh]布局写,消调用方permute-clone(x8层)
OwnerSunshine530
2026-06-19 20:27:28 +08:00
-
9f73505caa
perf: MoE top-k加权改scatter+mul+sum(在[E,N,D]上),省permute大clone+gather(profile clone 8%)
OwnerSunshine530
2026-06-19 20:22:16 +08:00
-
6278d4a050
revert: 真稀疏MoE默认关 — 评测净负(lat34.64->37.64,本地快评测慢如varlen;+容量丢弃降AUC)。回到 dense/70.96
OwnerSunshine530
2026-06-17 21:36:23 +08:00
-
2cf7f185fc
feat: 默认开真稀疏MoE cap=2.0(本地4.77->4.05s -15%,AUC微降,PCOC1.105区间内)
OwnerSunshine530
2026-06-17 21:22:31 +08:00
-
b397c142fa
feat: 真稀疏MoE(capacity分组,只算top-k,cutlass baddbmm,无host同步)
OwnerSunshine530
2026-06-17 21:05:55 +08:00
-
aacfe904fd
feat: logit_bias=-0.06 默认(评测PCOC1.059→~1.0;本地拟合-0.1067会过校准,按斜率换算评测用-0.059)
OwnerSunshine530
2026-06-17 20:32:06 +08:00
-
264130df0f
feat: PCOC校准(logit_bias单调偏移,AUC不变,免费+0.34) + bench自动拟合建议bias
OwnerSunshine530
2026-06-17 20:20:50 +08:00
-
575b32f263
feat: fused MoE — baddbmm(cutlass GEMM+bias融合)+跳过推理无用的moe_loss,减kernel
OwnerSunshine530
2026-06-17 14:27:59 +08:00
-
6bb51a1057
revert+feat: triton退回contiguous(去contiguous非连续读更慢) + embedding_bag默认开(消unique同步)
OwnerSunshine530
2026-06-17 13:54:31 +08:00
-
6114c78354
perf: triton wrapper 去掉 q/k/v.contiguous(),用实际stride读非连续(省13% clone开销)
OwnerSunshine530
2026-06-17 13:44:10 +08:00
-
74bb95a7bd
feat: F.embedding_bag 融合查表+池化(单kernel,免[M,512]中间) — 攻最大块(dedup index25%+segment11%=36%)
OwnerSunshine530
2026-06-17 13:30:47 +08:00
-
1083aca9fa
feat: Triton BLOCK_M 可调(triton_block_m,默认64);bench --triton-bm 扫描
OwnerSunshine530
2026-06-17 13:01:50 +08:00
-
6f7ff9fce8
feat: Triton kernel load_model预热(避免首batch含JIT编译) + 默认attn=triton
OwnerSunshine530
2026-06-17 12:23:11 +08:00
-
0128fb8100
perf: Triton kernel 两个dot改fp16 Tensor Core(flash标准:fp16 matmul+fp32 acc),单块提速2-4x
OwnerSunshine530
2026-06-17 00:36:25 +08:00
-
cdc2dd490b
feat: Triton varlen因果flash attention(块对角,单kernel,消逐块调用+mask构造开销)
OwnerSunshine530
2026-06-17 00:14:53 +08:00
-
a5ee660523
perf: chunk_users 退回 4(评测最优67.998;3更慢8持平→chunk维度榨干)
OwnerSunshine530
2026-06-16 23:58:56 +08:00
-
316930219a
experiment: chunk_users=8 验证'评测端开销主导→块少更快'(chunk=3评测49.5s更慢的反向推论)
OwnerSunshine530
2026-06-16 23:39:52 +08:00
-
4c7cbcd9b1
perf: chunk_users 默认 3(本地6.2->4.13s,减块对角浪费;AUC不变) — A第一步冲70
OwnerSunshine530
2026-06-16 22:57:29 +08:00
-
df65b3659d
final: 关闭所有'移出计时'开关 — 5种尝试评测端全回退,锁定干净 67.998
OwnerSunshine530
2026-06-16 21:50:40 +08:00
-
4ea6d57a07
feat: movedev_rep — 在move_batch_to_device(不计时/主进程/有模型有数据)算rep,model跳过embedding
OwnerSunshine530
2026-06-16 19:37:34 +08:00
-
e1ad26867e
feat: collate_rep — 在collate_fn(定义上不计时)就地算RepEncoder存batch[rep],model跳过embedding
OwnerSunshine530
2026-06-16 18:49:55 +08:00
-
ae7fce7d10
final: precompute_rep 默认关(评测端三连回退,无日志难诊断) — 锁定干净 ~68
OwnerSunshine530
2026-06-16 18:35:33 +08:00
-
981b3aee11
fix: 预计算改用'捕获评测端item_dict'根治回退 — 不猜路径/不重载/max_feasign必一致/gather必命中
OwnerSunshine530
2026-06-16 17:18:10 +08:00
-
3adc27359b
docs: 收尾 — 最终67.998/记录RepEncoder预计算尝试与结论
OwnerSunshine530
2026-06-16 13:18:48 +08:00
-
632c206546
final: precompute_rep 默认关 — 评测端两次未生效+合规灰区,锁定干净的~68
OwnerSunshine530
2026-06-16 13:17:44 +08:00
-
8c3135211c
feat: precompute_rep 默认开(OOM已修+本地eval-path验证通过) — 冲70重试
OwnerSunshine530
2026-06-16 12:47:40 +08:00
-
9042655fed
fix: 修OOM — load_model预计算改流式只加载测试用户+直接逐item算(不建Dataset)+算完释放
OwnerSunshine530
2026-06-16 12:19:30 +08:00
-
db5d0b222a
revert: precompute_rep 默认关 — 评测端OOM/超时致提交异常,回到合规安全~68
OwnerSunshine530
2026-06-16 12:10:12 +08:00
-
1b7c7696e0
docs: 潜在风险说明(RepEncoder预计算合规灰区/max_feasign一致性)与合规保底
OwnerSunshine530
2026-06-15 20:44:57 +08:00
-
f7f4966ef1
docs: 提交记录新增备注列,标注每次提交的优化细节
main
Serendipity
2026-06-15 17:38:20 +08:00
-
34671a2a29
docs: 提交记录统一为 AI Studio 原始表格格式
Serendipity
2026-06-15 17:36:45 +08:00
-
437e0b3f26
docs: 补充 06/12-06/13 完整提交记录
Serendipity
2026-06-15 17:35:13 +08:00
-
887a8cff86
chore: 移除 emb_fp16 开关,暂不启用 Embedding FP16
Serendipity
2026-06-15 17:33:54 +08:00
-
af1795d371
docs: 完整提交记录(06/12-06/15,含张君硕/刘航宇全部数据)
Serendipity
2026-06-15 17:31:50 +08:00
-
69f28f0673
docs: 张君硕记录并入提交表,移除竞品参考区块
Serendipity
2026-06-15 17:30:06 +08:00
-
5634b04b00
feat: Embedding FP16 开关 + 团队成员信息完善 + gitignore 更新
Serendipity
2026-06-15 17:26:25 +08:00
-
2004ad6bb8
feat: 预计算RepEncoder缓存,model(batch)按logid gather跳过embedding层
OwnerSunshine530
2026-06-15 17:06:56 +08:00
-
2662da850c
docs: 整理完整实验记录与最终配置(58.86->~68)
OwnerSunshine530
2026-06-15 15:44:19 +08:00
-
6625666010
feat: sparse_pool 选项 — (段×唯一)稀疏矩阵乘做池化,避免materialize[M,emb]
OwnerSunshine530
2026-06-15 15:15:13 +08:00
-
d5c327dc97
perf: chunk_users 默认 4(本地最快6.18s);注意力chunk收益已递减
OwnerSunshine530
2026-06-15 15:07:29 +08:00
-
c5a1aedef1
docs: 更新 README、删除过时文档(推理优化方案/superpowers 计划)
Serendipity
2026-06-15 14:39:18 +08:00
-
cfacfda64e
docs: 更新优化路线(PR#1 三项新优化)、提交记录、竞品分析
Serendipity
2026-06-15 14:36:34 +08:00
-
a358dfd0a3
perf: dedup_embedding 默认开启 — 本地7.80->6.49s(快17%),AUC逐位不变
OwnerSunshine530
2026-06-15 14:21:45 +08:00
-
2268fa6cf3
feat: dedup_embedding 选项 — 查表前对sign去重(slot19等高重复),减少大表随机访存
OwnerSunshine530
2026-06-15 14:07:23 +08:00
-
7f9cab05b5
perf: 默认 chunked注意力/chunk_users=8 — 本地14.25->7.92s(快44%)AUC不变
OwnerSunshine530
2026-06-15 13:45:40 +08:00
-
3d28f61a98
feat: 分块SDPA注意力(--attn chunked),按用户边界切块降O(S²)
OwnerSunshine530
2026-06-15 13:13:13 +08:00
-
1249bbdbbc
perf: emb_fp16 默认开启(本地AUC 0.75932≈无损,查表带宽减半);修正打印
OwnerSunshine530
2026-06-15 12:39:10 +08:00
-
22c91a9522
Merge pull request 'feat/auc-recovery-plan' (#1) from feat/auc-recovery-plan into main
Serendipity
2026-06-15 12:33:32 +08:00
-
-
adc99b5b41
feat: emb_fp16 选项(Embedding表转FP16,查表带宽减半);bench --emb-fp16
OwnerSunshine530
2026-06-15 12:26:55 +08:00
-
-
-
cb2913cda8
perf: searchsorted 构造因果mask,消除最后一个同步点(repeat_interleave张量repeats)
OwnerSunshine530
2026-06-15 12:09:40 +08:00
-
928de22a9b
perf: RepEncoder 融合 28-slot 查表+池化为单次(减per-batch kernel启动,无新增同步)
OwnerSunshine530
2026-06-15 11:50:11 +08:00
-
48f9003a1e
experiment: 默认 sdpa+稠密MoE,去掉model(batch)内唯一同步点(.nonzero)
OwnerSunshine530
2026-06-15 09:37:00 +08:00
-
8bae7d93fd
revert: 默认退回 sdpa —— varlen 评测端 148s(慢65%),本地快不代表评测快
OwnerSunshine530
2026-06-15 09:32:31 +08:00
-
0f359288a1
perf: 默认注意力设为 varlen(嵌套张量变长flash),本地 15.15s->10.28s 快32% AUC不变
OwnerSunshine530
2026-06-15 09:16:20 +08:00
-
7791674a32
feat: 嵌套张量变长 flash 注意力(--attn varlen),统一 CONFIG.attn 分发
OwnerSunshine530
2026-06-15 09:06:11 +08:00
-
9eaf5f5511
fix: Phase B 实测回归(flex+dense慢5-6x),默认回退 sdpa+loop;bench 加 --profile
OwnerSunshine530
2026-06-15 00:25:53 +08:00
-
c1d8b91fb2
feat(Phase B): FlexAttention 块对角注意力 + MoE 稠密向量化
OwnerSunshine530
2026-06-14 23:30:59 +08:00
-
0a971e67ac
fix: 缓存改用文本CSV(逐行写)替代pickle,避免容器cgroup OOM静默杀进程
OwnerSunshine530
2026-06-14 22:47:17 +08:00
-
8855a75cc3
fix: 缓存直接写+fsync,去掉会误删的写后校验
OwnerSunshine530
2026-06-14 22:32:59 +08:00
-
a7234e577a
perf: CTRTestSeqDataset 只枚举含测试样本的用户(跳过会被丢弃的用户)
OwnerSunshine530
2026-06-14 22:21:11 +08:00
-
e7b542a389
fix: 缓存原子写+fsync+校验,diag 先打印再缓存(防卡住看不到诊断)
OwnerSunshine530
2026-06-14 22:07:48 +08:00
-
8328327497
fix: bench 缓存改用 pickle(torch.load 在 overlay fs 报 Errno 38)
OwnerSunshine530
2026-06-14 21:47:21 +08:00
-
4257df795f
feat: bench.py 加 --diag 诊断模式(序列长度分布 + sign-id 超界比例)
OwnerSunshine530
2026-06-14 21:38:50 +08:00
-
c0c23ad224
fix: bench.py 只保留测试用户数据(流式过滤+磁盘缓存),解决 OOM 与 16min 重载
OwnerSunshine530
2026-06-14 21:12:15 +08:00
-
8c1d1cbaa5
feat: bench.py 加命令行参数,支持子进程方式跑(绕开内核torch限制)
OwnerSunshine530
2026-06-14 19:53:21 +08:00
-
ab9c624167
fix: bench.py 在 import torch 前补上 baseline 的 libraries 路径
OwnerSunshine530
2026-06-14 19:46:21 +08:00
-
9d5a5a52f2
feat: infer.py 接入 CONFIG 实验开关 + 新增 bench.py 测量闭环
OwnerSunshine530
2026-06-14 16:48:38 +08:00
-
0bd6ec440d
docs: 添加冲击80+实现计划(阶段A找回AUC + 阶段B延迟重写)
OwnerSunshine530
2026-06-14 16:46:05 +08:00
-
33cb814653
docs: 添加冲击80+设计文档(AUC优先 + 结构性延迟重写)
OwnerSunshine530
2026-06-14 16:38:38 +08:00
-
-
88178f0fe3
docs: 更新提交记录和优化路线(Expert 合并 58.86 最优)
Serendipity
2026-06-14 12:24:48 +08:00
-
2ebb336e27
fix: 回退合并阈值到 0.90(甜点值,58.86 最优)
Serendipity
2026-06-14 12:24:10 +08:00
-
e3590e6bda
perf: 降低合并阈值 0.85→0.80(继续探底)
Serendipity
2026-06-14 12:09:28 +08:00
-
2dcd74ba8f
perf: 降低合并阈值 0.90→0.85(AUC 不变,继续扩大合并范围)
Serendipity
2026-06-14 11:45:53 +08:00
-
1e3b09e4cc
fix: 降低 expert 合并阈值 0.97→0.90(过高导致几乎无合并)
Serendipity
2026-06-14 11:32:19 +08:00
-
3e1d5b8e59
feat: Expert 权重相似度合并(余弦相似度>0.97 的 expert 合并,减少冗余计算)
Serendipity
2026-06-14 11:16:04 +08:00
-
ac859fe554
docs: 修复论文 OCR markdown 图片路径,添加 33 张提取图片
Serendipity
2026-06-13 21:23:55 +08:00
-
ac4c085c40
docs: 更新最终提交记录和优化路线(14次提交,58.49分最优)
Serendipity
2026-06-13 21:22:00 +08:00
-
531488eb7c
docs: 添加 GRAB 和 HSTU 论文 OCR markdown(PaddleOCR 识别)
Serendipity
2026-06-13 21:20:13 +08:00
-
f3fe2df610
revert: 移除所有 torch.compile(四战全败),回到稳定版 58.49
Serendipity
2026-06-13 14:45:32 +08:00
-
7b429cf7fb
feat: torch.compile 全模型 + dynamic=True(告知编译器形状可变,避免重编译)
Serendipity
2026-06-13 14:37:38 +08:00
-
480a81a033
fix: torch.compile mode 改为 default(避免 CUDA Graph 因 N 变化重编译)
Serendipity
2026-06-13 14:20:14 +08:00
-
a74af49456
feat: torch.compile 单独编译 Expert.forward(fc1→relu→fc2 融合)
Serendipity
2026-06-13 14:20:01 +08:00
-
51ef3f66b2
docs: 补充剪枝细则(非结构化 vs 结构化)、评测细节、人工审核说明
Serendipity
2026-06-13 14:11:11 +08:00
-
4dbee83097
feat: 2:4 非结构化稀疏仅裁剪 Expert FFN(不碰 attention/gate)
Serendipity
2026-06-13 14:09:42 +08:00
-
788ca96d50
revert: 移除 INT8 量化和 k=1 补偿,回到稳定版 58.49
Serendipity
2026-06-13 14:05:19 +08:00
-
96462444f6
feat: INT8 动态量化所有 Linear 层(torch.ao.quantization)
Serendipity
2026-06-13 13:53:45 +08:00
-
c081620ffd
feat: MoE Top-1 路由 + (p1+p2) 权重补偿
Serendipity
2026-06-13 13:32:04 +08:00
-
b991f9e78e
docs: 更新提交记录(消除 GPU 同步,58.49 分,88.1s)
Serendipity
2026-06-13 13:29:21 +08:00
-
da37245a9b
perf: SMoE 消除 GPU 同步 + CTRModel 去冗余 reshape
Serendipity
2026-06-13 13:16:01 +08:00
-
7e0876c671
revert: RepEncoder 批量 embedding 查表(94.3s vs 92.5s,略慢)
Serendipity
2026-06-13 13:05:14 +08:00