CTI-Inference-Opt

Author	SHA1	Message	Date
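OwnerSunshine530	cdc2dd490b	feat: Triton varlen causal flash attention (block-diagonal, single kernel; removes per-chunk calls and mask-construction overhead). Each program handles one (query block within a user segment, head) pair, iterates only over keys in the same segment at or before that block (causal), and uses online softmax with fp16 reads/writes and fp32 accumulation. CONFIG.attn=triton (default remains chunked); _triton_block_meta computes the block→segment mapping once per batch and reuses it across the 8 layers; _resolve_attn falls back to chunked when Triton is unavailable or on CPU. Adds an equivalence test and bench --attn triton. Mathematically equivalent (same family as FlashAttention, permitted by the rules); the network architecture is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 00:14:53 +08:00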
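OwnerSunshine530	6625666010	feat: sparse_pool option: pooling via a (segment × unique) sparse matrix multiply, avoiding materializing [M, emb]. Targets the dedup expansion (15%) and segment_reduce (6.6%) seen in the profile. Heavy repetition within a segment (slot19) collapses into a single weighted term. CONFIG.sparse_pool; bench --sparse-pool; equivalence test added. Off by default, pending validation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:15:13 +08:00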
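OwnerSunshine530	3d28f61a98	feat: chunked SDPA attention (--attn chunked), splitting at user boundaries to reduce the O(S²) cost. Each chunk holds ~chunk_users users and runs causal SDPA within the chunk (already verified on the evaluation side, no nesting overhead), so sum(chunk S²) is far smaller than the total S². Only one sync is needed, to read the split boundaries. Previously the 13% local speedup at bs=16 was eaten by the MoE sync; now that the MoE sync is gone, the full benefit of chunking should show. CONFIG.attn=chunked / chunk_users; equivalence test added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 13:13:13 +08:00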
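OwnerSunshine530	cb2913cda8	perf: build the causal mask with searchsorted, removing the last sync point (repeat_interleave with tensor repeats). Dense MoE removed the MoE nonzero sync and saved 20s in evaluation; embedding fusion (which removed no sync) saved only 1s -> the real lever is eliminating sync points. The repeat_interleave(lengths tensor) in mask construction was the last sync point inside model(batch); it now uses searchsorted to compute doc_id (output size known in advance, no sync). Equivalence test added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 12:09:40 +08:00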
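OwnerSunshine530	928de22a9b	perf: RepEncoder fuses the 28-slot lookup + pooling into a single pass (fewer per-batch kernel launches, no new syncs). Builds on what made dense MoE win (removing per-batch overhead pays off disproportionately on the evaluation side). 28 embedding calls + 28 segment_reduce calls are fused into one; shapes are read via numel to avoid a sync; base accumulation is sync-free. _rep_forward_perslot is kept as the equivalence reference. CONFIG.fuse_embedding defaults to True. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 11:50:11 +08:00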
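OwnerSunshine530	7791674a32	feat: nested-tensor variable-length flash attention (--attn varlen), with unified CONFIG.attn dispatch. Each user is treated as an independent sequence with is_causal, giving block-diagonal causal attention; a single flash kernel handles all users in a batch, with no dense mask, no padding waste, and far lower overhead than FlexAttention. CONFIG.attn ∈ {sdpa (default), flex, varlen}; bench --attn varlen; test_equiv gains a varlen equivalence test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 09:06:11 +08:00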
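OwnerSunshine530	c1d8b91fb2	feat(Phase B): FlexAttention block-diagonal attention + dense vectorized MoE - scaled_dot_product dispatch: block_mask -> FlexAttention (each user attends causally only within its own sequence, avoiding O(S²) dense attention over a concatenated sequence of length ~14000); otherwise dense SDPA (fallback/reference). - CTRModel.build_block_mask builds the block-diagonal causal mask; _use_flex is enabled automatically on SM80+. - SMoE dense vectorization (einsum computes all experts in one batch, then gathers by top-k), removing the Python loop and syncs; _smoe_forward_loop is kept as the numerical-equivalence reference. Switchable via CONFIG.vectorize_moe. - load_model gains optional torch.compile. - tests/test_equiv.py: numerical equivalence of dense vs. loop MoE and of Flex vs. dense SDPA (no pytest dependency). - bench.py gains --attn/--moe/--compile for speed comparisons on A800. Needs to be measured on A800 (SM80); CPU/V100 fall back to SDPA automatically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 23:30:59 +08:00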
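
The doc_id computation described in cb2913cda8 can be illustrated with a minimal sketch. It is not the repository's code: it uses the standard library's `bisect_right` as a CPU stand-in for `torch.searchsorted`, and the names `segment_ids` and `causal_block_mask` are hypothetical. The point is that each position's segment index is found by a binary search over cumulative lengths, so no variable-size expansion of the lengths (as in `repeat_interleave`) is needed.

```python
from bisect import bisect_right
from itertools import accumulate


def segment_ids(lengths):
    """Map every position of the concatenated sequence to its user segment."""
    ends = list(accumulate(lengths))  # exclusive end offset of each segment
    # Assumption: in a tensor version the total length would come from the
    # batch shape; here it is simply the last cumulative offset.
    total = ends[-1] if ends else 0
    # bisect_right counts the segments that end at or before position p,
    # which is exactly the index of the segment containing p.
    return [bisect_right(ends, p) for p in range(total)]


def causal_block_mask(lengths):
    """Block-diagonal causal mask: query q sees key k iff same segment and k <= q."""
    doc = segment_ids(lengths)
    n = len(doc)
    return [[doc[q] == doc[k] and k <= q for k in range(n)] for q in range(n)]


ids = segment_ids([2, 3])
mask = causal_block_mask([2, 1])
```

With lengths `[2, 3]` the five positions map to segments `[0, 0, 1, 1, 1]`, and empty segments are skipped naturally because their end offsets coincide with the previous one.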

7 Commits
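
As a rough check of the claim in 3d28f61a98 that sum(chunk S²) is far smaller than the total S², the arithmetic can be sketched as follows. The user lengths and the `chunk_users` value below are made-up illustration inputs, and `chunk_costs` is a hypothetical helper, not a function from the repository. Cost is counted simply as the number of query-key pairs.

```python
def chunk_costs(user_lengths, chunk_users):
    """Return (dense_cost, chunked_cost), each counted as S squared."""
    total = sum(user_lengths)
    dense = total * total  # one attention call over the whole concatenation
    chunked = 0
    for i in range(0, len(user_lengths), chunk_users):
        # Chunks are cut at user boundaries, chunk_users users per chunk.
        s = sum(user_lengths[i:i + chunk_users])
        chunked += s * s
    return dense, chunked


# Illustrative numbers only: 16 users of length 100, 4 users per chunk.
dense, chunked = chunk_costs([100] * 16, chunk_users=4)
```

For these inputs the dense cost is 1600² = 2,560,000 pairs and the chunked cost is 4 × 400² = 640,000 pairs, a 4× reduction. With equal-length users the ratio equals the number of chunks.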