From ac859fe554f7fbd6688225d4ede9132932d18950 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=88=98=E8=88=AA=E5=AE=87?= <3364451258@qq.com> Date: Sat, 13 Jun 2026 21:23:55 +0800 Subject: [PATCH] =?UTF-8?q?docs:=20=E4=BF=AE=E5=A4=8D=E8=AE=BA=E6=96=87=20?= =?UTF-8?q?OCR=20markdown=20=E5=9B=BE=E7=89=87=E8=B7=AF=E5=BE=84=EF=BC=8C?= =?UTF-8?q?=E6=B7=BB=E5=8A=A0=2033=20=E5=BC=A0=E6=8F=90=E5=8F=96=E5=9B=BE?= =?UTF-8?q?=E7=89=87?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - GRAB: 11 张图片(imgs/grab/) - HSTU: 22 张图片(imgs/hstu/) - 图片路径从 imgs/ 改为 imgs/grab/ 和 imgs/hstu/ --- 论文/GRAB.md | 607 ++++++------ 论文/HSTU.md | 884 +++++++++--------- .../grab/img_in_chart_box_110_146_498_401.jpg | Bin 0 -> 32325 bytes .../grab/img_in_chart_box_172_627_504_881.jpg | Bin 0 -> 26109 bytes .../grab/img_in_chart_box_694_136_1079_398.jpg | Bin 0 -> 35975 bytes .../grab/img_in_image_box_115_151_689_476.jpg | Bin 0 -> 45185 bytes .../grab/img_in_image_box_117_147_1083_497.jpg | Bin 0 -> 83186 bytes .../grab/img_in_image_box_137_140_560_318.jpg | Bin 0 -> 29858 bytes .../grab/img_in_image_box_190_187_498_411.jpg | Bin 0 -> 22353 bytes .../grab/img_in_image_box_218_138_976_460.jpg | Bin 0 -> 88796 bytes .../grab/img_in_image_box_221_139_977_474.jpg | Bin 0 -> 75623 bytes .../grab/img_in_image_box_246_907_444_1138.jpg | Bin 0 -> 23085 bytes .../grab/img_in_image_box_738_151_1080_498.jpg | Bin 0 -> 38097 bytes .../hstu/img_in_chart_box_111_654_336_834.jpg | Bin 0 -> 13131 bytes .../hstu/img_in_chart_box_131_1212_330_1373.jpg | Bin 0 -> 10463 bytes .../hstu/img_in_chart_box_132_124_555_383.jpg | Bin 0 -> 27599 bytes .../hstu/img_in_chart_box_135_394_553_647.jpg | Bin 0 -> 28963 bytes .../hstu/img_in_chart_box_136_657_552_901.jpg | Bin 0 -> 31353 bytes .../hstu/img_in_chart_box_175_710_511_907.jpg | Bin 0 -> 18609 bytes .../hstu/img_in_chart_box_179_423_503_668.jpg | Bin 0 -> 19074 bytes .../hstu/img_in_chart_box_180_131_507_377.jpg | Bin 0 -> 23703 bytes 
.../hstu/img_in_chart_box_234_650_969_970.jpg | Bin 0 -> 60343 bytes .../hstu/img_in_chart_box_308_220_881_559.jpg | Bin 0 -> 59639 bytes .../hstu/img_in_chart_box_346_660_570_836.jpg | Bin 0 -> 11436 bytes .../hstu/img_in_chart_box_368_1212_568_1372.jpg | Bin 0 -> 11193 bytes .../hstu/img_in_chart_box_607_1212_806_1371.jpg | Bin 0 -> 10885 bytes .../hstu/img_in_chart_box_669_386_1026_685.jpg | Bin 0 -> 28346 bytes .../hstu/img_in_chart_box_671_123_1023_335.jpg | Bin 0 -> 30616 bytes .../img_in_chart_box_843_1212_1043_1371.jpg | Bin 0 -> 10765 bytes .../hstu/img_in_image_box_116_139_1077_399.jpg | Bin 0 -> 87125 bytes .../hstu/img_in_image_box_205_252_758_582.jpg | Bin 0 -> 50274 bytes .../hstu/img_in_image_box_208_629_983_1103.jpg | Bin 0 -> 100216 bytes .../hstu/img_in_image_box_267_124_916_513.jpg | Bin 0 -> 117446 bytes .../hstu/img_in_image_box_307_135_886_606.jpg | Bin 0 -> 56445 bytes .../hstu/img_in_image_box_643_129_1062_604.jpg | Bin 0 -> 81158 bytes 35 files changed, 768 insertions(+), 723 deletions(-) create mode 100644 论文/imgs/grab/img_in_chart_box_110_146_498_401.jpg create mode 100644 论文/imgs/grab/img_in_chart_box_172_627_504_881.jpg create mode 100644 论文/imgs/grab/img_in_chart_box_694_136_1079_398.jpg create mode 100644 论文/imgs/grab/img_in_image_box_115_151_689_476.jpg create mode 100644 论文/imgs/grab/img_in_image_box_117_147_1083_497.jpg create mode 100644 论文/imgs/grab/img_in_image_box_137_140_560_318.jpg create mode 100644 论文/imgs/grab/img_in_image_box_190_187_498_411.jpg create mode 100644 论文/imgs/grab/img_in_image_box_218_138_976_460.jpg create mode 100644 论文/imgs/grab/img_in_image_box_221_139_977_474.jpg create mode 100644 论文/imgs/grab/img_in_image_box_246_907_444_1138.jpg create mode 100644 论文/imgs/grab/img_in_image_box_738_151_1080_498.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_111_654_336_834.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_131_1212_330_1373.jpg create mode 100644 
论文/imgs/hstu/img_in_chart_box_132_124_555_383.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_135_394_553_647.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_136_657_552_901.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_175_710_511_907.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_179_423_503_668.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_180_131_507_377.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_234_650_969_970.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_308_220_881_559.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_346_660_570_836.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_368_1212_568_1372.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_607_1212_806_1371.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_669_386_1026_685.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_671_123_1023_335.jpg create mode 100644 论文/imgs/hstu/img_in_chart_box_843_1212_1043_1371.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_116_139_1077_399.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_205_252_758_582.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_208_629_983_1103.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_267_124_916_513.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_307_135_886_606.jpg create mode 100644 论文/imgs/hstu/img_in_image_box_643_129_1062_604.jpg diff --git a/论文/GRAB.md b/论文/GRAB.md index 1a53789..baf0b91 100644 --- a/论文/GRAB.md +++ b/论文/GRAB.md @@ -18,7 +18,8 @@ Departing from the structural constraints of DLRMs, the rise of Large Language M Despite these theoretical advancements, deploying GR models in high-throughput industrial systems remains challenging due to strict online serving and optimization constraints. The primary obstacle is computational efficiency. Standard Transformer training requires extensive padding for variable-length sequences, resulting in significant computational waste (Vaswani et al., 2017; Krell et al., 2021). 
While the sequence packing—a common Natural Language Processing (NLP) technique for concatenating multiple short sequences—effectively mitigates this issue (Krell et al., 2021), its straightforward application to recommendation systems triggers a more subtle yet damaging failure mode: Distribution Skew (Baylor et al., 2017; Polyzotis et al., 2019; Sculley et al., 2015; Han et al., 2025). -In recommendations, packing a user's full history creates mini-batches with excessive intra-user correlation, which violates the i.i.d. assumption typically relied on by SGD-style optimization (Doan et al., 2020). This skew (details in Appendix D.1) causes sparse parameters (i.e., embedding tables) to overfit specific users, hindering the generalization of dense parameters (e.g., Transformer weights responsible for inference) (Naumov et al., 2019; Li et al., 2024b). This reveals a fundamental tension: sparse parameters require diverse, uncorrelated samples for robust “memorization”, whereas dense parameters benefit from long, coherent contexts for sequential “reasoning” (Cheng et al., 2016; Kang & McAuley, 2018; Sun et al., 2019). This misalignment implies that standard synchronous training on packed sequences may lead to suboptimal convergence due to the conflicting gradient requirements of the sparse and dense components (Yu et al., 2020). +In recommendations, packing a user's full history creates mini-batches with excessive intra-user correlation, which violates the i.i.d. assumption typically relied on by SGD- +style optimization (Doan et al., 2020). This skew (details in Appendix D.1) causes sparse parameters (i.e., embedding tables) to overfit specific users, hindering the generalization of dense parameters (e.g., Transformer weights responsible for inference) (Naumov et al., 2019; Li et al., 2024b). 
This reveals a fundamental tension: sparse parameters require diverse, uncorrelated samples for robust “memorization”, whereas dense parameters benefit from long, coherent contexts for sequential “reasoning” (Cheng et al., 2016; Kang & McAuley, 2018; Sun et al., 2019). This misalignment implies that standard synchronous training on packed sequences may lead to suboptimal convergence due to the conflicting gradient requirements of the sparse and dense components (Yu et al., 2020). Meanwhile, existing GR models typically ignore data heterogeneity, resulting in performance limitations (see Appendix A.3 for detailed discussion). To overcome these challenges, we propose Generative Ranking for Ads at Baidu (GRAB), an end-to-end sequential training and inference framework for industrial-grade CTR prediction. GRAB introduces three core innovations to reconcile the demands for performance, efficiency, and training stability: @@ -42,290 +43,8 @@ GR. Recent GR work models recommendation as causal Transformer-based sequential #### 3.1. DLRMs -The traditional DLRM architecture, as shown in Fig. 1, follows a modular processing pipeline for CTR prediction, handling raw features from users, candidate ads, and contextual signals. The pipeline involves: (a) expanding categorical features into fixed fields via feature engineering, (b) mapping these fields through hashing to obtain discrete ID vectors for embedding lookup in a Sparse Parameter Server Table (PSTable), and (c) concatenating and normalizing the retrieved embeddings to form a fixed-length flattened vector. This unified representation is then fed into an MLP, typically enhanced with a gating network, to model high-orderfeature interactions and generate the final CTR prediction. - -





| Model | AUC |
| --- | --- |
| DIN | 0.83309 |
| SIM Soft | 0.83520 |
| TWIN | 0.83556 |
| HSTU | 0.83590 |
| LONGER | 0.83615 |
| GRAB-small | 0.83661 |
| GRAB-standard | 0.83772 |

| Model | Params | Setting |
| --- | --- | --- |
| GRAB $ _{2l-2h-64d} $ | 6.51M | $ n_{layer}=2 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{4l-2h-64d} $ | 6.67M | $ n_{layer}=4 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{6l-2h-64d} $ | 6.83M | $ n_{layer}=6 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{2l-4h-64d} $ | 6.48M | $ n_{layer}=2 $, $ n_{head}=4 $, $ d_{model}=64 $ |
| GRAB $ _{4l-4h-64d} $ | 6.63M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=64 $ |
| GRAB $ _{4l-4h-128d} $ | 7.05M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=128 $ |
| GRAB $ _{4l-4h-256d} $ | 8.13M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=256 $ |
| GRAB $ _{4l-4h-512d} $ | 11.27M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=512 $ |

| Model | AUC |
| --- | --- |
| GRAB | 0.83772 |
| GRAB w/ Partial Token | 0.83492 |
| GRAB w/ Full Token | 0.83749 |
| GRAB w/o relative pos | 0.83768 |
| GRAB w/o relative time | 0.83743 |
| GRAB w/o relative action | 0.83724 |
| GRAB w/o Multi-channel | 0.83743 |
| GRAB w/o Target-token mix | 0.83768 |
| GRAB_sparse | 0.83614 |
| GRAB_sparse w/o STS | 0.83549 |
















| Symbol | Description |
| --- | --- |
| $ \Psi_{k}(t_{j}) $ | The k-th training example (k is ordered globally) emitted by the feature logging system at time $ t_{j} $. In a typical DLRM recommendation system, after the user consumes some content $ \Phi_{i} $ (by responding with an action $ a_{i} $ such as skip, video completion, or share), the feature logging system joins the tuple $ (\Phi_{i}, a_{i}) $ with the features used to rank $ \Phi_{i} $, and emits $ (\Phi_{i}, a_{i}, \text{features for } \Phi_{i}) $ as a training example $ \Psi_{k}(t_{j}) $. As discussed in Section 2.3, DLRMs and GRs deal with different numbers of training examples, with the number of examples in GRs typically being 1-2 orders of magnitude smaller. |
| $ n_{c} $ ($ n_{c,i} $) | Number of contents that a user has interacted with (for user/sample i). |
| $ \Phi_{0}, \dots, \Phi_{n_{c}-1} $ | List of contents that a user has interacted with, in the context of a recommendation system. |
| $ a_{0}, \dots, a_{n_{c}-1} $ | List of user actions corresponding to the $ \Phi_{i} $s. When all predicted events are binary, each action can be considered a multi-hot vector over (atomic) events such as like, share, comment, image view, video initialization, video completion, hide, etc. |

| Symbol | Description |
| --- | --- |
| $ X $ | Input to an HSTU layer. In standard terminology (before batching), $ X \in \mathbb{R}^{N \times d} $, assuming an input sequence containing N tokens. |
| $ Q(X) $, $ K(X) $, $ V(X) $ | Query, key, value in HSTU obtained for a given input X based on Equation (1). The definition is similar to Q, K, and V in standard Transformers. $ Q(X) $, $ K(X) \in \mathbb{R}^{h \times N \times d_{qk}} $, and $ V(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ U(X) $ | HSTU uses $ U(X) $ to “gate” attention-pooled values ($ V(X) $) in Equation (3), which, together with $ f_2(\cdot) $, enables HSTU to avoid feedforward layers altogether. $ U(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ A(X) $ | Attention tensor obtained for input X. $ A(X) \in \mathbb{R}^{h \times N \times N} $. |
| $ Y(X) $ | Output of an HSTU layer obtained for the input X. $ Y(X) \in \mathbb{R}^{d} $. |
| Split($ \cdot $) | The operation that splits a tensor into chunks. $ \phi_1(f_1(X)) \in \mathbb{R}^{N \times (2hd_{qk} + 2hd_v)} $ in Equation (1); we obtain $ U(X) $, $ V(X) $ (both of shape $ h \times N \times d_v $) and $ Q(X) $, $ K(X) $ (both of shape $ h \times N \times d_{qk} $) by splitting the larger tensor (and permuting dimensions), with $ U(X), V(X), Q(X), K(X) = \text{Split}(\phi_1(f_1(X))) $. |
| $ \text{rab}^{p,t} $ | Relative attention bias that incorporates both positional (Raffel et al., 2020) and temporal information (based on the times when the tokens are observed, $ t_0, \ldots, t_{n-1} $; one possible implementation is to apply a bucketization function to $ (t_j - t_i) $ for $ (i, j) $). In practice, we share $ \text{rab}^{p,t} $ across different attention heads within a layer, hence $ \text{rab}^{p,t} \in \mathbb{R}^{1 \times N \times N} $. |
| $ \alpha $ | Parameter controlling sparsity in the Stochastic Length algorithm used in HSTU (Section 3.2). |
| $ R $ | Register size on GPUs, in the context of the HSTU algorithm discussed in Section 3.2. |
| $ m $ | Number of candidates considered in a recommendation system's ranking stage. |
| $ b_m $ | Microbatch size, in the M-FALCON algorithm discussed in Section 3.4. |
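The tensor shapes in the notation table can be traced with a small NumPy sketch. This is an illustration rather than the authors' implementation: the table fixes the shapes and the Split step, but the SiLU nonlinearity for $ \phi_1 $ and the attention activation, the causal mask, the layer norm, and the names `hstu_layer_sketch`, `W1`, and `W2` are assumptions made here. The sketch returns one output row per token.

```python
import numpy as np


def silu(x):
    """SiLU nonlinearity, x * sigmoid(x) (assumed choice of activation)."""
    return x / (1.0 + np.exp(-x))


def hstu_layer_sketch(X, W1, W2, rab, h, d_qk, d_v):
    """Shape-level sketch of one HSTU-style layer.

    X:   (N, d) input tokens
    W1:  (d, 2*h*d_qk + 2*h*d_v) weights standing in for f_1 (hypothetical name)
    W2:  (h*d_v, d) weights standing in for f_2 (hypothetical name)
    rab: (1, N, N) relative attention bias shared across heads
    """
    N, _ = X.shape
    proj = silu(X @ W1)  # (N, 2*h*d_qk + 2*h*d_v)

    # Split into U, V (width h*d_v each) and Q, K (width h*d_qk each).
    bounds = np.cumsum([h * d_v, h * d_v, h * d_qk])
    U, V, Q, K = np.split(proj, bounds, axis=1)

    def to_heads(T, width):
        # (N, h*width) -> (h, N, width): the "permuting dimensions" step.
        return T.reshape(N, h, width).transpose(1, 0, 2)

    U, V = to_heads(U, d_v), to_heads(V, d_v)
    Q, K = to_heads(Q, d_qk), to_heads(K, d_qk)

    # Pointwise (non-softmax) attention with a causal mask: (h, N, N).
    causal = np.tril(np.ones((N, N)))
    A = silu(Q @ K.transpose(0, 2, 1) + rab) * causal

    # Attention-pooled values, normalized over the feature axis (assumption).
    pooled = A @ V  # (h, N, d_v)
    mean = pooled.mean(axis=-1, keepdims=True)
    var = pooled.var(axis=-1, keepdims=True)
    normed = (pooled - mean) / np.sqrt(var + 1e-6)

    # U gates the pooled values elementwise; f_2 maps back to width d.
    gated = (normed * U).transpose(1, 0, 2).reshape(N, h * d_v)
    return gated @ W2, A
```

Because the gate $ U(X) $ multiplies the pooled values elementwise before the final projection, the sketch needs no separate feedforward block, which is the property the $ U(X) $ row describes.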
| Method | Input for target item $ i $ | Expected output for target item $ i $ | Architecture | Training Procedure |
| --- | --- | --- | --- | --- |
| GRs | $ \Phi_0, a_0, \Phi_1, a_1, ..., \Phi_i $ | $ a_i $ (target-aware) | Self-attention (HSTU) | Causal autoregressive (streaming/single-pass) |
| GRU4Rec / SASRec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $ | $ \Phi_i $ | RNNs (GRUs) / Self-attention (Transformers) | Causal autoregressive (multi-pass) |
| BERT4Rec / S3Rec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $ (at inference time) | $ \Phi_i $ | Self-attention (Transformers) | Sequential multi-pass $ ^6 $ |
| DIN / BST / TWIN / TransAct | $ \Phi_0, \Phi_1, ..., \Phi_i $ / $ (\Phi_0, a_0), ..., (\Phi_{i-1}, a_{i-1}), \Phi_i $ | $ a_i $ (target-aware, implicitly as part of DLRMs) | Pairwise attention / Self-attention (Transformers) / Two-stage pairwise attention / Self-attention (Transformers) | Pointwise (generally streaming/single-pass) |

| Task | Field | Specification |
| --- | --- | --- |
| Ranking | Inputs $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | Outputs $ y_{i} $s | $ a_{0}, \varnothing, a_{1}, \varnothing, ..., a_{n_{c}-2}, \varnothing, a_{n_{c}-1}, \varnothing $ |
| | Length $ n $ | $ 2n_{c} $ |
| Retrieval | Inputs $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | Outputs $ y_{i} $s | $ \varnothing, \Phi_{1}, \varnothing, \Phi_{2}, ..., \varnothing, \Phi_{n_{c}-1}, \varnothing, \varnothing $ |
| | Length $ n $ | $ 2n_{c} $ |
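The two input/output specifications above share one interleaved input of length $ 2n_{c} $ and differ only in which positions carry a target. A minimal sketch of that construction follows. The function names are hypothetical, string tokens stand in for contents and actions, and `None` plays the role of $ \varnothing $.

```python
from typing import List, Optional, Sequence


def interleave(contents: Sequence[str], actions: Sequence[str]) -> List[str]:
    """Build x = [Phi_0, a_0, Phi_1, a_1, ...] of length 2 * n_c."""
    if len(contents) != len(actions):
        raise ValueError("each content needs exactly one action")
    xs: List[str] = []
    for content, action in zip(contents, actions):
        xs.extend([content, action])
    return xs


def action_targets(actions: Sequence[str]) -> List[Optional[str]]:
    """Targets a_0, None, a_1, None, ...: each content position predicts its action."""
    ys: List[Optional[str]] = []
    for action in actions:
        ys.extend([action, None])
    return ys


def next_content_targets(contents: Sequence[str]) -> List[Optional[str]]:
    """Targets None, Phi_1, None, Phi_2, ..., None, None.

    Each action position predicts the next content; the last one has no target.
    """
    ys: List[Optional[str]] = []
    n_c = len(contents)
    for i in range(n_c):
        ys.append(None)  # content position carries no target
        ys.append(contents[i + 1] if i + 1 < n_c else None)
    return ys
```

For example, `interleave(["p0", "p1"], ["a0", "a1"])` yields `["p0", "a0", "p1", "a1"]`, and both target lists have the same length as that input.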

| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| --- | --- | --- | --- | --- | --- | --- |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| | BERT4Rec | .2843 (-0.4%) | - | - | .1537 (-4.1%) | - |
| | GRU4Rec | .2811 (-1.5%) | - | - | .1648 (+2.8%) | - |
| | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| | BERT4Rec | .2816 (-3.4%) | - | - | .1703 (+5.1%) | - |
| | GRU4Rec | .2813 (-3.2%) | - | - | .1730 (+6.7%) | - |
| | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |


| Metric Name | Selection Type: Greedy | Selection Type: Weighted | Selection Type: Random |
| --- | --- | --- | --- |
| Main Engagement Metric (NE) | 0.495 | 0.494 | 0.495 |
| Main Consumption Metric (NE) | 0.792 | 0.789 | 0.791 |

| Alpha | Max Sequence Length 1,024: sparsity | 1,024: s2 | 2,048: sparsity | 2,048: s2 | 4,096: sparsity | 4,096: s2 | 8,192: sparsity | 8,192: s2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 71.5% | 89.4% | 75.8% | 92.3% | 79.4% | 94.7% | 83.8% | 97.3% |
| 1.7 | 57.3% | 77.6% | 60.6% | 79.8% | 67.3% | 86.6% | 74.5% | 93.3% |
| 1.8 | 37.5% | 56.2% | 42.6% | 62.1% | 51.9% | 74.2% | 62.6% | 85.5% |
| 1.9 | 15.0% | 25.2% | 17.7% | 29.0% | 29.6% | 47.5% | 57.8% | 80.9% |
| 2.0 | 1.2% | 1.7% | 2.5% | 3.5% | 18.9% | 30.8% | 57.6% | 80.6% |

| Alpha | Max Sequence Length 1,024: sparsity | 1,024: s2 | 2,048: sparsity | 2,048: s2 | 4,096: sparsity | 4,096: s2 | 8,192: sparsity | 8,192: s2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 68.0% | 85.0% | 74.6% | 90.8% | 78.6% | 93.5% | 83.5% | 97.3% |
| 1.7 | 56.3% | 76.1% | 61.2% | 80.6% | 67.5% | 87.0% | 74.3% | 93.3% |
| 1.8 | 38.9% | 58.3% | 42.0% | 61.3% | 50.4% | 72.4% | 61.0% | 84.4% |
| 1.9 | 16.2% | 27.3% | 17.3% | 28.6% | 27.2% | 44.4% | 54.3% | 77.8% |
| 2.0 | 0.9% | 1.2% | 1.6% | 2.1% | 13.5% | 22.5% | 54.0% | 77.4% |
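The tables report two columns per maximum sequence length but do not define them. As a rough sketch, assume that "sparsity" is one minus the mean sequence length divided by the maximum length, and that "s2" is the same quantity computed on squared lengths, which tracks the quadratic cost of attention. Both definitions, and the helper name `sparsity_metrics`, are assumptions for illustration. They are at least consistent with s2 exceeding sparsity in every row above.

```python
from typing import Sequence, Tuple


def sparsity_metrics(lengths: Sequence[int], max_len: int) -> Tuple[float, float]:
    """Return (sparsity, s2) for a batch of sequence lengths.

    Assumed definitions (not stated in the tables):
      sparsity = 1 - mean(len) / max_len
      s2       = 1 - mean(len ** 2) / max_len ** 2
    Lengths above max_len are clipped, mirroring truncation to the maximum.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    clipped = [min(n, max_len) for n in lengths]
    mean_len = sum(clipped) / len(clipped)
    mean_sq = sum(n * n for n in clipped) / len(clipped)
    return 1.0 - mean_len / max_len, 1.0 - mean_sq / (max_len ** 2)
```

Under these definitions s2 is never smaller than sparsity, since each ratio of length to maximum lies in [0, 1] and squaring can only shrink it.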




| Evaluation Strategy | Model Type | Average NE Difference vs Full Sequence Baseline (2048 / 52% Sparsity) | Average NE Difference vs Full Sequence Baseline (4096 / 75% Sparsity) |
| --- | --- | --- | --- |
| Zero-shot | HSTU (Raffel et al., 2020) | 6.46% | 10.35% |
| | HSTU-RoPE (Peng et al., 2024) | 7.51% | 11.27% |
| Fine-tune | HSTU (Raffel et al., 2020) | 1.92% | 2.21% |
| | HSTU-RoPE (Peng et al., 2024) | 1.61% | 2.19% |
| Stochastic Length (SL) | HSTU | 0.098% | 0.64% |






| Architecture | Retrieval log pplx. | Ranking NE (E-Task) | Ranking NE (C-Task) |
| --- | --- | --- | --- |
| Transformers | 4.069 | NaN | NaN |
| HSTU ($ -\text{rab}^{p,t} $, Softmax) | 4.024 | .5067 | .7931 |
| HSTU ($ -\text{rab}^{p,t} $) | 4.021 | .4980 | .7860 |
| Transformer++ | 4.015 | .4945 | .7822 |
| HSTU (original rab) | 4.029 | .4941 | .7817 |
| HSTU | 3.978 | .4937 | .7805 |


















| Symbol | Description |
| $ \Psi_{k}(t_{j}) $ | The k-th training example (k is ordered globally) emitted by the feature logging system at time $ t_{j} $. In a typical DLRM recommendation system, after the user consumes some content $ \Phi_{i} $ (by responding with an action $ a_{i} $ such as skip, video completion and share), the feature logging system joins the tuple $ (\Phi_{i}, a_{i}) $ with the features used to rank $ \Phi_{i} $, and emits $ (\Phi_{i}, a_{i}) $ features for $ \Phi_{i} $ as a training example $ \Psi_{k}(t_{j}) $. As discussed in Section 2.3, DLRMs and GRs deal with different numbers of training examples, with the number of examples in GRs typically being 1-2 orders of magnitude smaller. |
| $ n_{c}(n_{c,i}) $ | Number of contents that user has interacted with (of user/sample i). |
| $ \Phi_{0}, \dots, \Phi_{n_{c}-1} $ | List of contents that a user has interacted with, in the context of a recommendation system. List of user actions corresponding to $ \Phi_{i} $s. When all predicted events are binary, each action can be considered a multi-hot vector over (atomic) events such as like, share, comment, image view, video initialization, video completion, hide, etc. |
| $ a_{0}, \dots, a_{n_{c}-1} $ | List of user actions $ a_{i} $, one for each index $ i = 0, \dots, n_{c}-1 $. |

| Symbol | Description |
| --- | --- |
| $ X $ | Input to an HSTU layer. In standard terminology (before batching), $ X \in \mathbb{R}^{N \times d} $, assuming an input sequence containing $ N $ tokens. |
| $ Q(X) $, $ K(X) $, $ V(X) $ | Query, key, value in HSTU obtained for a given input X based on Equation (1). The definition is similar to Q, K, and V in standard Transformers. $ Q(X) $, $ K(X) \in \mathbb{R}^{h \times N \times d_{qk}} $, and $ V(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ U(X) $ | HSTU uses $ U(X) $ to “gate” attention-pooled values ( $ V(X) $) in Equation (3), which together with $ f_2(\cdot) $, enables HSTU to avoid feedforward layers altogether. $ U(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ A(X) $ | Attention tensor obtained for input X. $ A(X) \in \mathbb{R}^{h \times N \times N} $. |
| $ Y(X) $ | Output of an HSTU layer obtained for the input X. $ Y(X) \in \mathbb{R}^{d} $. |
| Split( $ \cdot $) | The operation that splits a tensor into chunks. In Equation (1), $ \phi_1(f_1(X)) \in \mathbb{R}^{N \times (2hd_{qk} + 2hd_v)} $; we obtain $ U(X) $, $ V(X) $ (both of shape $ h \times N \times d_v $) and $ Q(X) $, $ K(X) $ (both of shape $ h \times N \times d_{qk} $) by splitting this larger tensor (and permuting dimensions): $ U(X) $, $ V(X) $, $ Q(X) $, $ K(X) = \text{Split}(\phi_1(f_1(X))) $. |
| $ \text{rab}^{p,t} $ | Relative attention bias that incorporates both positional (Raffel et al., 2020) and temporal information. The temporal part is based on the times at which the tokens are observed, $ t_0, \ldots, t_{n-1} $; one possible implementation is to apply a bucketization function to $ (t_j - t_i) $ for each pair $ (i, j) $. In practice, we share $ \text{rab}^{p,t} $ across the attention heads within a layer, hence $ \text{rab}^{p,t} \in \mathbb{R}^{1 \times N \times N} $. |
| $ \alpha $ | Parameter controlling sparsity in the Stochastic Length algorithm used in HSTU (Section 3.2). |
| $ R $ | Register size on GPUs, in the context of the HSTU algorithm discussed in Section 3.2. |
| m | Number of candidates considered in a recommendation system's ranking stage. |
| $ b_m $ | Microbatch size, in the M-FALCON algorithm discussed in Section 3.4. |
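
To make the shapes in the Split(·) row concrete, the following is a minimal pure-Python sketch. It assumes the columns of $ \phi_1(f_1(X)) $ are laid out as `[U | V | Q | K]` with each chunk stored head by head; that layout, the function names `split_uvqk` and `shape`, and the nested-list representation are illustrative assumptions, not the paper's implementation.

```python
def split_uvqk(z, h, d_qk, d_v):
    """Split an N x (2*h*d_qk + 2*h*d_v) matrix into U, V, Q, K.

    Assumed (hypothetical) column layout: [U | V | Q | K], head-major
    inside each chunk. Returns four nested lists of shape h x N x d.
    """
    expected_width = 2 * h * d_qk + 2 * h * d_v
    if any(len(row) != expected_width for row in z):
        raise ValueError(f"every row must have {expected_width} columns")

    def take(start, d):
        # Slice h consecutive blocks of width d and move the head axis
        # first (the "permuting dimensions" step): N x (h*d) -> h x N x d.
        return [[row[start + head * d: start + (head + 1) * d] for row in z]
                for head in range(h)]

    u = take(0, d_v)
    v = take(h * d_v, d_v)
    q = take(2 * h * d_v, d_qk)
    k = take(2 * h * d_v + h * d_qk, d_qk)
    return u, v, q, k


def shape(t):
    """Return the (heads, tokens, width) shape of a nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# Toy input: N = 3 tokens, h = 2 heads, d_qk = 2, d_v = 3 -> 20 columns.
z = [[100 * r + c for c in range(20)] for r in range(3)]
u, v, q, k = split_uvqk(z, h=2, d_qk=2, d_v=3)
print(shape(u), shape(v), shape(q), shape(k))
```

A real implementation would do the same thing with a tensor library's split and reshape operations; the nested lists here only keep the sketch dependency-free.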

| Model | Input for target item $ i $ | Expected output for target item $ i $ | Architecture | Training Procedure |
| --- | --- | --- | --- | --- |
| GRs | $ \Phi_0, a_0, \Phi_1, a_1, ..., \Phi_i $ | $ a_i $ (target-aware) | Self-attention (HSTU) | Causal autoregressive (streaming/single-pass) |
| GRU4Rec<br>SASRec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $ | $ \Phi_i $ | RNNs (GRUs)<br>Self-attention (Transformers) | Causal autoregressive (multi-pass) |
| BERT4Rec<br>S3Rec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $<br>(at inference time) | $ \Phi_i $ | Self-attention (Transformers) | Sequential multi-pass $ ^6 $ |
| DIN<br>BST<br>TWIN<br>TransAct | $ \Phi_0, \Phi_1, ..., \Phi_i $<br>$ (\Phi_0, a_0), ..., (\Phi_{i-1}, a_{i-1}), \Phi_i $ | $ a_i $ (target-aware, implicitly as part of DLRMs) | Pairwise attention<br>Self-attention (Transformers)<br>Two-stage pairwise attention<br>Self-attention (Transformers) | Pointwise (generally streaming/single-pass) |

| Task | | Specification (Inputs / Outputs / Length) |
| --- | --- | --- |
| Ranking | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ a_{0}, \varnothing, a_{1}, \varnothing, ..., a_{n_{c}-2}, \varnothing, a_{n_{c}-1}, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
| Retrieval | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ \varnothing, \Phi_{1}, \varnothing, \Phi_{2}, ..., \varnothing, \Phi_{n_{c}-1}, \varnothing, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
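
The two specifications above share the same interleaved input and differ only in their targets. A minimal sketch of how such sequences could be assembled follows; the function name `build_sequences`, the use of `None` for $ \varnothing $, and the string placeholders are illustrative assumptions, not code from the paper.

```python
def build_sequences(contents, actions):
    """Interleave contents and actions, and derive both target sequences.

    contents: [Phi_0, ..., Phi_{n_c-1}], actions: [a_0, ..., a_{n_c-1}].
    Returns (x, action_targets, next_content_targets), each of length 2*n_c.
    None plays the role of the empty target.
    """
    if len(contents) != len(actions):
        raise ValueError("contents and actions must have the same length")

    n_c = len(contents)
    x, action_targets, next_content_targets = [], [], []
    for i, (phi, a) in enumerate(zip(contents, actions)):
        # Inputs: Phi_0, a_0, Phi_1, a_1, ...
        x.extend([phi, a])
        # First specification: predict a_i at the Phi_i position.
        action_targets.extend([a, None])
        # Second specification: predict Phi_{i+1} at the a_i position;
        # the last action has no following content.
        nxt = contents[i + 1] if i + 1 < n_c else None
        next_content_targets.extend([None, nxt])
    return x, action_targets, next_content_targets


x, y_actions, y_next = build_sequences(["p0", "p1", "p2"], ["a0", "a1", "a2"])
print(x)
print(y_actions)
print(y_next)
```

In both cases the sequence length is $ n = 2n_{c} $, matching the last row of each specification.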

| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| --- | --- | --- | --- | --- | --- | --- |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| | BERT4Rec | .2843 (-0.4%) | - | - | .1537 (-4.1%) | - |
| | GRU4Rec | .2811 (-1.5%) | - | - | .1648 (+2.8%) | - |
| | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| | BERT4Rec | .2816 (-3.4%) | - | - | .1703 (+5.1%) | - |
| | GRU4Rec | .2813 (-3.2%) | - | - | .1730 (+6.7%) | - |
| | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |
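
The parenthesized percentages appear to be relative changes against the SASRec (2023) row of the same dataset; that reading is an assumption, but it reproduces the ML-1M HR@10 entries. A short sketch of the arithmetic, with the hypothetical helper `relative_change_pct`:

```python
def relative_change_pct(baseline, value):
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# ML-1M HR@10 values taken from the table above.
sasrec_hr10 = 0.2853
for name, hr10 in [("HSTU", 0.3097), ("HSTU-large", 0.3294)]:
    print(f"{name}: {relative_change_pct(sasrec_hr10, hr10):+.1f}%")
```

For HSTU this is $ (0.3097 - 0.2853) / 0.2853 \approx 8.6\% $, and for HSTU-large $ (0.3294 - 0.2853) / 0.2853 \approx 15.5\% $.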

| Metric Name | Selection Type: Greedy | Selection Type: Weighted | Selection Type: Random |
| --- | --- | --- | --- |
| Main Engagement Metric (NE) | 0.495 | 0.494 | 0.495 |
| Main Consumption Metric (NE) | 0.792 | 0.789 | 0.791 |

| Alpha (rows) / Max Sequence Length (columns) | 1,024: sparsity | 1,024: s2 | 2,048: sparsity | 2,048: s2 | 4,096: sparsity | 4,096: s2 | 8,192: sparsity | 8,192: s2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 71.5% | 89.4% | 75.8% | 92.3% | 79.4% | 94.7% | 83.8% | 97.3% |
| 1.7 | 57.3% | 77.6% | 60.6% | 79.8% | 67.3% | 86.6% | 74.5% | 93.3% |
| 1.8 | 37.5% | 56.2% | 42.6% | 62.1% | 51.9% | 74.2% | 62.6% | 85.5% |
| 1.9 | 15.0% | 25.2% | 17.7% | 29.0% | 29.6% | 47.5% | 57.8% | 80.9% |
| 2.0 | 1.2% | 1.7% | 2.5% | 3.5% | 18.9% | 30.8% | 57.6% | 80.6% |

| Alpha (rows) / Max Sequence Length (columns) | 1,024: sparsity | 1,024: s2 | 2,048: sparsity | 2,048: s2 | 4,096: sparsity | 4,096: s2 | 8,192: sparsity | 8,192: s2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 68.0% | 85.0% | 74.6% | 90.8% | 78.6% | 93.5% | 83.5% | 97.3% |
| 1.7 | 56.3% | 76.1% | 61.2% | 80.6% | 67.5% | 87.0% | 74.3% | 93.3% |
| 1.8 | 38.9% | 58.3% | 42.0% | 61.3% | 50.4% | 72.4% | 61.0% | 84.4% |
| 1.9 | 16.2% | 27.3% | 17.3% | 28.6% | 27.2% | 44.4% | 54.3% | 77.8% |
| 2.0 | 0.9% | 1.2% | 1.6% | 2.1% | 13.5% | 22.5% | 54.0% | 77.4% |




| Evaluation Strategy | Model Type | Average NE Difference vs Full Sequence Baseline: 2048 / 52% Sparsity | Average NE Difference vs Full Sequence Baseline: 4096 / 75% Sparsity |
| --- | --- | --- | --- |
| Zero-shot | HSTU (Raffel et al., 2020) | 6.46% | 10.35% |
| | HSTU-RoPE (Peng et al., 2024) | 7.51% | 11.27% |
| Fine-tune | HSTU (Raffel et al., 2020) | 1.92% | 2.21% |
| | HSTU-RoPE (Peng et al., 2024) | 1.61% | 2.19% |
| Stochastic Length (SL) | HSTU | 0.098% | 0.64% |



