# Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

Jiaqi Zhai $ ^{1} $ Lucy Liao $ ^{1} $ Xing Liu $ ^{1} $ Yueming Wang $ ^{1} $ Rui Li $ ^{1} $ Xuan Cao $ ^{1} $ Leon Gao $ ^{1} $ Zhaojie Gong $ ^{1} $ Fangda Gu $ ^{1} $ Michael He $ ^{1} $ Yinghai Lu $ ^{1} $ Yu Shi $ ^{1} $

## Abstract

Large-scale recommendation systems are characterized by their reliance on high-cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on a huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by the success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework (“Generative Recommenders”), and propose a new architecture, HSTU, designed for high-cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192-length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint needed for future model developments, and further paves the way for the first foundation models in recommendations.

### 1. Introduction

Recommendation systems, quintessential in the realm of online content platforms and e-commerce, play a pivotal role

| Symbol | Description |
| --- | --- |
| $ \Psi_{k}(t_{j}) $ | The $ k $-th training example ($ k $ is ordered globally) emitted by the feature logging system at time $ t_{j} $. In a typical DLRM recommendation system, after the user consumes some content $ \Phi_{i} $ (by responding with an action $ a_{i} $ such as skip, video completion, or share), the feature logging system joins the tuple $ (\Phi_{i}, a_{i}) $ with the features used to rank $ \Phi_{i} $, and emits $ (\Phi_{i}, a_{i}, \text{features for } \Phi_{i}) $ as a training example $ \Psi_{k}(t_{j}) $. As discussed in Section 2.3, DLRMs and GRs deal with different numbers of training examples, with the number of examples in GRs typically being 1-2 orders of magnitude smaller. |
| $ n_{c} $ ($ n_{c,i} $) | Number of contents that a user has interacted with (of user/sample $ i $). |
| $ \Phi_{0}, \dots, \Phi_{n_{c}-1} $ | List of contents that a user has interacted with, in the context of a recommendation system. |
| $ a_{0}, \dots, a_{n_{c}-1} $ | List of user actions corresponding to the $ \Phi_{i} $s. When all predicted events are binary, each action can be considered a multi-hot vector over (atomic) events such as like, share, comment, image view, video initialization, video completion, hide, etc. |
| Symbol | Description |
| --- | --- |
| $ X $ | Input to an HSTU layer. In standard terminology (before batching), $ X \in \mathbb{R}^{N \times d} $, assuming we have an input sequence containing $ N $ tokens. |
| $ Q(X) $, $ K(X) $, $ V(X) $ | Query, key, value in HSTU obtained for a given input $ X $ based on Equation (1). The definition is similar to Q, K, and V in standard Transformers. $ Q(X) $, $ K(X) \in \mathbb{R}^{h \times N \times d_{qk}} $, and $ V(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ U(X) $ | HSTU uses $ U(X) $ to “gate” attention-pooled values ($ V(X) $) in Equation (3), which, together with $ f_2(\cdot) $, enables HSTU to avoid feedforward layers altogether. $ U(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ A(X) $ | Attention tensor obtained for input $ X $. $ A(X) \in \mathbb{R}^{h \times N \times N} $. |
| $ Y(X) $ | Output of an HSTU layer obtained for the input $ X $. $ Y(X) \in \mathbb{R}^{d} $. |
| Split($ \cdot $) | The operation that splits a tensor into chunks. $ \phi_1(f_1(X)) \in \mathbb{R}^{N \times (2hd_{qk} + 2hd_v)} $ in Equation (1); we obtain $ U(X) $, $ V(X) $ (both of shape $ h \times N \times d_v $) and $ Q(X) $, $ K(X) $ (both of shape $ h \times N \times d_{qk} $) by splitting the larger tensor (and permuting dimensions) with $ U(X) $, $ V(X) $, $ Q(X) $, $ K(X) = \text{Split}(\phi_1(f_1(X))) $. |
| $ \text{rab}^{p,t} $ | Relative attention bias that incorporates both positional (Raffel et al., 2020) and temporal information (based on the time when the tokens are observed, $ t_0, \ldots, t_{n-1} $; one possible implementation is to apply some bucketization function to $ (t_j - t_i) $ for $ (i, j) $). In practice, we share $ \text{rab}^{p,t} $ across different attention heads within a layer, hence $ \text{rab}^{p,t} \in \mathbb{R}^{1 \times N \times N} $. |
| $ \alpha $ | Parameter controlling sparsity in the Stochastic Length algorithm used in HSTU (Section 3.2). |
| $ R $ | Register size on GPUs, in the context of the HSTU algorithm discussed in Section 3.2. |
| $ m $ | Number of candidates considered in a recommendation system's ranking stage. |
| $ b_m $ | Microbatch size, in the M-FALCON algorithm discussed in Section 3.4. |
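
The tensor shapes listed for Split($ \cdot $), $ Q $, $ K $, $ V $, $ U $ and $ \text{rab}^{p,t} $ can be checked with a small NumPy sketch. This is only an illustration under stated assumptions: $ f_1 $ is modeled as a single bias-free linear map `W`, $ \phi_1 $ as SiLU, and the bias is added to the raw $ QK^{\top} $ scores; none of these choices is spelled out in the table itself, and the function name `split_uvqk` is hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU nonlinearity, used here as an assumed stand-in for phi_1.
    return x / (1.0 + np.exp(-x))

def split_uvqk(X, W, h, d_qk, d_v):
    """Project X (N x d), then split into U, V, Q, K with a leading head axis."""
    N = X.shape[0]
    Z = silu(X @ W)  # shape: N x (2*h*d_qk + 2*h*d_v)
    assert Z.shape == (N, 2 * h * d_qk + 2 * h * d_v)
    # Chunk order follows the table: U, V, Q, K.
    sizes = [h * d_v, h * d_v, h * d_qk, h * d_qk]
    U, V, Q, K = np.split(Z, np.cumsum(sizes)[:-1], axis=1)

    def to_heads(T, d_head):
        # N x (h*d_head) -> N x h x d_head -> h x N x d_head (the dimension permutation)
        return T.reshape(N, h, d_head).transpose(1, 0, 2)

    return to_heads(U, d_v), to_heads(V, d_v), to_heads(Q, d_qk), to_heads(K, d_qk)

rng = np.random.default_rng(0)
N, d, h, d_qk, d_v = 6, 8, 2, 4, 3
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, 2 * h * d_qk + 2 * h * d_v))  # hypothetical weights for f_1
U, V, Q, K = split_uvqk(X, W, h, d_qk, d_v)

# A bias of shape 1 x N x N is shared across heads, so it broadcasts over h.
rab = np.zeros((1, N, N))
scores = Q @ K.transpose(0, 2, 1) + rab  # shape: h x N x N, matching A(X)
```

Sharing one $ 1 \times N \times N $ bias per layer keeps the bias memory independent of the number of heads $ h $, which is the reason broadcasting is used here rather than a per-head tensor.
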

| Model | Input for target item $ i $ | Expected output for target item $ i $ | Architecture | Training Procedure |
| --- | --- | --- | --- | --- |
| GRs | $ \Phi_0, a_0, \Phi_1, a_1, \ldots, \Phi_i $ | $ a_i $ (target-aware) | Self-attention (HSTU) | Causal autoregressive (streaming/single-pass) |
| GRU4Rec | $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ | $ \Phi_i $ | RNNs (GRUs) | Causal autoregressive (multi-pass) |
| SASRec | $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ | $ \Phi_i $ | Self-attention (Transformers) | Causal autoregressive (multi-pass) |
| BERT4Rec / S3Rec | $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ (at inference time) | $ \Phi_i $ | Self-attention (Transformers) | Sequential multi-pass $ ^6 $ |
| DIN / BST / TWIN / TransAct | $ \Phi_0, \Phi_1, \ldots, \Phi_i $ / $ (\Phi_0, a_0), \ldots, (\Phi_{i-1}, a_{i-1}), \Phi_i $ | $ a_i $ (target-aware, implicitly as part of DLRMs) | Pairwise attention / Self-attention (Transformers) / Two-stage pairwise attention / Self-attention (Transformers) | Pointwise (generally streaming/single-pass) |

| Task | Quantity | Specification (Inputs / Outputs / Length) |
| --- | --- | --- |
| Ranking | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ a_{0}, \varnothing, a_{1}, \varnothing, \ldots, a_{n_{c}-2}, \varnothing, a_{n_{c}-1}, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
| Retrieval | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ \varnothing, \Phi_{1}, \varnothing, \Phi_{2}, \ldots, \varnothing, \Phi_{n_{c}-1}, \varnothing, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
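
The two input/output layouts above can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, contents and actions are plain strings, and `None` stands in for $ \varnothing $ (no supervision at that position).

```python
def build_action_targets(contents, actions):
    """Interleave contents and actions; supervise a_i at the position of Phi_i."""
    assert len(contents) == len(actions)
    xs, ys = [], []
    for phi, a in zip(contents, actions):
        xs += [phi, a]    # x: Phi_i followed by a_i
        ys += [a, None]   # y: a_i at Phi_i, nothing at a_i
    return xs, ys

def build_next_content_targets(contents, actions):
    """Same inputs; supervise Phi_{i+1} at the position of a_i."""
    assert len(contents) == len(actions)
    xs, ys = [], []
    for i, (phi, a) in enumerate(zip(contents, actions)):
        xs += [phi, a]
        nxt = contents[i + 1] if i + 1 < len(contents) else None
        ys += [None, nxt]  # y: nothing at Phi_i, next content at a_i
    return xs, ys

contents = ["p0", "p1", "p2"]
actions = ["like", "skip", "share"]
rank_x, rank_y = build_action_targets(contents, actions)
retr_x, retr_y = build_next_content_targets(contents, actions)
```

In both cases the sequence length is $ n = 2n_{c} $, since every content contributes one content token and one action token.
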

| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| --- | --- | --- | --- | --- | --- | --- |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| | BERT4Rec | .2843 (-0.4%) | - | - | .1537 (-4.1%) | - |
| | GRU4Rec | .2811 (-1.5%) | - | - | .1648 (+2.8%) | - |
| | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| | BERT4Rec | .2816 (-3.4%) | - | - | .1703 (+5.1%) | - |
| | GRU4Rec | .2813 (-3.2%) | - | - | .1730 (+6.7%) | - |
| | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |
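
For readers less familiar with the metrics in this table, the sketch below shows the usual definitions of HR@K and NDCG@K when each test case has a single held-out item, together with the relative change reported in parentheses. The evaluation protocol is not stated in this excerpt, so the single-relevant-item setting and the function names are assumptions.

```python
import math

def hit_rate_at_k(ranks, k):
    """Fraction of test cases whose held-out item is ranked in the top k (1-based ranks)."""
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """NDCG@k with one relevant item: ideal DCG is 1, so the gain is 1 / log2(rank + 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

def relative_change(new, base):
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base

ranks = [1, 3, 12, 60]          # hypothetical ranks of the held-out items
hr10 = hit_rate_at_k(ranks, 10)
ndcg10 = ndcg_at_k(ranks, 10)

# HSTU vs. SASRec, HR@10 on ML-1M (values taken from the table).
gain_pct = round(100 * relative_change(0.3097, 0.2853), 1)
```

Recomputing the percentages from the rounded table entries reproduces most cells (for example +8.6% above), but a few differ slightly (for example Books NDCG@10), presumably because the reported percentages were derived from unrounded metric values.
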

| Task | | Specification (Inputs / Outputs) |
| --- | --- | --- |
| Ranking | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ a_{0}, \varnothing, a_{1}, \varnothing, \ldots, a_{n_{c}-1}, \varnothing $ |
| Retrieval | $ x_{i} $s | $ (\Phi_{0}, a_{0}), (\Phi_{1}, a_{1}), \ldots, (\Phi_{n_{c}-1}, a_{n_{c}-1}) $ |
| | $ y_{i} $s | $ \Phi_{1}^{\prime}, \Phi_{2}^{\prime}, \ldots, \Phi_{n_{c}-1}^{\prime}, \varnothing $ |

$ \Phi_{i}^{\prime} = \Phi_{i} $ if $ a_{i} $ is positive, otherwise $ \varnothing $.
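
The retrieval row differs from the interleaved layout in that each token is a (content, action) pair and a target is kept only when the next action is positive. A minimal sketch, assuming a caller-supplied predicate `is_positive` (hypothetical; the excerpt does not say which actions count as positive) and `None` for $ \varnothing $:

```python
def retrieval_targets(contents, actions, is_positive):
    """Pair-level retrieval example: y at position i is Phi_{i+1} if a_{i+1} is positive."""
    assert len(contents) == len(actions)
    xs = list(zip(contents, actions))
    ys = []
    for i in range(len(xs)):
        has_next = i + 1 < len(xs)
        if has_next and is_positive(actions[i + 1]):
            ys.append(contents[i + 1])   # Phi'_{i+1} = Phi_{i+1}
        else:
            ys.append(None)              # masked target, or end of sequence
    return xs, ys

# Illustrative choice of positive actions; not taken from the paper.
positive_actions = {"like", "share"}
pairs, targets = retrieval_targets(
    ["p0", "p1", "p2"],
    ["like", "skip", "share"],
    lambda a: a in positive_actions,
)
```

Here `p1` is dropped as a target because it was skipped, while `p2` is kept because it was shared.
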

| Metric Name | Selection Type: Greedy | Selection Type: Weighted | Selection Type: Random |
| --- | --- | --- | --- |
| Main Engagement Metric (NE) | 0.495 | 0.494 | 0.495 |
| Main Consumption Metric (NE) | 0.792 | 0.789 | 0.791 |

| Alpha vs. Max Sequence Length | 1,024 (sparsity) | 1,024 (s2) | 2,048 (sparsity) | 2,048 (s2) | 4,096 (sparsity) | 4,096 (s2) | 8,192 (sparsity) | 8,192 (s2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 71.5% | 89.4% | 75.8% | 92.3% | 79.4% | 94.7% | 83.8% | 97.3% |
| 1.7 | 57.3% | 77.6% | 60.6% | 79.8% | 67.3% | 86.6% | 74.5% | 93.3% |
| 1.8 | 37.5% | 56.2% | 42.6% | 62.1% | 51.9% | 74.2% | 62.6% | 85.5% |
| 1.9 | 15.0% | 25.2% | 17.7% | 29.0% | 29.6% | 47.5% | 57.8% | 80.9% |
| 2.0 | 1.2% | 1.7% | 2.5% | 3.5% | 18.9% | 30.8% | 57.6% | 80.6% |

| Alpha vs. Max Sequence Length | 1,024 (sparsity) | 1,024 (s2) | 2,048 (sparsity) | 2,048 (s2) | 4,096 (sparsity) | 4,096 (s2) | 8,192 (sparsity) | 8,192 (s2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 68.0% | 85.0% | 74.6% | 90.8% | 78.6% | 93.5% | 83.5% | 97.3% |
| 1.7 | 56.3% | 76.1% | 61.2% | 80.6% | 67.5% | 87.0% | 74.3% | 93.3% |
| 1.8 | 38.9% | 58.3% | 42.0% | 61.3% | 50.4% | 72.4% | 61.0% | 84.4% |
| 1.9 | 16.2% | 27.3% | 17.3% | 28.6% | 27.2% | 44.4% | 54.3% | 77.8% |
| 2.0 | 0.9% | 1.2% | 1.6% | 2.1% | 13.5% | 22.5% | 54.0% | 77.4% |
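
The notation table introduces $ \alpha $ as the sparsity parameter of Stochastic Length, but this excerpt does not restate the selection rule. The sketch below is therefore a hedged illustration of one rule that is consistent with the trend in the tables (no subsampling pressure at $ \alpha = 2.0 $ for short histories, more at smaller $ \alpha $): sequences no longer than $ N^{\alpha/2} $ are kept whole, and longer ones are kept whole with probability $ N^{\alpha} / n_{c}^{2} $ and otherwise subsampled to $ N^{\alpha/2} $ elements. The rule itself, the uniform order-preserving subsampling, and the function name are all assumptions.

```python
import random

def stochastic_length(seq, max_len, alpha, rng=random):
    """Return seq or an order-preserving random subsequence (assumed selection rule)."""
    n = len(seq)
    target = int(round(max_len ** (alpha / 2)))
    if n <= target:
        return list(seq)                      # short histories are never subsampled
    keep_full_prob = (max_len ** alpha) / (n ** 2)
    if rng.random() < keep_full_prob:
        return list(seq)                      # occasionally train on the full history
    kept = sorted(rng.sample(range(n), target))
    return [seq[i] for i in kept]             # uniform subsample, original order kept

history = list(range(100))                     # hypothetical user history
short = stochastic_length(history, max_len=128, alpha=2.0, rng=random.Random(0))
sampled = stochastic_length(history, max_len=16, alpha=1.0, rng=random.Random(0))
```

Under this assumed rule, attention cost on a subsampled sequence scales with $ (N^{\alpha/2})^{2} = N^{\alpha} $ rather than $ n_{c}^{2} $, which is one way to read why lowering $ \alpha $ increases the reported sparsity.
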




| Evaluation Strategy | Model Type | Average NE Difference vs Full Sequence Baseline (2048 / 52% Sparsity) | Average NE Difference vs Full Sequence Baseline (4096 / 75% Sparsity) |
| --- | --- | --- | --- |
| Zero-shot | HSTU (Raffel et al., 2020) | 6.46% | 10.35% |
| | HSTU-RoPE (Peng et al., 2024) | 7.51% | 11.27% |
| Fine-tune | HSTU (Raffel et al., 2020) | 1.92% | 2.21% |
| | HSTU-RoPE (Peng et al., 2024) | 1.61% | 2.19% |
| Stochastic Length (SL) | HSTU | 0.098% | 0.64% |





| Architecture | HR@10 | HR@50 |
| --- | --- | --- |
| Transformers | .0442 | .2025 |
| HSTU ($ -\text{rab}^{p,t} $, Softmax) | .0617 | .2496 |
| HSTU ($ -\text{rab}^{p,t} $) | .0893 | .3170 |

| Alpha ($ \alpha $) vs. Max Sequence Length | 1,024 | 2,048 | 4,096 | 8,192 |
| --- | --- | --- | --- | --- |
| 1.6 | 71.5% | 76.1% | 80.5% | 84.4% |
| 1.7 | 56.1% | 63.6% | 69.8% | 75.6% |
| 1.8 | 40.2% | 45.3% | 54.1% | 66.4% |
| 1.9 | 17.2% | 21.0% | 36.3% | 64.1% |
| 2.0 | 3.1% | 6.6% | 29.1% | 64.1% |

| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| --- | --- | --- | --- | --- | --- | --- |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |

| Architecture | Retrieval log pplx. | Ranking NE (E-Task) | Ranking NE (C-Task) |
| --- | --- | --- | --- |
| Transformers | 4.069 | NaN | NaN |
| HSTU ($ -\text{rab}^{p,t} $, Softmax) | 4.024 | .5067 | .7931 |
| HSTU ($ -\text{rab}^{p,t} $) | 4.021 | .4980 | .7860 |
| Transformer++ | 4.015 | .4945 | .7822 |
| HSTU (original rab) | 4.029 | .4941 | .7817 |
| HSTU | 3.978 | .4937 | .7805 |


| Methods | Offline HR@K (K=100) | Offline HR@K (K=500) | Online metrics (E-Task) | Online metrics (C-Task) |
| --- | --- | --- | --- | --- |
| DLRM | 29.0% | 55.5% | +0% | +0% |
| DLRM (abl. features) | 28.3% | 54.3% | - | - |
| GR (content-based) | 11.6% | 18.8% | - | - |
| GR (interactions only) | 35.6% | 61.7% | - | - |
| GR (new source) | 36.9% | 62.4% | +6.2% | +5.0% |
| GR (replace source) | | | +5.1% | +1.9% |

| Methods | Offline NEs (E-Task) | Offline NEs (C-Task) | Online metrics (E-Task) | Online metrics (C-Task) |
| --- | --- | --- | --- | --- |
| DLRM | .4982 | .7842 | +0% | +0% |
| DLRM (DIN+DCN) | .5053 | .7899 | - | - |
| DLRM (abl. features) | .5053 | .7925 | - | - |
| GR (interactions only) | .4851 | .7903 | - | - |
| GR | .4845 | .7645 | +12.4% | +4.4% |
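
The NE columns are not defined in this excerpt. Assuming they denote normalized entropy as commonly used for engagement prediction (average log loss divided by the entropy of the background positive rate, so lower is better and 1.0 matches a constant base-rate predictor), the metric can be sketched as follows. The function name and the toy labels are illustrative only.

```python
import math

def normalized_entropy(labels, preds, eps=1e-12):
    """Average binary log loss divided by the entropy of the empirical positive rate."""
    n = len(labels)
    log_loss = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, preds)
    ) / n
    base = sum(labels) / n  # empirical positive rate; must be strictly between 0 and 1
    base_entropy = -(base * math.log(base) + (1 - base) * math.log(1 - base))
    return log_loss / base_entropy

labels = [1, 0, 0, 1]
ne_constant = normalized_entropy(labels, [0.5, 0.5, 0.5, 0.5])  # base-rate predictor
ne_better = normalized_entropy(labels, [0.9, 0.2, 0.1, 0.8])    # informative predictor
```

Under this reading, differences in the third decimal place of NE, as in the table above, are relative to a baseline that is already well below 1.0.
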






