diff --git a/论文/GRAB.md b/论文/GRAB.md
index 1a53789..baf0b91 100644
--- a/论文/GRAB.md
+++ b/论文/GRAB.md
@@ -18,7 +18,8 @@ Departing from the structural constraints of DLRMs, the rise of Large Language M
Despite these theoretical advancements, deploying GR models in high-throughput industrial systems remains challenging due to strict online serving and optimization constraints. The primary obstacle is computational efficiency. Standard Transformer training requires extensive padding for variable-length sequences, resulting in significant computational waste (Vaswani et al., 2017; Krell et al., 2021). Sequence packing, a common Natural Language Processing (NLP) technique that concatenates multiple short sequences, effectively mitigates this issue (Krell et al., 2021), but its straightforward application to recommendation systems triggers a more subtle yet damaging failure mode: Distribution Skew (Baylor et al., 2017; Polyzotis et al., 2019; Sculley et al., 2015; Han et al., 2025).
-In recommendations, packing a user's full history creates mini-batches with excessive intra-user correlation, which violates the i.i.d. assumption typically relied on by SGD-style optimization (Doan et al., 2020). This skew (details in Appendix D.1) causes sparse parameters (i.e., embedding tables) to overfit specific users, hindering the generalization of dense parameters (e.g., Transformer weights responsible for inference) (Naumov et al., 2019; Li et al., 2024b). This reveals a fundamental tension: sparse parameters require diverse, uncorrelated samples for robust “memorization”, whereas dense parameters benefit from long, coherent contexts for sequential “reasoning” (Cheng et al., 2016; Kang & McAuley, 2018; Sun et al., 2019). This misalignment implies that standard synchronous training on packed sequences may lead to suboptimal convergence due to the conflicting gradient requirements of the sparse and dense components (Yu et al., 2020).
+In recommendations, packing a user's full history creates mini-batches with excessive intra-user correlation, which violates the i.i.d. assumption typically relied on by SGD-
+style optimization (Doan et al., 2020). This skew (details in Appendix D.1) causes sparse parameters (i.e., embedding tables) to overfit specific users, hindering the generalization of dense parameters (e.g., Transformer weights responsible for inference) (Naumov et al., 2019; Li et al., 2024b). This reveals a fundamental tension: sparse parameters require diverse, uncorrelated samples for robust “memorization”, whereas dense parameters benefit from long, coherent contexts for sequential “reasoning” (Cheng et al., 2016; Kang & McAuley, 2018; Sun et al., 2019). This misalignment implies that standard synchronous training on packed sequences may lead to suboptimal convergence due to the conflicting gradient requirements of the sparse and dense components (Yu et al., 2020).
Meanwhile, existing GR models typically ignore data heterogeneity, resulting in performance limitations (see Appendix A.3 for detailed discussion). To overcome these challenges, we propose Generative Ranking for Ads at Baidu (GRAB), an end-to-end sequential training and inference framework for industrial-grade CTR prediction. GRAB introduces three core innovations to reconcile the demands for performance, efficiency, and training stability:
@@ -42,290 +43,8 @@ GR. Recent GR work models recommendation as causal Transformer-based sequential
#### 3.1. DLRMs
-The traditional DLRM architecture, as shown in Fig. 1, follows a modular processing pipeline for CTR prediction, handling raw features from users, candidate ads, and contextual signals. The pipeline involves: (a) expanding categorical features into fixed fields via feature engineering, (b) mapping these fields through hashing to obtain discrete ID vectors for embedding lookup in a Sparse Parameter Server Table (PSTable), and (c) concatenating and normalizing the retrieved embeddings to form a fixed-length flattened vector. This unified representation is then fed into an MLP, typically enhanced with a gating network, to model high-order feature interactions and generate the final CTR prediction.
-
-

-
-
-Figure 1. The traditional DLRM architecture: sparse features are hashed to IDs and embedded via PSTable, and then concatenated into a fixed-length flattened vector for CTR prediction.
-
-
-#### 3.2. Overall Architecture of GRAB
-
-GRAB, whose overall architecture is shown in Fig. 2, models user behavior history sequences end to end for tasks such as CTR prediction. It follows a three-stage pipeline: (i) a sparse feature layer; (ii) a dense tokenizer; and (iii) a sequence modeling layer. Given raw behavior logs, GRAB first converts heterogeneous categorical signals into sparse IDs at the event level, then tokenizes each event into a dense representation, and finally applies a sequence model to estimate the click probability of candidate ads. The dense representation produced by the dense tokenizer bridges DLRM-style sparse feature engineering and GR-style sequential modeling, so that training and inference follow a single, unified computation path from input to output, which in turn improves CTR prediction performance.
-
-Sparse Feature Layer. The sparse feature layer (details in Appendix C.1) processes raw logs into time-ordered event sequences. Each event's categorical fields are converted into sparse IDs using standard DLRM feature engineering (Section 3.1), yielding a structured sequence of events annotated with field-wise IDs.
-
-Dense Tokenizer. Unlike DLRM, which collapses field embeddings into a fixed-length, order-agnostic vector for pointwise processing, GRAB preserves the temporal event structure. It aggregates per-event field embeddings and projects them into $ \mathbb{R}^{d_{model}} $ to form sequential event tokens (Appendix C.2), resulting in a time-ordered token sequence. This sequence serves as the input to a subsequent Transformer, thereby enabling the modeling of long-range dependencies and interest drift.
-
-Autoregressive-like Sequence Modeling Layer. Built on sequence packing (Section 3.3.1), heterogeneous tokens (Section 3.3.2), and action-aware relative attention bias (Section 3.3.3), our core contribution is the CamA mechanism (Section 3.3.4). CamA integrates a multi-channel design for parallel processing of diverse behaviors and inherits action-aware contextualization from RAB, providing a unified and efficient framework for modeling complex user interest patterns across scenarios.
-
-
-
-#### 3.3. Autoregressive-like Sequence Modeling Layer
-
-Following the dense tokenizer, this layer captures the temporal dependencies and dynamic evolution of user interests. It takes as input the sequence of dense event tokens generated by the preceding layer (as described in Appendix C.3). Formally, for a user $ u $, the input sequence consists of the behavior history $ \mathbf{E}^{\mathrm{beh}} = \{\mathbf{e}_t^{\mathrm{beh}}\}_{t=1}^{T_u} $ and the candidate advertisements $ \mathbf{E}^{\mathrm{ad}} = \{\mathbf{e}_i^{\mathrm{ad}}\}_{i=1}^{N_u} $, where $ \mathbf{e}_t^{\mathrm{beh}}, \mathbf{e}_i^{\mathrm{ad}} \in \mathbb{R}^{d_{\mathrm{model}}} $ are the dense embeddings of the $ t $-th behavior event and the $ i $-th candidate ad, respectively, $ T_u $ is the behavior history length, and $ N_u $ is the number of candidate ads.
-
-##### 3.3.1. SEQUENCE PACKING AND USER-ISOLATED CAUSAL MASK
-
-In industrial training logs, as shown on the left of Fig. 3a, a mini-batch is typically formed by sampling $ B_{ins} $ impression instances. Each instance contains a variable-length token sequence composed of (i) a subsequence of the user's historical behavior tokens and (ii) target advertisement tokens to be scored. A straightforward batching strategy pads every instance to a fixed length $ L_{max} $, yielding a dense tensor with dimensions $ B_{ins} \times L_{max} \times d_{model} $, which introduces substantial computational waste when most instances are much shorter than $ L_{max} $.
-
-To eliminate such padding overhead while preserving the temporal semantics, GRAB performs sequence packing by grouping tokens by user. Specifically, tokens from multiple impression instances belonging to the same user u are merged into a single contiguous token segment, while segments of different users are strictly separated. Within each user segment, all tokens are stably sorted by timestamp so that the packed segment forms a single timeline for sequential modeling. After packing, the batch is represented as one long packed tensor $ H = \text{Pack}(\mathbf{E}^{beh}, \mathbf{E}^{ad}) \in \mathbb{R}^{1 \times L \times d_{model}} $, where $ L $ denotes the total packed length across all users in the mini-batch.
-
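To make the padding overhead concrete, the following minimal sketch compares the number of token slots processed under fixed-length padding with the number processed after packing. The instance lengths and `l_max` are hypothetical values chosen for illustration, not figures from the paper.

```python
def padded_slots(lengths, l_max):
    """Token slots processed when every instance is padded to l_max."""
    return len(lengths) * l_max


def packed_slots(lengths):
    """Token slots processed when instances are concatenated without padding."""
    return sum(lengths)


# Hypothetical mini-batch of four impression instances (lengths in tokens).
lengths = [12, 3, 7, 30]
l_max = 30

padded = padded_slots(lengths, l_max)
packed = packed_slots(lengths)
waste_fraction = 1 - packed / padded  # share of padded slots that hold no real token

print(padded, packed, round(waste_fraction, 3))
```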
-For convenience, we associate each packed position $ p \in \{1, \ldots, L\} $ with (i) a segment id $ \sigma(p) \in U_B $ indicating which user it belongs to, where $ U_B $ denotes the set of users in the mini-batch, and (ii) a local time index $ \ell(p) \in \{1, \ldots, L_{\sigma(p)}\} $ within that user segment, where $ L_{\sigma(p)} $ is the length of that segment.
-
-Figure 2. Overview of GRAB's end-to-end CTR prediction pipeline: (1) Tokenizing raw fields via a sparse PSTable and fusing them into event tokens. (2) Packing tokens per user with causal and heterogeneous masks. (3) Processing through N Transformer layers equipped with the Causal Action-aware Multi-channel Attention (CamA) mechanism. (4) Final CTR prediction from the output representations.
-
-
-User-isolated causal mask. On the packed tensor $ H $, we construct an attention mask $ M^{\text{pack}} \in \mathbb{R}^{L \times L} $ that enforces two constraints: (1) user isolation (no cross-user attention), and (2) causality within each user's timeline (no future leakage). The mask is given below in its binary visibility form, where an entry of 1 means that the key position is visible to the query position. Formally, for query position $ p $ and key position $ q $,
-
- $$ M_{p,q}^{\mathrm{pack}}=\begin{cases}1,&\text{if }\sigma(p)=\sigma(q)\text{ and }\ell(q)\leq\ell(p),\\0,&\text{otherwise.}\end{cases} $$
-
-This yields a block-diagonal lower-triangular structure (as shown in Fig. 3b), where each block corresponds to one user segment.
-
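The user-isolated causal mask defined above can be sketched in a few lines. This is an illustrative construction only: the function name `pack_mask` and the two-user example are assumptions, and a production implementation would more likely build the mask on device from segment offsets.

```python
def pack_mask(segment_ids, local_index):
    """Binary visibility mask: entry [p][q] is 1 iff key q is visible to query p.

    segment_ids[p] plays the role of sigma(p) (which user owns position p),
    local_index[p] plays the role of ell(p) (time index inside that user's segment).
    """
    size = len(segment_ids)
    return [
        [
            1 if segment_ids[p] == segment_ids[q] and local_index[q] <= local_index[p] else 0
            for q in range(size)
        ]
        for p in range(size)
    ]


# Hypothetical packed stream: user "a" contributes 3 tokens, user "b" contributes 2.
segment_ids = ["a", "a", "a", "b", "b"]
local_index = [1, 2, 3, 1, 2]

mask = pack_mask(segment_ids, local_index)
for row in mask:
    print(row)
```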
-##### 3.3.2. HETEROGENEOUS BEHAVIOR TOKENS AND HETEROGENEOUS VISIBILITY MASK
-
-After sequence packing, for each user $u$, we obtain a user-isolated, time-ordered packed stream with its causal mask $M^{\text{pack}}$. To further reduce redundancy in the packed history while preserving the information needed for scoring the current candidate, we instantiate two token views at each packed timestamp $t$: the partial token (history) $\mathbf{h}_t \in \mathbb{R}^{d_{\text{model}}}$, which retains only time-varying information that is useful for representing history and discards static or highly repetitive fields (e.g., user_id) that would otherwise be duplicated across historical steps and could lead to overfitting; and the full token (candidate) $\mathbf{h}'_t \in \mathbb{R}^{d_{\text{model}}}$, which retains the complete information required to score the candidate at time $t$, including the static fields omitted from the partial history view. We then interleave them to form the heterogeneous packed sequence: $H_u = [\mathbf{h}_1, \mathbf{h}_1', \mathbf{h}_2, \mathbf{h}_2', \ldots, \mathbf{h}_{T_u}, \mathbf{h}_{T_u}']$.
-
-Heterogeneous Visibility Mask. On top of the user-isolated causal constraint encoded by $ M^{\text{pack}} $, we apply a mask-rewriting operator $ \mathcal{R}(\cdot) $ to obtain the heterogeneous visibility mask $ M^{\text{het}} $. Concretely, $ \mathcal{R}(\cdot) $ rewrites the visibility pattern according to the token types as follows: (i) partial $ (\mathcal{P}) $ tokens attend only to partial history tokens; and (ii) full $ (\mathcal{F}) $ tokens attend to partial history tokens and to themselves, but never to other full tokens. Formally, indexing the positions in $ H_u $ by $ n \in \{1, \ldots, 2T_u\} $, we define the time index $ \tau(n) = \lceil n/2 \rceil $ and the token type $ \kappa(n) = \mathcal{P} $ if $ n $ is odd and $ \kappa(n) = \mathcal{F} $ otherwise. Then, for query position $ p $ and key position $ q $, the heterogeneous mask (as shown in Fig. 4) is
-
-
-
- $$ M_{p,q}^{\mathrm{het}}=\begin{cases}1,&\kappa(p)=\mathcal{P},\;\kappa(q)=\mathcal{P},\;\tau(q)\leq\tau(p),\\ 1,&\kappa(p)=\mathcal{F},\;\kappa(q)=\mathcal{P},\;\tau(q)\leq\tau(p),\\ 1,&\kappa(p)=\mathcal{F},\;p=q,\\ 0,&\text{otherwise.}\end{cases} $$
-
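As a sanity check on the case analysis above, the heterogeneous mask for a single user can be enumerated directly. This is a minimal sketch for one user segment; the function name `het_mask` is an assumption, and positions are 1-indexed inside the helper functions to match the definitions of $\tau(n)$ and $\kappa(n)$.

```python
def tau(n):
    """Time index of 1-indexed position n: ceil(n / 2)."""
    return (n + 1) // 2


def kappa(n):
    """Token type of 1-indexed position n: partial ('P') if odd, full ('F') if even."""
    return "P" if n % 2 == 1 else "F"


def het_mask(num_steps):
    """Binary heterogeneous visibility mask for one user with num_steps timestamps."""
    size = 2 * num_steps
    mask = [[0] * size for _ in range(size)]
    for p in range(1, size + 1):
        for q in range(1, size + 1):
            if kappa(q) == "P" and tau(q) <= tau(p):
                # Cases 1 and 2: any query sees partial tokens up to its own time index.
                mask[p - 1][q - 1] = 1
            elif kappa(p) == "F" and p == q:
                # Case 3: a full token sees itself, but no other full token.
                mask[p - 1][q - 1] = 1
    return mask


mask_two_steps = het_mask(2)
for row in mask_two_steps:
    print(row)
```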
-##### 3.3.3. ACTION-AWARE ATTENTION: RELATIVE ENCODING AND EFFICIENT COMPUTATION
-
-On top of the heterogeneous behavior tokens and the heterogeneous visibility mask $ M^{\text{het}} $, we further adopt an action-aware relative attention bias (RAB) causal attention mechanism. It augments standard multi-head self-attention with three designs: a causal mask to prevent future leakage, a dual sliding-window visibility constraint to support streaming-style training, and a query-aware relative bias that enables the query to directly interact with relative position/time/action signals.
-
-Action-aware relative attention logits. Given a query $ q_{i} $ and a key $ k_{j} $, the attention logit is computed as
-
- $$ w_{i,j}=q_{i}^{\top}\left(k_{j}+Pos_{i,j}+Action_{i,j}+Time_{i,j}\right), $$
-
-where $ Pos_{i,j} $, $ Action_{i,j} $, and $ Time_{i,j} $ are learnable embeddings derived from relative position, relative action, and relative time, respectively. For continuous or large-range signals (e.g., action statistics or play durations), we first discretize them into buckets and then perform embedding lookup.
-
-
-Figure 3. Sequence packing and user-isolated causal masking in GRAB. (a) Sequence packing: instead of padding each impression instance to a fixed length $ L_{max} $, tokens from multiple impressions are concatenated within each user and different users are kept in disjoint segments, yielding a single packed sequence of length $ N_{token} $ for compute-efficient batching. (b) User-isolated causal mask: the mask exhibits a block-diagonal lower-triangular pattern, so each token can only attend to past tokens within the same user segment, enforcing both user isolation and temporal causality.
-
-Compared with a query-agnostic relative bias (e.g., $ w_{i,j} = q_i^\top k_j + Pos_{i,j} + \cdots $), Eq. 3 makes the relative signals action-aware via the interaction $ q_i^\top Pos_{i,j} $, $ q_i^\top Action_{i,j} $, and $ q_i^\top Time_{i,j} $, allowing the model to adaptively emphasize different contextual relations under different queries (i.e., target ads).
-
-
-
-
-Figure 4. Heterogeneous behavior tokens and heterogeneous visibility mask $ M^{het} $ (blue entries). Partial tokens attend only to partial-history tokens up to the current time, while full tokens attend to partial-history tokens up to their time index and to themselves, but never to other full tokens, preventing duplicated static information from propagating along time.
-
-
-Causal mask with dual sliding windows. We enforce causality and further restrict attention using combined time and length windows. The mask is defined as $ M_{p,q}^{\text{rab}} = 1 $ if $ q \leq p $ and the distance $ p - q $ does not exceed the length sliding-window limit $ L_w $; otherwise $ M_{p,q}^{\text{rab}} = 0 $.
-
-This serves two key industrial purposes: (1) it bounds per-token computation, guaranteeing stable throughput/latency over growing behavior histories; (2) it matches the online training paradigm—events arrive incrementally, and the model updates attention context on the fly without reprocessing the full sequence, boosting training efficiency and serving practicality.
-
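The length-window part of the mask above can be sketched as follows. Only the length window is covered, because the text gives no explicit formula for the time window; the function name `sliding_window_mask` and the example sizes are assumptions.

```python
def sliding_window_mask(seq_len, window):
    """Binary mask: entry [p][q] is 1 iff q <= p and p - q <= window."""
    return [
        [1 if q <= p and p - q <= window else 0 for q in range(seq_len)]
        for p in range(seq_len)
    ]


# Hypothetical sizes: 4 positions, each query sees itself and at most 2 earlier tokens.
window_mask = sliding_window_mask(seq_len=4, window=2)
for row in window_mask:
    print(row)

# Each query attends to at most window + 1 keys, so per-token cost stays bounded.
max_visible = max(sum(row) for row in window_mask)
```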
-
-
-Efficient computation. The naive implementation of Eq. 3 would yield an $ O(L^2 d_{\text{model}}) $ intermediate tensor, which is prohibitively memory-intensive in practice. We adopt the optimization of Golovneva et al. (2024) to re-order the computation. We define codebooks $ B^{\text{pos/act/time}} \in \mathbb{R}^{N_s \times d_{\text{model}}} $ and bucketized indices $ p_{i,j}, a_{i,j}, t_{i,j} $. Then Eq. 3 can be equivalently written as:
-
- $$ w_{i,j}=q_{i}^{\top}k_{j}+(s_{i}^{\mathrm{pos}})[p_{i,j}]+(s_{i}^{\mathrm{act}})[a_{i,j}]+(s_{i}^{\mathrm{time}})[t_{i,j}], $$
-
-where $ s_{i}^{\mathrm{pos}} = q_{i}^{\top}B^{\mathrm{pos}} $, $ s_{i}^{\mathrm{act}} = q_{i}^{\top}B^{\mathrm{act}} $, and $ s_{i}^{\mathrm{time}} = q_{i}^{\top}B^{\mathrm{time}} $. In practice, we first compute the projection vectors $ s_{i}^{*} $, then obtain relative terms via fast gather operations. This completely avoids the large $ L \times L \times d_{model} $ tensor, dramatically reducing peak memory and improving computational efficiency.
-
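The equivalence between Eq. 3 and its re-ordered form follows from linearity of the dot product, and can be checked numerically on a toy example. The sketch below is not the authors' implementation: all sizes, the random inputs, and the function names are assumptions, no attention mask is applied, and each codebook is stored as a list of bucket embeddings (one row per bucket), so `s_pos[b]` is the dot product of the query with row `b`.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def naive_logits(Q, K, B_pos, B_act, B_time, p_idx, a_idx, t_idx):
    """Eq. 3 as written: builds one d-dimensional vector per (i, j) pair."""
    size, dim = len(Q), len(Q[0])
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            combined = [
                K[j][m]
                + B_pos[p_idx[i][j]][m]
                + B_act[a_idx[i][j]][m]
                + B_time[t_idx[i][j]][m]
                for m in range(dim)
            ]
            out[i][j] = dot(Q[i], combined)
    return out


def reordered_logits(Q, K, B_pos, B_act, B_time, p_idx, a_idx, t_idx):
    """Re-ordered form: project each query onto the codebooks once, then gather."""
    size = len(Q)
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        s_pos = [dot(Q[i], row) for row in B_pos]    # one score per position bucket
        s_act = [dot(Q[i], row) for row in B_act]    # one score per action bucket
        s_time = [dot(Q[i], row) for row in B_time]  # one score per time bucket
        for j in range(size):
            out[i][j] = (
                dot(Q[i], K[j])
                + s_pos[p_idx[i][j]]
                + s_act[a_idx[i][j]]
                + s_time[t_idx[i][j]]
            )
    return out


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def rand_index(rows, cols, high):
    return [[random.randrange(high) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, dim, n_buckets = 4, 3, 5  # hypothetical toy sizes
Q, K = rand_matrix(seq_len, dim), rand_matrix(seq_len, dim)
B_pos, B_act, B_time = (rand_matrix(n_buckets, dim) for _ in range(3))
p_idx, a_idx, t_idx = (rand_index(seq_len, seq_len, n_buckets) for _ in range(3))

slow = naive_logits(Q, K, B_pos, B_act, B_time, p_idx, a_idx, t_idx)
fast = reordered_logits(Q, K, B_pos, B_act, B_time, p_idx, a_idx, t_idx)
max_diff = max(abs(slow[i][j] - fast[i][j]) for i in range(seq_len) for j in range(seq_len))
print(max_diff < 1e-9)
```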
-##### 3.3.4. MULTI-CHANNEL ATTENTION
-
-While the action-aware RAB attention (Section 3.3.3) enhances each individual attention logit with relative position/action/time signals, it still treats the packed stream as a single mixed sequence. However, in industrial logs, user behaviors are highly heterogeneous (e.g., spanning different time windows or encompassing different behavior types), and different behavioral subsets often exhibit distinct temporal dynamics and predictive value. A straightforward design is to flatten all tokens into a single sequence and apply causal self-attention, yet this couples heterogeneous sources into one interaction graph and incurs a quadratic cost (e.g., $ O((n + m)^2) $ for two sources with lengths $ n $ and $ m $). To improve both modeling effectiveness and efficiency, we further introduce the Causal Action-aware Multi-channel Attention (CamA) mechanism, which integrates a multi-channel design, conceptually analogous to multi-head attention but with channel-specific visibility constraints. We therefore model each channel with an independent causal self-attention stack, and fuse the channel-wise representations via a lightweight gated mixing module. Let $ \mathcal{C} = \{1, \ldots, C\} $ denote the channel set. For each user, channel $ c $ provides a token sequence $ \mathbf{X}^{(c)} \in \mathbb{R}^{T_c \times d} $, and we append the shared target token $ \mathbf{x}^{\mathrm{tar}} \in \mathbb{R}^d $:
-
-
-Figure 5. Action-aware relative attention bias (RAB) with efficient computation. Left: a causal mask with dual sliding windows, which limits each query to attend only to recent past tokens visible within the sliding window. Right: the action-aware relative encoding pipeline: relative time, position, and action signals are bucketized (as needed), embedded, summed, and injected into the attention logits.
-
- $$ \mathbf{S}^{(c)}=[\mathbf{X}^{(c)};\mathbf{x}^{\mathrm{tar}}]\in\mathbb{R}^{(T_{c}+1)\times d},\qquad t^{\star}=T_{c}+1. $$
-
-Each channel is equipped with its own causal visibility mask $ \mathbf{M}^{(c)} $, and is encoded independently:
-
- $$ \begin{aligned}\mathbf{H}^{(c,\ell+1)}&=\mathrm{Layer}_{\ell}^{(c)}\Big(\tilde{\mathbf{H}}^{(c,\ell)};\mathbf{M}^{(c)}\Big),\\\mathbf{H}^{(c,0)}&=\mathbf{S}^{(c)},\quad c\in\mathcal{C}.\end{aligned} $$
-
-Target-token gated mixing. To enable cross-channel information sharing while keeping computation lightweight, we perform mixing only on the target position $ t^* $ at each layer. The mixed representation $ \tilde{\mathbf{h}}^{(c,\ell)} $ is obtained by first computing channel-wise gating weights $ \beta^{(c,\ell)} $ and then aggregating information from all other channels:
-
- $$ \tilde{\mathbf{h}}^{(c,\ell)}=\mathbf{h}^{(c,\ell)}+\sum_{i\in\mathcal{C}\backslash\{c\}}\beta^{(i,\ell)}\odot\mathbf{h}^{(i,\ell)}. $$
-
-This updated representation replaces $ \mathbf{h}^{(c,\ell)} $ at position $ t^{*} $, forming the updated channel representation $ \tilde{\mathbf{H}}^{(c,\ell)} $ used in (6). Finally, the concatenated last-layer target representations from all channels are used for CTR prediction.
-
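The mixing rule in Eq. 7 can be illustrated on toy vectors. The sketch below takes the gating weights as given inputs, because the text does not specify how $\beta^{(c,\ell)}$ is computed; the function name `mix_target_tokens` and all numeric values are assumptions.

```python
def mix_target_tokens(h, beta):
    """Apply Eq. 7 at the target position for every channel.

    h[c] is the target-position vector of channel c at the current layer,
    beta[c] is the gating vector of channel c (same length as h[c]).
    Returns, for each channel c, h[c] plus the gated sum over all other channels.
    """
    num_channels, dim = len(h), len(h[0])
    mixed = []
    for c in range(num_channels):
        out = list(h[c])
        for i in range(num_channels):
            if i == c:
                continue  # a channel does not gate itself in Eq. 7
            for m in range(dim):
                out[m] += beta[i][m] * h[i][m]  # element-wise gate (the Hadamard product)
        mixed.append(out)
    return mixed


# Hypothetical two-channel example with 2-dimensional target vectors.
h = [[1.0, 2.0], [3.0, 4.0]]
beta = [[0.5, 0.5], [0.25, 0.25]]

mixed = mix_target_tokens(h, beta)
print(mixed)
```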
-
-
-#### 3.4. Sequence Then Sparse Training
-
-While sequence packing (Section 3.3.1) significantly enhances computational efficiency, it introduces a critical challenge: distribution skew. Since samples within a packed mini-batch belong to the same user, the high intra-user correlation leads to redundant updates for specific sparse IDs, causing the model to overfit to specific user-ad interactions, rather than learning generalizable patterns. To mitigate this, we propose the Sequence Then Sparse (STS) training paradigm (detailed discussions in Appendix D), a two-stage decoupled optimization strategy that balances long-range sequential modeling with robust sparse feature learning.
-
-##### 3.4.1. STAGE I: SEQUENCE MODELING (SEQUENCE PHASE)
-
-The first stage focuses on capturing the evolution of user interests and temporal dependencies. We perform end-to-end autoregressive-like learning on the packed user sequences Z, which include candidate tokens and their historical trajectories. In this phase, we optimize the dense tokenizer and the causal Transformer, while keeping the Sparse Embedding Table $ \Phi $ frozen. By freezing $ \Phi $, we stabilize the token space, forcing the Transformer to focus exclusively on the relational dynamics between events rather than over-memorizing specific ID features.
-
-##### 3.4.2. STAGE II: SPARSE FEATURE LEARNING (SPARSE PHASE)
-
-The second stage is designed to refine the discrete representations, particularly for long-tail IDs. In this phase, we revert to a non-sequential format, treating each sample as an independent user-ad exposure to break the distribution skew. This stage optimizes the sparse embeddings $ \Phi $, correcting the redundant gradient accumulation amplified by sequence packing, and ensures that the basic feature representations remain accurate and unbiased across the entire traffic distribution.
-
-### 4. System Deployment
-
-GRAB has been successfully deployed in a large-scale feed ad ranking system, handling billions of daily requests. Unlike conventional memory-bound DLRMs, GR is markedly compute-bound due to the quadratic complexity of Transformer self-attention. To satisfy stringent latency requirements, we implemented a co-designed hardware-software architecture. Due to space constraints, we provide the comprehensive system overview (Fig. 8) and detailed deployment optimizations in Appendix E.
-
-### 5. Experiment
-
-#### 5.1. Overall Performance Comparison
-
-We first compare GRAB against state-of-the-art recommendation models on a real-world industrial dataset from Baidu. The training data, derived from Baidu's recommendation advertising scenario, contains billions of users, exposure logs, and click logs. The test set includes millions of users, billions of exposure logs, and millions of click logs. The baselines encompass both DLRMs and GR models, including: DIN (Zhou et al., 2018), which models short-term user behavior with target attention; SIM(Soft) (Pi et al., 2020), a sequential model that uses soft-search to encode user interests; TWIN (Si et al., 2024), which extends multi-head target attention from ESU to GSU; HSTU (Zhai et al., 2024), an efficient model for long-sequence behavior modeling; and LONGER (Chai et al., 2025), a Transformer-based architecture designed for ultra-long behavior sequences. Experimental results are presented in Table 1: GRAB outperforms all baselines, achieving a 0.19% relative AUC improvement over the most competitive one (LONGER). Meanwhile, Fig. 6a illustrates the performance of different models across varying lengths of user behavior sequences. GRAB surpasses the other recommendation models at all sequence lengths, and its gains become more pronounced as the sequence length increases.
-
-Table 1. Overall performance in industrial settings
-
-
-
-| Model | AUC |
-| --- | --- |
-| DIN | 0.83309 |
-| SIM(Soft) | 0.83520 |
-| TWIN | 0.83556 |
-| HSTU | 0.83590 |
-| LONGER | 0.83615 |
-| GRAB-small | 0.83661 |
-| GRAB-standard | 0.83772 |
-
-#### 5.2. Scaling Analysis
-
-We evaluate model performance across different capacity scales by independently scaling the number of Transformer blocks ($ n_{layer} $), the number of attention heads ($ n_{head} $), and the feature dimension of the model ($ d_{model} $), as listed in Table 2. Fig. 6b presents the test-set performance of GRAB under these configurations. The results show that increasing model capacity improves performance. We also find that as model capacity increases, the improvement on longer user behavior sequences becomes more significant. Moreover, no significant saturation trend is observed within the current range of configurations, which supports the scalability of GRAB.
-
-Table 2. Comparison of models with different settings
-
-
-
-| Model | Params | Setting |
-| --- | --- | --- |
-| GRAB $ _{2l-2h-64d} $ | 6.51M | $ n_{layer}=2 $, $ n_{head}=2 $, $ d_{model}=64 $ |
-| GRAB $ _{4l-2h-64d} $ | 6.67M | $ n_{layer}=4 $, $ n_{head}=2 $, $ d_{model}=64 $ |
-| GRAB $ _{6l-2h-64d} $ | 6.83M | $ n_{layer}=6 $, $ n_{head}=2 $, $ d_{model}=64 $ |
-| GRAB $ _{2l-4h-64d} $ | 6.48M | $ n_{layer}=2 $, $ n_{head}=4 $, $ d_{model}=64 $ |
-| GRAB $ _{4l-4h-64d} $ | 6.63M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=64 $ |
-| GRAB $ _{4l-4h-128d} $ | 7.05M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=128 $ |
-| GRAB $ _{4l-4h-256d} $ | 8.13M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=256 $ |
-| GRAB $ _{4l-4h-512d} $ | 11.27M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=512 $ |
-
-#### 5.3. Ablation Study
-
-Heterogeneous Tokens. We conduct ablation studies on heterogeneous representations with three configurations: GRAB with heterogeneous, only partial, or only full tokens (Table 3). Results show that heterogeneous representations achieve the best performance. Using only partial tokens leads to significant degradation, confirming that full feature representations are more beneficial for target scoring. Notably, using only full tokens also degrades performance, suggesting that artificially designed statistical features can introduce confusion and impair sequence modeling.
-
-Table 3. Ablation studies of GRAB
-
-
-
-| Model | AUC |
-| --- | --- |
-| GRAB | 0.83772 |
-| GRAB w/ Partial Token | 0.83492 |
-| GRAB w/ Full Token | 0.83749 |
-| GRAB w/o relative pos | 0.83768 |
-| GRAB w/o relative time | 0.83743 |
-| GRAB w/o relative action | 0.83724 |
-| GRAB w/o Multi-channel | 0.83743 |
-| GRAB w/o Target-token mix | 0.83768 |
-| GRAB_sparse | 0.83614 |
-| GRAB_sparse w/o STS | 0.83549 |
-
-Action-aware Attention. We ablate three components of GRAB's Action-aware Attention: relative position, time, and action. The results (Table 3) show that removing any of these components degrades performance. The decline is more pronounced for time and action than for position, indicating that historical sequences are more sensitive to behavioral and temporal signals. We also analyze the attention weight distribution across buckets defined by relative position/time differences (smaller values denote more recent tokens). As shown in Fig. 7, weights decrease as bucket values increase, confirming that more recent behaviors better reflect user interest and receive higher weights. For relative action, we compare positive (click) and negative (non-click) labels. The weight distribution is highly skewed: positive labels account for 88% of the total weight, versus only 12% for negative labels. This suggests that incorporating more positive feedback could further improve sequence modeling.
-
-
-Figure 6. (a) Overall performance: DLRMs vs. GRs across different user behavior sequence lengths, where a +0.1% improvement in AUC indicates a significant enhancement. (b) Scaling performance: comparison of GRAB variants at different parameter scales.
-
-
-
-
-Figure 7. The weight distribution of action-aware attention in relative position and relative time.
-
-
-Multi-channel Attention. To verify the effectiveness of multi-channel attention in sequence modeling, we evaluate two settings: (1) GRAB without multi-channel attention, that is, using a single channel for sequence modeling; and (2) GRAB with multi-channel attention retained but the target-token mix component removed. As shown in Table 3, both configurations degrade performance to different degrees, indicating that each component contributes. Multi-channel attention has the larger effect, and adding the target-token mix component improves performance further.
-
-STS Training. We evaluate the STS paradigm by comparing GRAB's second-stage training with and without sequence modeling for sparse feature learning. With STS, sparse embeddings are updated through sequence modeling on packed user behavior sequences; without STS, the same batch data is treated as independent exposures. Results (as shown in Table 3) show that STS brings significant accuracy gains in sparse feature learning, confirming the efficacy of the two-stage training. This demonstrates that STS alleviates the distribution skew and overfitting caused by direct sequence-packed training.
-
-
-
-#### 5.4. Online A/B Test
-
-To assess the online performance of GRAB, we deployed it in the Baidu home feed scenario and compared it with the current online DLRM model. The experiment used 10% of the main traffic and remained online for about a month. Online evaluation shows that GRAB delivered a 3.49% improvement in CTR and a 3.05% improvement in CPM, which indicates that GRAB achieves more accurate advertising estimation and brings considerable revenue gains. Notably, GRAB has already been fully deployed at Baidu, with online inference cost on par with the previous online DLRM model.
-
-### 6. Conclusion
-
-We propose GRAB, an end-to-end generative ranking framework that integrates a novel CamA mechanism to effectively capture temporal dynamics and specific action signals within user behavior sequences. On Baidu's billion-scale industrial dataset, GRAB establishes a new state-of-the-art, outperforming DLRM and other GR baselines. Ablation studies validate the necessity of its key components, and our proposed STS training paradigm effectively mitigates distribution skew. Scaling analysis indicates continued gains from larger models and longer sequences. Finally, online A/B testing in Baidu home feed ads shows that GRAB boosts CTR by 3.49% and CPM by 3.05%, leading to full production deployment. Further discussion of this work can be found in Appendix F.
-
-## References
-
-Agarwal, S., Yan, C., Zhang, Z., and Venkataraman, S. Bagpipe: Accelerating deep recommendation model training. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 348–363, 2023.
-
-Bai, J., Geng, X., Deng, J., Xia, Z., Jiang, H., Yan, G., and Liang, J. A comprehensive survey on advertising click-through rate prediction algorithm. The Knowledge Engineering Review, 40:e3, 2025.
-
-Bao, K., Zhang, J., Zhang, Y., Wang, W., Feng, F., and He, X. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM conference on recommender systems, pp. 1007–1014, 2023.
-
-Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1387–1395, 2017.
-
-Cao, Y., Mehta, N., Yi, X., Keshavan, R. H., Heldt, L., Hong, L., Chi, E., and Sathiamoorthy, M. Aligning large language models with recommendation knowledge. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 1051–1066, 2024.
-
-Chai, Z., Ren, Q., Xiao, X., Yang, H., Han, B., Zhang, S., Chen, D., Lu, H., Zhao, W., Yu, L., et al. Longer: Scaling up long sequence modeling in industrial recommenders. In Proceedings of the Nineteenth ACM Conference on Recommender Systems, pp. 247–256, 2025.
-
-Chen, J., Chi, L., Peng, B., and Yuan, Z. Hllm: Enhancing sequential recommendations via hierarchical large language models for item and user modeling. arXiv preprint arXiv:2409.12740, 2024.
-
-Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10, 2016.
-
-Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198, 2016.
-
-Di Palma, D., Biancofiore, G. M., Anelli, V. W., Narducci, F., Di Noia, T., and Di Sciascio, E. Evaluating chatgpt as a recommender system: A rigorous approach. arXiv preprint arXiv:2309.03613, 2023.
-
-Doan, T. T., Nguyen, L. M., Pham, N. H., and Romberg, J. Finite-time analysis of stochastic gradient descent under markov randomness. arXiv preprint arXiv:2003.10973, 2020.
-
-Geng, B., Huan, Z., Zhang, X., He, Y., Zhang, L., Yuan, F., Zhou, J., and Mo, L. Breaking the length barrier: Llm-enhanced ctr prediction in long textual user behaviors. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2311–2315, 2024.
-
-Golovneva, O., Wang, T., Weston, J., and Sukhbaatar, S. Contextual position encoding: Learning to count what's important. arXiv preprint arXiv:2405.18719, 2024.
-
-Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
-
-Han, R., Yin, B., Chen, S., Jiang, H., Jiang, F., Li, X., Ma, C., Huang, M., Li, X., Jing, C., et al. Mtgr: Industrial-scale generative recommendation framework in meituan. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 5731–5738, 2025.
-
-He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the eighth international workshop on data mining for online advertising, pp. 1–9, 2014.
-
-He, Z., Xie, Z., Jha, R., Steck, H., Liang, D., Feng, Y., Majumder, B. P., Kallus, N., and McAuley, J. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management, pp. 720–730, 2023.
-
-Hou, Y., Mu, S., Zhao, W. X., Li, Y., Ding, B., and Wen, J.-R. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp. 585–593, 2022.
-
-Hou, Y., He, Z., McAuley, J., and Zhao, W. X. Learning vector-quantized item representation for transferable sequential recommenders. In Proceedings of the ACM Web Conference 2023, pp. 1162–1171, 2023.
-
-Jia, J., Wang, Y., Li, Y., Chen, H., Bai, X., Liu, Z., Liang, J., Chen, Q., Li, H., Jiang, P., et al. Learn: Knowledge adaptation from large language model to recommendation for practical industrial application. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 11861–11869, 2025.Kang, W.-C. and McAuley, J. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM), pp. 197–206. IEEE, 2018.
-
-Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
-
-Krell, M. M., Kosec, M., Perez, S. P., and Fitzgibbon, A. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv preprint arXiv:2107.02027, 2021.
-
-Li, L., Zhang, Y., Liu, D., and Chen, L. Large language models for generative recommendation: A survey and visionary discussions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 10146–10159, 2024a.
-
-Li, R., Deng, W., Cheng, Y., Yuan, Z., Zhang, J., and Yuan, F. Exploring the upper limits of text-based collaborative filtering using large language models: Discoveries and insights. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 1643–1653, 2025.
-
-Li, S., Guo, H., Tang, X., Tang, R., Hou, L., Li, R., and Zhang, R. Embedding compression in recommender systems: A survey. ACM Computing Surveys, 56(5):1–21, 2024b.
-
-Lin, J., Dai, X., Xi, Y., Liu, W., Chen, B., Zhang, H., Liu, Y., Wu, C., Li, X., Zhu, C., et al. How can recommender systems benefit from large language models: A survey. ACM Transactions on Information Systems, 43(2):1–47, 2025.
-
-Lin, Z., Ding, H., Hoang, N. T., Kveton, B., Deoras, A., and Wang, H. Pre-trained recommender systems: A causal debiasing perspective. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 424–433, 2024.
-
-Liu, J., Liu, C., Zhou, P., Lv, R., Zhou, K., and Zhang, Y. Is chatgpt a good recommender? a preliminary study. arXiv preprint arXiv:2304.10149, 2023.
-
-Luo, S., He, B., Zhao, H., Shao, W., Qi, Y., Huang, Y., Zhou, A., Yao, Y., Li, Z., Xiao, Y., et al. Recranker: Instruction tuning large language model as ranker for top-k recommendation. ACM Transactions on Information Systems, 43(5):1–31, 2025.
-
-Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., and Chi, E. H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th
-
-ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1930–1939, 2018a.
-
-Ma, X., Zhao, L., Huang, G., Wang, Z., Hu, Z., Zhu, X., and Gai, K. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1137–1140, 2018b.
-
-Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J., et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 993–1011, 2022.
-
-Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
-
-Ning, L., Liu, L., Wu, J., Wu, N., Berlowitz, D., Prakash, S., Green, B., O'Banion, S., and Xie, J. User-llm: Efficient llm contextualization with user embeddings. In Companion Proceedings of the ACM on Web Conference 2025, pp. 1219–1223, 2025.
-
-Petrov, A. V. and Macdonald, C. Generative sequential recommendation with gptrec. arXiv preprint arXiv:2306.11114, 2023.
-
-Pi, Q., Bian, W., Zhou, G., Zhu, X., and Gai, K. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2671–2679, 2019.
-
-Pi, Q., Zhou, G., Zhang, Y., Wang, Z., Ren, L., Fan, Y., Zhu, X., and Gai, K. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2685–2692, 2020.
-
-Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., and Whang, S. Data validation for machine learning. Proceedings of machine learning and systems, 1:334–347, 2019.
-
-Rajput, S., Mehta, N., Singh, A., Hulikal Keshavan, R., Vu, T., Heldt, L., Hong, L., Tay, Y., Tran, V., Samost, J., et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36: 10299–10315, 2023.Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden technical debt in machine learning systems. Advances in neural information processing systems, 28, 2015.
+The traditional DLRM architecture, as shown in Fig. 1, follows a modular processing pipeline for CTR prediction, handling raw features from users, candidate ads, and contextual signals. The pipeline involves: (a) expanding categorical features into fixed fields via feature engineering, (b) mapping these fields through hashing to obtain discrete ID vectors for embedding lookup in a Sparse Parameter Server Table (PSTable), and (c) concatenating and normalizing the retrieved embeddings to form a fixed-length flattened vector. This unified representation is then fed into an MLP, typically enhanced with a gating network, to model high-order
+Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden technical debt in machine learning systems. Advances in neural information processing systems, 28, 2015.
Sheng, X.-R., Gao, J., Cheng, Y., Yang, S., Han, S., Deng, H., Jiang, Y., Xu, J., and Zheng, B. Joint optimization of ranking and calibration with contextualized hybrid model. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4813–4822, 2023.
@@ -373,7 +92,9 @@ Zhao, Z., Fan, W., Li, J., Liu, Y., Mei, X., Wang, Y., Wen, Z., Wang, F., Zhao,
Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1059–1068, 2018.
-Zhou, G., Mou, N., Fan, Y., Pi, Q., Bian, W., Zhou, C., Zhu, X., and Gai, K. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 5941–5948, 2019.Zhu, Y., Wu, L., Guo, Q., Hong, L., and Li, J. Collaborative large language model for recommender systems. In Proceedings of the ACM Web Conference 2024, pp. 3162–3172, 2024.### A. Extended Background
+Zhou, G., Mou, N., Fan, Y., Pi, Q., Bian, W., Zhou, C., Zhu, X., and Gai, K. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 5941–5948, 2019.
+Zhu, Y., Wu, L., Guo, Q., Hong, L., and Li, J. Collaborative large language model for recommender systems. In Proceedings of the ACM Web Conference 2024, pp. 3162–3172, 2024.
+### A. Extended Background
#### A.1. The Performance-Efficiency Trade-off in Industrial CTR Prediction
@@ -387,7 +108,8 @@ LLMs have recently emerged as a promising direction for recommendation systems,
LLM as Recommender. This category explores the direct application of LLM capabilities—such as memory, reasoning, and zero-shot generalization—to core recommendation tasks including retrieval and ranking (Wu et al., 2024; Lin et al., 2025; Xu et al., 2025). Methods in this domain typically adapt recommendation data into natural language prompts, leveraging techniques like Instruction Tuning to align the LLM with recommendation objectives (Zhu et al., 2024; Zhang et al., 2025; Bao et al., 2023; Luo et al., 2025). While these methods demonstrate promise in explainability and conversational recommendation, their performance on traditional metrics (e.g., CTR) often falls short of specialized ID-based models (Liu et al., 2023; Di Palma et al., 2023; Cao et al., 2024). In recommendation scenarios, user behavior is heavily influenced by implicit feedback and specific context rather than the explicit semantic logic found in natural language; consequently, general world-knowledge reasoning does not necessarily translate effectively to modeling complex user-item interaction patterns (Bao et al., 2023; Cao et al., 2024; Zhu et al., 2024). Furthermore, the inference latency remains a significant bottleneck for real-time industrial deployment (Xu et al., 2025).
-LLM for Representation. In this paradigm, LLMs function as sophisticated feature encoders (Lin et al., 2025; Wu et al., 2024). Instead of performing the ranking directly, the intermediate layers or final output embeddings of the LLM are extracted and utilized as semantic features to augment the input of traditional recommendation models (Sun et al., 2024; Jia et al., 2025; Geng et al., 2024; Chen et al., 2024; Ning et al., 2025). This approach aims to enhance the model's semantic understanding without bearing the full cost of LLM inference during the serving phase. LLM-derived representations significantly mitigate the limitations of discrete feature models, particularly regarding the generalization capability forlong-tail items and cold-start users/ads (Hou et al., 2022; 2023). However, this methodology faces notable limitations. There is typically a limited gain on warm items, as the strong collaborative filtering signals derived from abundant historical interactions often outweigh the semantic benefits provided by the LLM (Hou et al., 2023; Lin et al., 2024). Furthermore, employing large-scale models for representation learning introduces a high inference cost, which creates substantial latency and resource bottlenecks during both the offline feature extraction and online serving phases (Lin et al., 2025).
+LLM for Representation. In this paradigm, LLMs function as sophisticated feature encoders (Lin et al., 2025; Wu et al., 2024). Instead of performing the ranking directly, the intermediate layers or final output embeddings of the LLM are extracted and utilized as semantic features to augment the input of traditional recommendation models (Sun et al., 2024; Jia et al., 2025; Geng et al., 2024; Chen et al., 2024; Ning et al., 2025). This approach aims to enhance the model's semantic understanding without bearing the full cost of LLM inference during the serving phase. LLM-derived representations significantly mitigate the limitations of discrete feature models, particularly regarding the generalization capability for
+long-tail items and cold-start users/ads (Hou et al., 2022; 2023). However, this methodology faces notable limitations. There is typically a limited gain on warm items, as the strong collaborative filtering signals derived from abundant historical interactions often outweigh the semantic benefits provided by the LLM (Hou et al., 2023; Lin et al., 2024). Furthermore, employing large-scale models for representation learning introduces a high inference cost, which creates substantial latency and resource bottlenecks during both the offline feature extraction and online serving phases (Lin et al., 2025).
Generative Sequential Modeling. This category represents a structural adaptation rather than a direct semantic application. It borrows the architectural innovations underlying LLMs—specifically the Transformer architecture, Causal Masking, and Long-context modeling capabilities—to reconstruct recommendation systems (Vaswani et al., 2017; Kang & McAuley, 2018; Sun et al., 2019). These models (such as GR models) treat user history as a sequence and the next item prediction as a generative task, similar to next token prediction (Kang & McAuley, 2018; Petrov & Macdonald, 2023; Han et al., 2025). By employing generative sequential modeling techniques and combining them with discrete features that precisely characterize user historical behavior, these models have shown significant potential (Han et al., 2025). A key observation in this domain is the emergence of “scaling laws” within recommendation systems, where model performance metrics improve predictably as the sequence length and model capacity increase (Shin et al., 2023; Zhang et al., 2024b), mirroring the trajectory seen in NLP.
@@ -411,7 +133,8 @@ This structural mismatch leads to a performance bottleneck: the model conflates
While GR models have successfully introduced the scaling laws of LLMs into recommendation systems, their direct application to industrial CTR prediction faces distinct structural and optimization challenges.
-Mitigation of Distribution Skew in Sequence Packing. To improve training efficiency with variable-length user sequences, standard GR models often employ sequence packing techniques borrowed from NLP (e.g., concatenating multiple short sequences). However, unlike NLP where samples are generally Independent and Identically Distributed (I.I.D.), packing inrecommendation systems groups multiple interactions from the same user into a single training instance to maintain context. This creates a severe distribution skew, where a mini-batch is dominated by highly correlated samples from a few users. This correlation causes the model—especially the sparse embedding parameters—to overfit specific user identities rather than learning generalizable interaction patterns.
+Mitigation of Distribution Skew in Sequence Packing. To improve training efficiency with variable-length user sequences, standard GR models often employ sequence packing techniques borrowed from NLP (e.g., concatenating multiple short sequences). However, unlike NLP where samples are generally Independent and Identically Distributed (I.I.D.), packing in
+recommendation systems groups multiple interactions from the same user into a single training instance to maintain context. This creates a severe distribution skew, where a mini-batch is dominated by highly correlated samples from a few users. This correlation causes the model—especially the sparse embedding parameters—to overfit specific user identities rather than learning generalizable interaction patterns.
Action heterogeneity. Existing GR models often treat user history as a homogeneous token stream, neglecting the inherent heterogeneity of recommendation data. This reliance on naive serialization discards critical action semantics—distinguishing what was shown from how the user responded—thereby diluting supervision signals and limiting performance in complex industrial scenarios (as discussed in Appendix A.3).
@@ -441,7 +164,8 @@ where $ T_u $ denotes the length of user $ u $'s behavior history, $ N_u $ de
Following the discrete feature engineering standard of DLRMs, we apply a structured expansion function $ \Phi $ to transform each event into a fixed multi-field representation. Subsequently, each field value is mapped to a discrete ID via a sparse PSTable $ \Pi $. The event-level representations of the raw behavior sequence and the candidate ad sequence can be obtained as:
- $$ \begin{aligned}\mathbf{x}_{t}^{beh}&=\Pi\big(\Phi\big(S_{t}^{beh}\big)\big),\quad&t&=1,\ldots,T_{u}\\\mathbf{x}_{i}^{ad}&=\Pi\big(\Phi\big(S_{i}^{ad}\big)\big),\quad&i&=1,\ldots,N_{u}\end{aligned} $$ #### C.2. Dense Tokenizer
+ $$ \begin{aligned}\mathbf{x}_{t}^{beh}&=\Pi\big(\Phi\big(S_{t}^{beh}\big)\big),\quad&t&=1,\ldots,T_{u}\\\mathbf{x}_{i}^{ad}&=\Pi\big(\Phi\big(S_{i}^{ad}\big)\big),\quad&i&=1,\ldots,N_{u}\end{aligned} $$
+#### C.2. Dense Tokenizer
Unlike the DLRM approach, which concatenates all field embeddings into a fixed-length flattened vector, GRAB preserves the event structure by aggregating field embeddings within each event into a single event token. This yields a time-ordered token sequence that is fed into a Transformer to capture long-range behavioral dependencies and interest drift. Given the structured discrete ID sequences from Section C.1, GRAB converts each event into a dense token for Transformer-based sequential modeling. Specifically, each event is first transformed into a vector through a field-wise embedding lookup followed by a multi-field fusion process, as follows:
@@ -477,7 +201,8 @@ In sequence packing, we form a packed mini-batch $ \mathcal{B}_{\text{pack}} $
This issue is most damaging for the sparse embedding table $ \Phi $. Since a packed batch repeatedly contains the same user features (e.g., user_id=123 appears L times along the packed sequence), the update for that single embedding vector is amplified by repeated contributions:
- $$ \Delta\Phi_{u}\propto\sum_{t=1}^{L}\nabla\mathcal{L}_{t}. $$ Such oversized, user-specific updates encourage $ \Phi $ to memorize individual trajectories rather than learn generalizable interaction patterns. Meanwhile, the dense sequence model (e.g., Transformer) suffers from batch-to-batch distribution skew: consecutive packed batches may be dominated by different users (User A $ \rightarrow $ User B), causing abrupt shifts in inputs and gradients, which hinders stable convergence of sequential reasoning parameters.
+ $$ \Delta\Phi_{u}\propto\sum_{t=1}^{L}\nabla\mathcal{L}_{t}. $$
+Such oversized, user-specific updates encourage $ \Phi $ to memorize individual trajectories rather than learn generalizable interaction patterns. Meanwhile, the dense sequence model (e.g., Transformer) suffers from batch-to-batch distribution skew: consecutive packed batches may be dominated by different users (User A $ \rightarrow $ User B), causing abrupt shifts in inputs and gradients, which hinders stable convergence of sequential reasoning parameters.
#### D.2. Formalization of STS Stages
@@ -541,7 +266,8 @@ As illustrated in Fig. 8, the proposed system is implemented within a comprehens
##### E.1.1. ONLINE SERVING
-The online serving component processes user interactions in real-time. The workflow initiates with a Page View (PV) Request, which sequentially passes through matching and ranking phases to select appropriate advertisements.The core of the ranking mechanism involves a CTR Prediction module that relies on two primary inputs:
+The online serving component processes user interactions in real-time. The workflow initiates with a Page View (PV) Request, which sequentially passes through matching and ranking phases to select appropriate advertisements.
+The core of the ranking mechanism involves a CTR Prediction module that relies on two primary inputs:
• User Representation: A user model processes historical tokens and maintains a KV-cache to efficiently manage state.
@@ -581,7 +307,8 @@ Hierarchical Parameter Server (PaddleBox). To handle terabyte-scale embedding ta
• L2 (CPU DRAM): Buffers warm parameters.
-• L3 (SSD): Utilizes NVMe SSDs for massive long-tail feature embeddings. An intelligent prefetching engine asynchronously moves parameters between tiers, masking SSD I/O latency.
+• L3 (SSD): Utilizes NVMe SSDs for massive long-tail feature embeddings. An intelligent prefetching engine asynchronously moves parameters between tiers, masking SSD I/O latency.
+
Figure 8. Overview of an online advertising CTR system with an online-offline closed loop. Online services handle PV requests via matching and ranking, and feed the CTR predictor with user-side historical tokens (maintained by a user model with KV-cache) and candidate ad tokens; user interactions are continuously logged as an impression/action logs. Offline services collect these logs, apply sparse feature engineering, group training samples by user ID, and perform offline training; updated models are released (e.g., hourly) back to online.
@@ -599,7 +326,8 @@ Operator Fusion and Mixed Precision. To maximize throughput on GPUs, we employed
#### E.5. Data Consistency
-A major challenge in GRs is the Freshness Gap (Train-Serve Skew). We addressed this by implementing a streaming data pipeline based on Flink & TableStore. We utilized a Global Strictly Incremental ID mechanism to ensure strict ordering of user actions across distributed nodes. This allows the inference engine to fetch the exact state of the user corresponding to the training checkpoint, reducing data synchronization delay from minutes to seconds and ensuring the model always predicts based on the most consistent context.### F. Discussion
+A major challenge in GRs is the Freshness Gap (Train-Serve Skew). We addressed this by implementing a streaming data pipeline based on Flink & TableStore. We utilized a Global Strictly Incremental ID mechanism to ensure strict ordering of user actions across distributed nodes. This allows the inference engine to fetch the exact state of the user corresponding to the training checkpoint, reducing data synchronization delay from minutes to seconds and ensuring the model always predicts based on the most consistent context.
+### F. Discussion
#### F.1. Limitations and Challenges
@@ -615,4 +343,295 @@ Towards Multimodal Generative Ranking. Currently, GRAB operates on discretized I
Unified Generative Representation Across Domains. Finally, the “pre-training & fine-tuning” paradigm common in NLP has yet to be fully realized in industrial recommendation. We envision extending GRAB to learn a Universal User Representation by pre-training on diverse behavior logs across multiple business scenarios (e.g., Home Feed, Search, and Short Video). A unified GRAB model could transfer learned sequential patterns from data-rich domains to cold-start scenarios, effectively solving the “data silo” problem prevalent in large-scale platforms.
-Foundation for Agent-based Recommender Systems. GRAB's ability to model the transition probabilities of user states $ (s_{t} \rightarrow s_{t+1}) $ positions it as a powerful “World Model” or User Simulator for future agent-based recommendation systems. By accurately predicting not just the next click, but the evolution of user interests over time, GRAB can serve as the environment model for Reinforcement Learning (RL) agents. This would allow the system to move beyond myopic CTR optimization toward maximizing Long-Term Value (LTV) or user satisfaction by simulating how current recommendations influence future user trajectories.
\ No newline at end of file
+Foundation for Agent-based Recommender Systems. GRAB's ability to model the transition probabilities of user states $ (s_{t} \rightarrow s_{t+1}) $ positions it as a powerful “World Model” or User Simulator for future agent-based recommendation systems. By accurately predicting not just the next click, but the evolution of user interests over time, GRAB can serve as the environment model for Reinforcement Learning (RL) agents. This would allow the system to move beyond myopic CTR optimization toward maximizing Long-Term Value (LTV) or user satisfaction by simulating how current recommendations influence future user trajectories.
+feature interactions and generate the final CTR prediction.
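
The three DLRM steps above (field expansion, hashing to IDs for embedding lookup, and concatenation into a fixed-length vector) can be illustrated with a minimal sketch. This is a toy illustration, not the production PSTable: the table size, embedding width, CRC32 hash, field names, and lazily initialized dictionary are all assumptions made for the example, and the normalization step and the downstream MLP are omitted.

```python
import random
import zlib

TABLE_SIZE = 1000  # hypothetical number of hash buckets
EMB_DIM = 4        # hypothetical embedding width per field


def field_to_id(field, value):
    """Hash a (field, value) pair to a discrete bucket ID."""
    return zlib.crc32(f"{field}={value}".encode("utf-8")) % TABLE_SIZE


def lookup(table, idx):
    """Fetch the embedding for idx, creating it on first use (toy sparse table)."""
    if idx not in table:
        rng = random.Random(idx)  # seeded so the toy table is reproducible
        table[idx] = [rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
    return table[idx]


def flatten_features(table, fields):
    """Concatenate per-field embeddings, in a fixed field order, into one flat vector."""
    flat = []
    for name in sorted(fields):
        flat.extend(lookup(table, field_to_id(name, fields[name])))
    return flat


table = {}
vec = flatten_features(table, {"user_id": "123", "ad_id": "9", "hour": "20"})
```

The output length depends only on the number of fields, not on the length of the user's history, which is the order-agnostic, fixed-length property that the dense tokenizer in Section 3.2 is designed to avoid.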
+
+
+
+
+Figure 1. The traditional DLRM architecture: sparse features are hashed to IDs and embedded via PSTable, and then concatenated into a fixed-length flattened vector for CTR prediction.
+
+
+#### 3.2. Overall Architecture of GRAB
+
+GRAB, whose overall architecture is shown in Fig. 2, models user behavior history sequences end to end for tasks such as CTR prediction. It follows a three-stage pipeline: (i) a sparse feature layer; (ii) a dense tokenizer; and (iii) a sequence modeling layer. Given raw behavior logs, GRAB first converts heterogeneous categorical signals into sparse IDs at the event level, then tokenizes each event into a dense representation, and finally applies a sequence model to estimate the click probability of candidate ads. The dense representation produced by the dense tokenizer bridges DLRM-style sparse feature engineering and GR-style sequential modeling, so that training and inference follow a single, unified computation path from input to output, which in turn improves CTR prediction performance.
+
+Sparse Feature Layer. The sparse feature layer (details in Appendix C.1) processes raw logs into time-ordered event sequences. Each event's categorical fields are converted into sparse IDs using standard DLRM feature engineering (Section 3.1), yielding a structured sequence of events annotated with field-wise IDs.
+
+Dense Tokenizer. Unlike DLRM, which collapses field embeddings into a fixed-length, order-agnostic vector for pointwise processing, GRAB preserves the temporal event structure. It aggregates per-event field embeddings and projects them into $ \mathbb{R}^{d_{model}} $ to form sequential event tokens (Appendix C.2), resulting in a time-ordered token sequence. This sequence serves as the input to a subsequent Transformer, thereby enabling the modeling of long-range dependencies and interest drift.
+
+Autoregressive-like Sequence Modeling Layer. This layer builds on sequence packing (Section 3.3.1), heterogeneous tokens (Section 3.3.2), and action-aware relative attention bias (Section 3.3.3); its core contribution is the CamA mechanism (Section 3.3.4). CamA integrates a multi-channel design for parallel processing of diverse behaviors and inherits action-aware contextualization from RAB, providing a unified and efficient framework for modeling complex user interest patterns across scenarios.
+
+
+
+#### 3.3. Autoregressive-like Sequence Modeling Layer
+
+Following the dense tokenizer, this layer captures the temporal dependencies and dynamic evolution of user interests. It takes as input the sequence of dense event tokens generated by the preceding layer (as described in Appendix C.3). Formally, for a user $ u $, the input sequence consists of the behavior history $ \mathbf{E}^{\mathrm{beh}} = \{\mathbf{e}_t^{\mathrm{beh}}\}_{t=1}^{T_u} $ and the candidate advertisements $ \mathbf{E}^{\mathrm{ad}} = \{\mathbf{e}_i^{\mathrm{ad}}\}_{i=1}^{N_u} $, where $ \mathbf{e}_t^{\mathrm{beh}}, \mathbf{e}_i^{\mathrm{ad}} \in \mathbb{R}^{d_{\mathrm{model}}} $ are the dense embeddings of the $ t $-th behavior event and the $ i $-th candidate ad, respectively, $ T_u $ is the behavior history length, and $ N_u $ is the number of candidate ads.
+
+##### 3.3.1. SEQUENCE PACKING AND USER-ISOLATED CAUSAL MASK
+
+In industrial training logs, as shown in the left panel of Fig. 3a, a mini-batch is typically formed by sampling $ B_{ins} $ impression instances. Each instance contains a variable-length token sequence composed of (i) a subsequence of the user's historical behavior tokens and (ii) target advertisement tokens to be scored. A straightforward batching strategy pads every instance to a fixed length $ L_{max} $, yielding a dense tensor with dimensions $ B_{ins} \times L_{max} \times d_{model} $, which introduces substantial computational waste when most instances are much shorter than $ L_{max} $.
+
+To eliminate such padding overhead while preserving the temporal semantics, GRAB performs sequence packing by grouping tokens by user. Specifically, tokens from multiple impression instances belonging to the same user $ u $ are merged into a single contiguous token segment, while segments of different users are strictly separated. Within each user segment, all tokens are stably sorted by timestamp so that the packed segment forms a single timeline for sequential modeling. After packing, the batch is represented as one long packed tensor $ H = \text{Pack}(\mathbf{E}^{beh}, \mathbf{E}^{ad}) \in \mathbb{R}^{1 \times L \times d_{model}} $, where $ L $ denotes the total packed length across all users in the mini-batch.
+
+For convenience, we associate each packed position $ p \in \{1, \ldots, L\} $ with (i) a segment ID $ \sigma(p) \in U_B $ indicating which user it belongs to, and (ii) a local time index $ \ell(p) \in \{1, \ldots, L_{\sigma(p)}\} $ within that user segment.
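
The packing procedure and the two per-position indices can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are stand-in strings rather than $ d_{model} $-dimensional vectors, each input is a hypothetical `(user, timestamp, token)` triple, and users are laid out in first-seen order.

```python
from collections import defaultdict


def pack_by_user(instances):
    """Group (user, timestamp, token) triples by user and sort each group by time.

    Returns three parallel lists over packed positions: the tokens, the
    segment IDs sigma(p), and the 1-based local time indices ell(p).
    """
    by_user = defaultdict(list)
    for user, ts, token in instances:
        by_user[user].append((ts, token))

    tokens, sigma, ell = [], [], []
    for user, events in by_user.items():   # dicts keep first-seen user order
        events.sort(key=lambda e: e[0])    # list.sort is stable for equal timestamps
        for local_idx, (_, token) in enumerate(events, start=1):
            tokens.append(token)
            sigma.append(user)
            ell.append(local_idx)
    return tokens, sigma, ell


instances = [
    ("u1", 3, "c"), ("u2", 1, "x"), ("u1", 1, "a"),
    ("u1", 2, "b"), ("u2", 5, "y"),
]
tokens, sigma, ell = pack_by_user(instances)
```

Here the packed length is $ L = 5 $ with no padding, whereas padding two per-user sequences to the longest length would allocate 6 slots; the gap widens as sequence lengths become more uneven.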
+
+Figure 2. Overview of GRAB's end-to-end CTR prediction pipeline: (1) Tokenizing raw fields via a sparse PSTable and fusing them into event tokens. (2) Packing tokens per user with causal and heterogeneous masks. (3) Processing through N Transformer layers equipped with the Causal Action-aware Multi-channel Attention (CamA) mechanism. (4) Final CTR prediction from the output representations.
+
+User-isolated causal mask. On the packed tensor $ H $, we construct an additive attention mask $ M^{\text{pack}} \in \mathbb{R}^{L \times L} $ that enforces two constraints: (1) user isolation (no cross-user attention), and (2) causality within each user's timeline (no future leakage). Formally, for query position $ p $ and key position $ q $,
+
+ $$ M_{p,q}^{\mathrm{pack}}=\begin{cases}1,&\text{if }\sigma(p)=\sigma(q)\text{ and }\ell(q)\leq\ell(p),\\0,&\text{otherwise.}\end{cases} $$
+
+This yields a block-diagonal lower-triangular structure (as shown in Fig. 3b), where each block corresponds to one user segment.
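
As a small illustration of the mask definition above, the following sketch builds $ M^{\text{pack}} $ from per-position segment ids and local time indices. The function name is hypothetical, and the mask is kept as a 0/1 visibility matrix exactly as in the formula; how it is converted into additive logits (for example, a large negative value for hidden entries) is left open here because the text does not specify it.

```python
def user_isolated_causal_mask(seg_ids, local_idx):
    """Return M[p][q] = 1 iff q is in p's user segment and not in p's future."""
    n = len(seg_ids)
    return [
        [
            1 if seg_ids[p] == seg_ids[q] and local_idx[q] <= local_idx[p] else 0
            for q in range(n)
        ]
        for p in range(n)
    ]


# Two users packed back to back: three tokens for u1, two tokens for u2.
seg_ids = ["u1", "u1", "u1", "u2", "u2"]
local_idx = [1, 2, 3, 1, 2]
mask = user_isolated_causal_mask(seg_ids, local_idx)
for row in mask:
    print(row)
```

The printed rows form two lower-triangular blocks on the diagonal (a 3x3 block for `u1` and a 2x2 block for `u2`), with zeros everywhere else.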
+
+##### 3.3.2. HETEROGENEOUS BEHAVIOR TOKENS AND HETEROGENEOUS VISIBILITY MASK
+
+After sequence packing, for each user $u$, we obtain a user-isolated, time-ordered packed stream with its causal mask $M^{\text{pack}}$. To further reduce redundancy in the packed history while preserving the information needed for scoring the current candidate, we instantiate two token views at each packed timestamp $t$. The partial token (history) $\mathbf{h}_t \in \mathbb{R}^{d_{\text{model}}}$ retains only time-varying information that is useful for representing history and discards static or highly repetitive fields (e.g., user_id) that would otherwise be duplicated across historical steps and could lead to overfitting. The full token (candidate) $\mathbf{h}'_t \in \mathbb{R}^{d_{\text{model}}}$ retains the complete information required to score the candidate at time $t$, including the static fields omitted from the partial history view. We then interleave them to form the heterogeneous packed sequence: $H_u = [\mathbf{h}_1, \mathbf{h}_1', \mathbf{h}_2, \mathbf{h}_2', \ldots, \mathbf{h}_{T_u}, \mathbf{h}_{T_u}']$.
+
+Heterogeneous Visibility Mask. On top of the user-isolated causal constraint encoded by $ M^{pack} $, we apply a mask-rewriting operator $ \mathcal{R}(\cdot) $ to obtain the heterogeneous visibility mask $ M^{het} $. Concretely, $ \mathcal{R}(\cdot) $ rewrites the visibility pattern according to the token types in the following way: (i) partial $ (\mathcal{P}) $ tokens only attend to partial history tokens; and (ii) full $ (\mathcal{F}) $ tokens attend to partial history tokens and themselves, but never attend to other full tokens. Formally, indexing positions in $ H_u $ by $ n \in \{1, \ldots, 2T_u\} $, we define the time index $ \tau(n) = \lceil n/2 \rceil $ and the token type $ \kappa(n) = \mathcal{P} $ if $ n $ is odd and $ \kappa(n) = \mathcal{F} $ otherwise. Then, for query position $ p $ and key position $ q $ in $ H_u $, the heterogeneous mask (as shown in Fig. 4) is
+
+
+
+ $$ M_{p,q}^{\mathrm{het}}=\begin{cases}1,&\kappa(p)=\mathcal{P},\;\kappa(q)=\mathcal{P},\;\tau(q)\leq\tau(p),\\ 1,&\kappa(p)=\mathcal{F},\;\kappa(q)=\mathcal{P},\;\tau(q)\leq\tau(p),\\ 1,&\kappa(p)=\mathcal{F},\;p=q,\\ 0,&\text{otherwise}.\end{cases} $$
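
The three visible cases of the heterogeneous mask can be checked on a toy sequence. The sketch below assumes a single user with $ T_u = 2 $ and the interleaved layout [partial, full, partial, full]; the helper name `heterogeneous_mask` is hypothetical.

```python
def heterogeneous_mask(num_steps):
    """Build M^het for one user whose sequence is [h_1, h_1', ..., h_T, h_T']."""
    n_pos = 2 * num_steps

    def tau(n):  # time index, equal to ceil(n / 2) for 1-based n
        return (n + 1) // 2

    def kappa(n):  # token type: odd positions are partial, even are full
        return "P" if n % 2 == 1 else "F"

    mask = [[0] * n_pos for _ in range(n_pos)]
    for p in range(1, n_pos + 1):
        for q in range(1, n_pos + 1):
            sees_partial_history = kappa(q) == "P" and tau(q) <= tau(p)
            full_sees_itself = kappa(p) == "F" and p == q
            if sees_partial_history or full_sees_itself:
                mask[p - 1][q - 1] = 1
    return mask


het = heterogeneous_mask(2)
for row in het:
    print(row)
```

In the output, the columns of full tokens (positions 2 and 4) are zero except on the diagonal, so no token other than a full token itself ever reads a full token.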
+
+##### 3.3.3. ACTION-AWARE ATTENTION: RELATIVE ENCODING AND EFFICIENT COMPUTATION
+
+On top of the heterogeneous behavior tokens and the heterogeneous visibility mask $ M^{het} $, we further adopt an action-aware relative attention bias (RAB) causal attention mechanism. It augments standard multi-head self-attention with three designs: a causal mask to prevent future leakage, a dual sliding-window visibility constraint to support streaming-style training, and a query-aware relative bias that enables the query to directly interact with relative position/time/action signals.
+
+Action-aware relative attention logits. Given a query $ q_{i} $ and a key $ k_{j} $, the attention logit is computed as
+
+ $$ w_{i,j}=\boldsymbol{q}_{i}^{\top}\cdot\left(\boldsymbol{k}_{j}+\mathrm{Pos}_{i,j}+\mathrm{Action}_{i,j}+\mathrm{Time}_{i,j}\right), $$
+
+where $ Pos_{i,j} $, $ Action_{i,j} $, and $ Time_{i,j} $ are learnable embeddings derived from relative position, relative action, and relative time, respectively. For continuous or large-range signals (e.g., action statistics or play durations), we first discretize them into buckets and then perform embedding lookup.
+
+Figure 3. Sequence packing and user-isolated causal masking in GRAB. (a) Sequence packing: instead of padding each impression instance to a fixed length $ L_{max} $, tokens from multiple impressions are concatenated within each user and different users are kept in disjoint segments, yielding a single packed sequence of length $ N_{token} $ for compute-efficient batching. (b) User-isolated causal mask: the mask exhibits a block-diagonal lower-triangular pattern, so each token can only attend to past tokens within the same user segment, enforcing both user isolation and temporal causality.
+
+Compared with a query-agnostic relative bias (e.g., $ w_{i,j} = q_i^\top k_j + Pos_{i,j} + \cdots $), Eq. 3 makes the relative signals action-aware via the interaction terms $ q_i^\top Pos_{i,j} $, $ q_i^\top Action_{i,j} $, and $ q_i^\top Time_{i,j} $, allowing the model to adaptively emphasize different contextual relations under different queries (i.e., target ads).
+
+
+
+
+Figure 4. Heterogeneous behavior tokens and heterogeneous visibility mask $ M^{het} $ (blue entries). Partial tokens attend only to partial-history tokens up to the current time, while full tokens attend to partial-history tokens up to their time index and to themselves, but never to other full tokens, preventing duplicated static information from propagating along time.
+
+
+Causal mask with dual sliding windows. We enforce causality and further restrict attention using combined time and length windows. The mask is defined as $ M_{p,q}^{\text{rab}} = 1 $ if $ q \leq p $ and the distance $ p - q $ does not exceed the length sliding-window limit $ L_w $; otherwise $ M_{p,q}^{\text{rab}} = 0 $.
+
+This serves two key industrial purposes: (1) it bounds per-token computation, guaranteeing stable throughput/latency over growing behavior histories; (2) it matches the online training paradigm, in which events arrive incrementally and the model updates its attention context on the fly without reprocessing the full sequence, boosting training efficiency and serving practicality.
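
A minimal sketch of the length-window part of this mask follows. Only the length limit $ L_w $ is implemented, because the formula above does not spell out the time-window condition; the function name is hypothetical.

```python
def sliding_window_causal_mask(seq_len, window):
    """M[p][q] = 1 iff q <= p and p - q <= window (length window only)."""
    return [
        [1 if q <= p and p - q <= window else 0 for q in range(seq_len)]
        for p in range(seq_len)
    ]


rab_mask = sliding_window_causal_mask(seq_len=5, window=1)
for row in rab_mask:
    print(row)

# Each query attends to at most window + 1 keys, however long the history is.
max_keys_per_query = max(sum(row) for row in rab_mask)
print(max_keys_per_query)
```

The bound of `window + 1` visible keys per query is what keeps per-token attention cost constant as the behavior history grows.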
+
+
+
+Efficient computation. The naive implementation of Eq. 3 would yield an $ O(L^2d_{\text{model}}) $ intermediate tensor, which is prohibitively memory-intensive in practice. We adopt the optimization in (Golovneva et al., 2024) to re-order the computation. We define codebooks $ B^{\text{pos/act/time}} \in \mathbb{R}^{N_s \times d_{\text{model}}} $ and bucketized indices $ p_{i,j}, a_{i,j}, t_{i,j} $. Then Eq. 3 can be equivalently written as:
+
+ $$ w_{i,j}=q_{i}^{\top}k_{j}+(s_{i}^{\mathrm{pos}})[p_{i,j}]+(s_{i}^{\mathrm{act}})[a_{i,j}]+(s_{i}^{\mathrm{time}})[t_{i,j}], $$
+
+where $ s_{i}^{\mathrm{pos}} = q_{i}^{\top}B^{\mathrm{pos}} $, $ s_{i}^{\mathrm{act}} = q_{i}^{\top}B^{\mathrm{act}} $, and $ s_{i}^{\mathrm{time}} = q_{i}^{\top}B^{\mathrm{time}} $. In practice, we first compute the projection vectors $ s_{i}^{*} $, then obtain relative terms via fast gather operations. This completely avoids the large $ L \times L \times d_{model} $ tensor, dramatically reducing peak memory and improving computational efficiency.
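
The equivalence between the two forms of the logit follows from linearity of the dot product, and it can be checked numerically. The sketch below is a pure-Python illustration rather than the authors' implementation: the sizes, the random values, and the bucketing rules used to produce the indices are placeholders chosen only to exercise the identity.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def naive_logit(q, k, codebooks, idx):
    """Eq. 3 form: add the looked-up embeddings to the key, then dot with q."""
    biased_key = list(k)
    for name, bucket in idx.items():
        row = codebooks[name][bucket]
        biased_key = [x + r for x, r in zip(biased_key, row)]
    return dot(q, biased_key)


def reordered_logit(q, k, scores, idx):
    """Reordered form: q.k plus scalar gathers from per-query score vectors."""
    return dot(q, k) + sum(scores[name][bucket] for name, bucket in idx.items())


def rand_vec(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


random.seed(0)
d, n_buckets, seq_len = 4, 3, 5  # placeholder sizes
Q = [rand_vec(d) for _ in range(seq_len)]
K = [rand_vec(d) for _ in range(seq_len)]
codebooks = {
    name: [rand_vec(d) for _ in range(n_buckets)]
    for name in ("pos", "act", "time")
}

max_gap = 0.0
for i in range(seq_len):
    # One projection per query and codebook: n_buckets scalars each.
    scores = {
        name: [dot(Q[i], row) for row in book]
        for name, book in codebooks.items()
    }
    for j in range(i + 1):
        idx = {
            "pos": min(i - j, n_buckets - 1),  # clipped relative position
            "act": (i + j) % n_buckets,  # placeholder bucket index
            "time": (2 * i + j) % n_buckets,  # placeholder bucket index
        }
        gap = abs(
            naive_logit(Q[i], K[j], codebooks, idx)
            - reordered_logit(Q[i], K[j], scores, idx)
        )
        max_gap = max(max_gap, gap)

print(max_gap)  # differs from zero only by floating-point rounding
```

The reordered form stores $ N_s $ scalars per query and codebook, so the per-pair work reduces to index lookups instead of materializing a $ d_{model} $-dimensional vector for every $ (i, j) $ pair.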
+
+##### 3.3.4. MULTI-CHANNEL ATTENTION
+
+Figure 5. Action-aware relative attention bias (RAB) with efficient computation. Left: a causal mask with dual sliding windows, which limits each query to attend only to recent past tokens visible within the sliding window. Right: the action-aware relative encoding pipeline: relative time, position, and action signals are bucketized (as needed), embedded, summed, and injected into the attention logits.
+
+While the action-aware RAB attention (Section 3.3.3) enhances each individual attention logit with relative position/action/time signals, it still treats the packed stream as a single mixed sequence. However, in industrial logs, user behaviors are highly heterogeneous (e.g., spanning different time windows or encompassing different behavior types), and different behavioral subsets often exhibit distinct temporal dynamics and predictive value. A straightforward design is to flatten all tokens into a single sequence and apply causal self-attention, yet this couples heterogeneous sources into one interaction graph and incurs a quadratic cost (e.g., $ O((n + m)^2) $ for two sources with lengths $ n $ and $ m $).
+
+To improve both modeling effectiveness and efficiency, we further introduce the Causal Action-aware Multi-channel Attention (CamA) mechanism, which integrates a multi-channel design that is conceptually analogous to multi-head attention but with channel-specific visibility constraints. We model each channel with an independent causal self-attention stack, and fuse the channel-wise representations via a lightweight gated mixing module. Let $ \mathcal{C} = \{1, \ldots, C\} $ denote the channel set. For each user, channel $ c $ provides a token sequence $ \mathbf{X}^{(c)} \in \mathbb{R}^{T_c \times d} $, and we append the shared target token $ \mathbf{x}^{\mathrm{tar}} \in \mathbb{R}^d $:
+
+ $$ \mathbf{S}^{(c)}=[\mathbf{X}^{(c)};\mathbf{x}^{\mathrm{tar}}]\in\mathbb{R}^{(T_{c}+1)\times d},\qquad t^{\star}=T_{c}+1. $$
+
+Each channel is equipped with its own causal visibility mask $ \mathbf{M}^{(c)} $, and is encoded independently:
+
+ $$ \begin{align*}\mathbf{H}^{(c,\ell+1)}&=\mathrm{Layer}_{\ell}^{(c)}\Big(\tilde{\mathbf{H}}^{(c,\ell)};\mathbf{M}^{(c)}\Big),\\\mathbf{H}^{(c,0)}&=\mathbf{S}^{(c)},\quad c\in\mathcal{C}.\end{align*} $$
+
+Target-token gated mixing. To enable cross-channel information sharing while keeping computation lightweight, we perform mixing only on the target position $ t^* $ at each layer. The mixed representation $ \tilde{\mathbf{h}}^{(c,\ell)} $ is obtained by first computing channel-wise gating weights $ \beta^{(c,\ell)} $ and then aggregating information from all other channels:
+
+ $$ \tilde{\mathbf{h}}^{(c,\ell)}=\mathbf{h}^{(c,\ell)}+\sum_{i\in\mathcal{C}\backslash\{c\}}\beta^{(i,\ell)}\odot\mathbf{h}^{(i,\ell)}. $$
+
+This updated representation replaces $ \mathbf{h}^{(c,\ell)} $ at position $ t^{*} $, forming the updated channel representation $ \tilde{\mathbf{H}}^{(c,\ell)} $ used in (6). Finally, the concatenated last-layer target representations from all channels are used for CTR prediction.
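
The mixing rule above can be sketched for the target-position vectors of a single layer. How the gating weights $ \beta^{(c,\ell)} $ are computed is not specified in this section, so the sketch takes them as given inputs; the function name and the numeric values are hypothetical.

```python
def mix_target_tokens(h, beta):
    """For each channel c, return h[c] + sum over i != c of beta[i] * h[i].

    `h[c]` is channel c's target-position vector at one layer and `beta[i]`
    is the element-wise gate of channel i (assumed to be given).
    """
    num_channels, dim = len(h), len(h[0])
    mixed = []
    for c in range(num_channels):
        out = list(h[c])
        for i in range(num_channels):
            if i == c:
                continue  # a channel's own vector enters ungated
            for k in range(dim):
                out[k] += beta[i][k] * h[i][k]
        mixed.append(out)
    return mixed


h = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
beta = [[0.5, 0.5], [0.25, 0.25], [0.0, 1.0]]
mixed = mix_target_tokens(h, beta)
print(mixed)
```

Because only the vector at position $ t^{*} $ is exchanged, the cross-channel cost per layer is linear in the number of channels and independent of the channel sequence lengths.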
+
+
+
+#### 3.4. Sequence Then Sparse Training
+
+While sequence packing (Section 3.3.1) significantly enhances computational efficiency, it introduces a critical challenge: distribution skew. Since samples within a packed mini-batch belong to the same user, the high intra-user correlation leads to redundant updates for specific sparse IDs, causing the model to overfit to specific user-ad interactions, rather than learning generalizable patterns. To mitigate this, we propose the Sequence Then Sparse (STS) training paradigm (detailed discussions in Appendix D), a two-stage decoupled optimization strategy that balances long-range sequential modeling with robust sparse feature learning.
+
+##### 3.4.1. STAGE I: SEQUENCE MODELING (SEQUENCE PHASE)
+
+The first stage focuses on capturing the evolution of user interests and temporal dependencies. We perform end-to-end autoregressive-like learning on the packed user sequences $ Z $, which include candidate tokens and their historical trajectories. In this phase, we optimize the dense tokenizer and the causal Transformer, while keeping the Sparse Embedding Table $ \Phi $ frozen. By freezing $ \Phi $, we stabilize the token space, forcing the Transformer to focus exclusively on the relational dynamics between events rather than over-memorizing specific ID features.
+
+##### 3.4.2. STAGE II: SPARSE FEATURE LEARNING (SPARSE PHASE)
+
+The second stage is designed to refine the discrete representations, particularly for long-tail IDs. In this phase, we revert to a non-sequential format, treating each sample as an independent user-ad exposure to break the distribution skew. This stage optimizes the sparse embeddings $ \Phi $, which act as a robust corrector for the gradient accumulation amplified by sequence packing. It ensures that the basic feature representations remain accurate and unbiased across the entire traffic distribution.
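
The two stages can be summarized as a schedule of which parameter groups receive updates on which data format. The sketch below is a configuration-style illustration, not the authors' training code: the group names are hypothetical, and treating the dense parameters as frozen in Stage II is an assumption, since the text only states that Stage II optimizes $ \Phi $.

```python
# Hypothetical parameter-group names used only for illustration.
STS_STAGES = {
    "sequence": {
        "data_format": "packed user sequences",
        "trainable": {"dense_tokenizer", "transformer"},
    },
    "sparse": {
        "data_format": "independent user-ad exposures",
        # Assumption: only the sparse table is updated in Stage II.
        "trainable": {"sparse_embedding_table"},
    },
}


def is_trainable(stage, group):
    """Return True if `group` receives gradient updates in `stage`."""
    return group in STS_STAGES[stage]["trainable"]


for stage, cfg in STS_STAGES.items():
    frozen = sorted(
        g
        for g in ("dense_tokenizer", "transformer", "sparse_embedding_table")
        if not is_trainable(stage, g)
    )
    print(stage, "|", cfg["data_format"], "| frozen:", frozen)
```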
+
+### 4. System Deployment
+
+GRAB has been successfully deployed in a large-scale feed ad ranking system, handling billions of daily requests. Unlike conventional memory-bound DLRMs, GR is markedly compute-bound due to the quadratic complexity of Transformer self-attention. To satisfy stringent latency requirements, we implemented a co-designed hardware-software architecture. Due to space constraints, we provide the comprehensive system overview (Fig. 8) and detailed deployment optimizations in Appendix E.
+
+### 5. Experiment
+
+#### 5.1. Overall Performance Comparison
+
+We first compare the performance of GRAB against state-of-the-art recommendation models on a real-world Baidu industrial dataset. The training data, derived from Baidu's production recommendation advertising scenario, contains billions of users, exposure logs, and click logs. The test set includes millions of users, billions of exposure logs, and millions of click logs. The baselines encompass both DLRMs and GR models, including: DIN (Zhou et al., 2018), which models short-term user behavior with target attention; SIM(Soft) (Pi et al., 2020), a sequential model that uses soft-search to encode user interests; TWIN (Si et al., 2024), which extends multi-head target attention from ESU to GSU; HSTU (Zhai et al., 2024), an efficient model for long-sequence behavior modeling; and LONGER (Chai et al., 2025), a Transformer-based architecture designed for ultra-long behavior sequences. Experimental results are presented in Table 1: GRAB outperforms all baselines, achieving a 0.19% relative AUC improvement over the most competitive baseline. Meanwhile, Fig. 6a illustrates the performance of different models across varying lengths of user behavior sequences. GRAB surpasses the other recommendation models at all sequence lengths, with its performance gains becoming more pronounced as the sequence length increases.
+
+Table 1. Overall performance in industrial settings
+
+
+
+| Model | AUC |
+| --- | --- |
| DIN | 0.83309 |
| SIM Soft | 0.83520 |
| TWIN | 0.83556 |
| HSTU | 0.83590 |
| LONGER | 0.83615 |
| GRAB-small | 0.83661 |
| GRAB-standard | 0.83772 |
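
For reference, the relative improvement quoted in the text can be reproduced from Table 1 by comparing GRAB-standard with LONGER, the strongest baseline in the table:

 $$ \frac{0.83772-0.83615}{0.83615}=\frac{0.00157}{0.83615}\approx 0.0019=0.19\%. $$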
+
+#### 5.2. Scaling Analysis
+
+We evaluate model performance across different capacity scales by independently scaling the number of Transformer blocks ($ n_{layer} $), the number of attention heads ($ n_{head} $), and the feature dimension of the model ($ d_{model} $), as listed in Table 2. Fig. 6b presents the test-set performance of the GRAB model under these configurations. The results demonstrate that increasing model capacity effectively improves model performance. We also find that as the model capacity increases, the performance improvement on longer user behavior sequences becomes more significant. Moreover, no significant saturation trend is observed within the current range of configurations, which also confirms the strong scalability of the GRAB model.
+
+Table 2. Comparison of models with different settings
+
+
+
+| Model | Params | Setting |
+| --- | --- | --- |
| GRAB $ _{2l-2h-64d} $ | 6.51M | $ n_{layer}=2 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{4l-2h-64d} $ | 6.67M | $ n_{layer}=4 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{6l-2h-64d} $ | 6.83M | $ n_{layer}=6 $, $ n_{head}=2 $, $ d_{model}=64 $ |
| GRAB $ _{2l-4h-64d} $ | 6.48M | $ n_{layer}=2 $, $ n_{head}=4 $, $ d_{model}=64 $ |
| GRAB $ _{4l-4h-64d} $ | 6.63M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=64 $ |
| GRAB $ _{4l-4h-128d} $ | 7.05M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=128 $ |
| GRAB $ _{4l-4h-256d} $ | 8.13M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=256 $ |
| GRAB $ _{4l-4h-512d} $ | 11.27M | $ n_{layer}=4 $, $ n_{head}=4 $, $ d_{model}=512 $ |
+
+#### 5.3. Ablation Study
+
+Heterogeneous Tokens. We conduct ablation studies on heterogeneous representations with three configurations: GRAB with heterogeneous, only partial, or only full tokens (Table 3). Results show that heterogeneous representations achieve the best performance. Using only partial tokens leads to significant degradation, confirming that full feature representations are more beneficial for target scoring. Notably, using only full tokens also degrades performance, suggesting that artificially designed statistical features can introduce confusion and impair sequence modeling.
+
+Table 3. Ablation studies of GRAB
+
+
+
+| Model | AUC |
+| --- | --- |
| GRAB | 0.83772 |
| GRAB w/ Partial Token | 0.83492 |
| GRAB w/ Full Token | 0.83749 |
| GRAB w/o relative pos | 0.83768 |
| GRAB w/o relative time | 0.83743 |
| GRAB w/o relative action | 0.83724 |
| GRAB w/o Multi-channel | 0.83743 |
| GRAB w/o Target-token mix | 0.83768 |
| GRAB_sparse | 0.83614 |
| GRAB_sparse w/o STS | 0.83549 |
+
+Action-aware Attention. We ablate three components of GRAB's Action-aware Attention: relative position, time, and action. The results (Table 3) show that removing any of these components degrades performance. The decline is more pronounced for time and action than for position, indicating that historical sequences are more sensitive to behavioral and temporal signals. We also analyze the attention weight distribution across buckets defined by relative position/time differences (smaller values denote more recent tokens). As shown in Figure 7, weights decrease as bucket values increase, confirming that more recent behaviors better reflect user interest and receive higher weights. For relative action, we compare positive (click) and negative (non-click) labels. The weight distribution is highly skewed: positive labels account for 88% of the total weight, versus only 12% for negative labels. This suggests that incorporating more positive feedback could further improve sequence modeling.
+
+Figure 6. (a) Overall performance: DLRMs vs. GRs across different user behavior sequence lengths, with a +0.1% improvement in AUC indicating a significant enhancement. (b) Scaling performance: comparison of GRAB variants at different parameter scales.
+
+
+
+
+Figure 7. The weight distribution of action-aware attention in relative position and relative time.
+
+
+Multi-channel Attention. To verify the effectiveness of multi-channel attention in sequence modeling, we evaluate two settings: (1) GRAB without multi-channel attention, i.e., using a single channel for sequence modeling; and (2) GRAB with multi-channel attention retained but the target-token mix component removed. As shown in Table 3, both configurations degrade performance to varying degrees, indicating that each component contributes. Removing multi-channel attention lowers AUC from 0.83772 to 0.83743, whereas removing only the target-token mix lowers it to 0.83768, so multi-channel attention is the more important component and the target-token mix adds a further gain.
+
+STS Training. We evaluate the STS paradigm by comparing GRAB's second-stage training with and without sequence modeling for sparse feature learning. With STS, sparse embeddings are updated through sequence modeling on packed user behavior sequences; without STS, the same batch data is treated as independent exposures. Results (as shown in Table 3) show that STS brings significant accuracy gains in sparse feature learning, confirming the efficacy of the two-stage training. This demonstrates that STS alleviates the distribution skew and overfitting caused by direct sequence-packed training.
+
+
+
+#### 5.4. Online A/B Test
+
+To assess the online performance of GRAB, we deployed it in the Baidu home feed scenario and compared it with the current online DLRM model. The experiment used 10% of the main traffic and remained online for about a month. Online evaluation shows that GRAB delivered a 3.49% improvement in CTR and a 3.05% improvement in CPM, which indicates that GRAB achieves more accurate advertising estimation and brings considerable revenue increments. Notably, GRAB has already been fully deployed at Baidu, and its online inference cost is on par with that of the previous online DLRM model.
+
+### 6. Conclusion
+
+We propose GRAB, an end-to-end generative ranking framework that integrates a novel CamA mechanism to effectively capture temporal dynamics and specific action signals within user behavior sequences. On Baidu's billion-scale industrial dataset, GRAB establishes a new state-of-the-art, outperforming DLRM and other GR baselines. Ablation studies validate the necessity of its key components, and our proposed STS training paradigm effectively mitigates distribution skew. Scaling analysis indicates continued gains from larger models and longer sequences. Finally, full online A/B testing in Baidu home feed ads shows that GRAB boosts CTR by 3.49% and CPM by 3.05%, leading to full production deployment. Further discussion of this work can be found in Appendix F.
+
+## References
+
+Agarwal, S., Yan, C., Zhang, Z., and Venkataraman, S. Bagpipe: Accelerating deep recommendation model training. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 348–363, 2023.
+
+Bai, J., Geng, X., Deng, J., Xia, Z., Jiang, H., Yan, G., and Liang, J. A comprehensive survey on advertising click-through rate prediction algorithm. The Knowledge Engineering Review, 40:e3, 2025.
+
+Bao, K., Zhang, J., Zhang, Y., Wang, W., Feng, F., and He, X. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM conference on recommender systems, pp. 1007–1014, 2023.
+
+Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1387–1395, 2017.
+
+Cao, Y., Mehta, N., Yi, X., Keshavan, R. H., Heldt, L., Hong, L., Chi, E., and Sathiamoorthy, M. Aligning large language models with recommendation knowledge. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 1051–1066, 2024.
+
+Chai, Z., Ren, Q., Xiao, X., Yang, H., Han, B., Zhang, S., Chen, D., Lu, H., Zhao, W., Yu, L., et al. Longer: Scaling up long sequence modeling in industrial recommenders. In Proceedings of the Nineteenth ACM Conference on Recommender Systems, pp. 247–256, 2025.
+
+Chen, J., Chi, L., Peng, B., and Yuan, Z. Hllm: Enhancing sequential recommendations via hierarchical large language models for item and user modeling. arXiv preprint arXiv:2409.12740, 2024.
+
+Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10, 2016.
+
+Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198, 2016.
+
+Di Palma, D., Biancofiore, G. M., Anelli, V. W., Narducci, F., Di Noia, T., and Di Sciascio, E. Evaluating chatgpt as a recommender system: A rigorous approach. arXiv preprint arXiv:2309.03613, 2023.
+
+Doan, T. T., Nguyen, L. M., Pham, N. H., and Romberg, J. Finite-time analysis of stochastic gradient descent under markov randomness. arXiv preprint arXiv:2003.10973, 2020.
+
+Geng, B., Huan, Z., Zhang, X., He, Y., Zhang, L., Yuan, F., Zhou, J., and Mo, L. Breaking the length barrier: Llm-enhanced ctr prediction in long textual user behaviors. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2311–2315, 2024.
+
+Golovneva, O., Wang, T., Weston, J., and Sukhbaatar, S. Contextual position encoding: Learning to count what's important. arXiv preprint arXiv:2405.18719, 2024.
+
+Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
+
+Han, R., Yin, B., Chen, S., Jiang, H., Jiang, F., Li, X., Ma, C., Huang, M., Li, X., Jing, C., et al. Mtgr: Industrial-scale generative recommendation framework in meituan. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 5731–5738, 2025.
+
+He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the eighth international workshop on data mining for online advertising, pp. 1–9, 2014.
+
+He, Z., Xie, Z., Jha, R., Steck, H., Liang, D., Feng, Y., Majumder, B. P., Kallus, N., and McAuley, J. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management, pp. 720–730, 2023.
+
+Hou, Y., Mu, S., Zhao, W. X., Li, Y., Ding, B., and Wen, J.-R. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp. 585–593, 2022.
+
+Hou, Y., He, Z., McAuley, J., and Zhao, W. X. Learning vector-quantized item representation for transferable sequential recommenders. In Proceedings of the ACM Web Conference 2023, pp. 1162–1171, 2023.
+
+Jia, J., Wang, Y., Li, Y., Chen, H., Bai, X., Liu, Z., Liang, J., Chen, Q., Li, H., Jiang, P., et al. Learn: Knowledge adaptation from large language model to recommendation for practical industrial application. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 11861–11869, 2025.
+
+Kang, W.-C. and McAuley, J. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM), pp. 197–206. IEEE, 2018.
+
+Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
+
+Krell, M. M., Kosec, M., Perez, S. P., and Fitzgibbon, A. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv preprint arXiv:2107.02027, 2021.
+
+Li, L., Zhang, Y., Liu, D., and Chen, L. Large language models for generative recommendation: A survey and visionary discussions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 10146–10159, 2024a.
+
+Li, R., Deng, W., Cheng, Y., Yuan, Z., Zhang, J., and Yuan, F. Exploring the upper limits of text-based collaborative filtering using large language models: Discoveries and insights. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 1643–1653, 2025.
+
+Li, S., Guo, H., Tang, X., Tang, R., Hou, L., Li, R., and Zhang, R. Embedding compression in recommender systems: A survey. ACM Computing Surveys, 56(5):1–21, 2024b.
+
+Lin, J., Dai, X., Xi, Y., Liu, W., Chen, B., Zhang, H., Liu, Y., Wu, C., Li, X., Zhu, C., et al. How can recommender systems benefit from large language models: A survey. ACM Transactions on Information Systems, 43(2):1–47, 2025.
+
+Lin, Z., Ding, H., Hoang, N. T., Kveton, B., Deoras, A., and Wang, H. Pre-trained recommender systems: A causal debiasing perspective. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 424–433, 2024.
+
+Liu, J., Liu, C., Zhou, P., Lv, R., Zhou, K., and Zhang, Y. Is chatgpt a good recommender? a preliminary study. arXiv preprint arXiv:2304.10149, 2023.
+
+Luo, S., He, B., Zhao, H., Shao, W., Qi, Y., Huang, Y., Zhou, A., Yao, Y., Li, Z., Xiao, Y., et al. Recranker: Instruction tuning large language model as ranker for top-k recommendation. ACM Transactions on Information Systems, 43(5):1–31, 2025.
+
+Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., and Chi, E. H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1930–1939, 2018a.
+
+Ma, X., Zhao, L., Huang, G., Wang, Z., Hu, Z., Zhu, X., and Gai, K. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1137–1140, 2018b.
+
+Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J., et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 993–1011, 2022.
+
+Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
+
+Ning, L., Liu, L., Wu, J., Wu, N., Berlowitz, D., Prakash, S., Green, B., O'Banion, S., and Xie, J. User-llm: Efficient llm contextualization with user embeddings. In Companion Proceedings of the ACM on Web Conference 2025, pp. 1219–1223, 2025.
+
+Petrov, A. V. and Macdonald, C. Generative sequential recommendation with gptrec. arXiv preprint arXiv:2306.11114, 2023.
+
+Pi, Q., Bian, W., Zhou, G., Zhu, X., and Gai, K. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2671–2679, 2019.
+
+Pi, Q., Zhou, G., Zhang, Y., Wang, Z., Ren, L., Fan, Y., Zhu, X., and Gai, K. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2685–2692, 2020.
+
+Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., and Whang, S. Data validation for machine learning. Proceedings of machine learning and systems, 1:334–347, 2019.
+
+Rajput, S., Mehta, N., Singh, A., Hulikal Keshavan, R., Vu, T., Heldt, L., Hong, L., Tay, Y., Tran, V., Samost, J., et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36: 10299–10315, 2023.
diff --git a/论文/HSTU.md b/论文/HSTU.md
index e4c634f..749eedb 100644
--- a/论文/HSTU.md
+++ b/论文/HSTU.md
@@ -12,7 +12,7 @@ Large-scale recommendation systems are characterized by their reliance on high c
Recommendation systems, quintessential in the realm of online content platforms and e-commerce, play a pivotal role
-
+
Figure 1. Total compute used to train deep learning models over the years. DLRM results are from (Mudigere et al., 2022); GRs are deployed models from this work. DLRMs/GRs are continuously trained in a streaming setting; we report compute used per year.
@@ -22,7 +22,8 @@ in personalizing billions of user experiences on a daily basis. State-of-the-art
Despite utilizing extensive human-engineered feature sets and training on vast amounts of data, most DLRMs in industry scale poorly with compute (Zhao et al., 2023). This limitation is noteworthy and remains unanswered.
-Inspired by the success achieved by Transformers in language and vision, we revisit fundamental design choices in modern recommendation systems. We observe that alternative formulations at billion-user scale need to overcome three challenges. First, features in recommendation systems lack explicit structures. While sequential formulations have been explored in small-scale settings (detailed discussionsin Appendix B), heterogeneous features, including high cardinality ids, cross features, counters, ratios, etc. play critical roles in industry-scale DLRMs (Mudigere et al., 2022). Second, recommendation systems use billion-scale vocabularies that change continuously. A billion-scale dynamic vocabulary, in contrast to 100K-scale static ones in language (Brown et al., 2020), creates training challenges and necessitates high inference cost given the need to consider tens of thousands of candidates in a target-aware fashion (Zhou et al., 2018; Wang et al., 2020). Finally, computational cost represents the main bottleneck in enabling large-scale sequential models. GPT-3 was trained on a total of 300B tokens over a period of 1-2 months with thousands of GPUs (Brown et al., 2020). This scale appears daunting, until we contrast it with the scale of user actions. The largest internet platforms serve billions of daily active users, who engage with billions of posts, images, and videos per day. User sequences could be of length up to $ 10^{5} $ (Chang et al., 2023). Consequently, recommendation systems need to handle a few orders of magnitude more tokens per day than what language models process over 1-2 months.
+Inspired by the success achieved by Transformers in language and vision, we revisit fundamental design choices in modern recommendation systems. We observe that alternative formulations at billion-user scale need to overcome three challenges. First, features in recommendation systems lack explicit structures. While sequential formulations have been explored in small-scale settings (detailed discussions
+in Appendix B), heterogeneous features, including high cardinality ids, cross features, counters, ratios, etc. play critical roles in industry-scale DLRMs (Mudigere et al., 2022). Second, recommendation systems use billion-scale vocabularies that change continuously. A billion-scale dynamic vocabulary, in contrast to 100K-scale static ones in language (Brown et al., 2020), creates training challenges and necessitates high inference cost given the need to consider tens of thousands of candidates in a target-aware fashion (Zhou et al., 2018; Wang et al., 2020). Finally, computational cost represents the main bottleneck in enabling large-scale sequential models. GPT-3 was trained on a total of 300B tokens over a period of 1-2 months with thousands of GPUs (Brown et al., 2020). This scale appears daunting, until we contrast it with the scale of user actions. The largest internet platforms serve billions of daily active users, who engage with billions of posts, images, and videos per day. User sequences could be of length up to $ 10^{5} $ (Chang et al., 2023). Consequently, recommendation systems need to handle a few orders of magnitude more tokens per day than what language models process over 1-2 months.
In this work, we treat user actions as a new modality in generative modeling. Our key insights are: a) core ranking and retrieval tasks in industrial-scale recommenders can be cast as generative modeling problems given an appropriate new feature space; b) this paradigm enables us to systematically leverage redundancies in features, training, and inference to improve efficiency. With this new formulation, we deployed models that are three orders of magnitude more computationally complex than prior state-of-the-art, while improving topline metrics by 12.4%, as shown in Figure 1.
@@ -42,7 +43,270 @@ Modern DLRM models are usually trained with a vast number of categorical ('spars
Categorical ('sparse') features. Examples of such features include items that user liked, creators in a category (e.g., Outdoors) that user is following, user languages, communities that user joined, cities from which requests were initiated, etc. We sequentialize these features as follows. We first select the longest time series, typically by merging the features that represent items user engaged with, as the main time series. The remaining features are generally time series that slowly change over time, such as demographics or followed creators. We compress these time series by keeping the earliest entry per consecutive segment and then merge the results into the main time series. Given these time series change very slowly, this approach does not significantly increase the overall sequence length.
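The compress-then-merge step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(timestamp, token)` event layout, the function names `compress` and `sequentialize`, and the assumption that every input series is already sorted by timestamp are all hypothetical choices made for the example.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def compress(series):
    """Keep only the earliest (timestamp, token) entry of each run of
    consecutive entries that share the same token."""
    return [next(run) for _, run in groupby(series, key=itemgetter(1))]


def sequentialize(main_series, slow_series_list):
    """Merge compressed slow-changing series into the main time series.

    Every series is assumed to be sorted by timestamp already; on equal
    timestamps, entries from earlier arguments come first.
    """
    compressed = [compress(series) for series in slow_series_list]
    return list(merge(main_series, *compressed, key=itemgetter(0)))


# Hypothetical data: item engagements form the main series, while the
# user's language is a slowly changing auxiliary series.
main = [(1, "item:a"), (4, "item:b"), (9, "item:c")]
language = [(0, "lang:en"), (2, "lang:en"), (5, "lang:en"),
            (7, "lang:fr"), (8, "lang:fr")]
merged = sequentialize(main, [language])
```

In this toy input, five language entries collapse to two, so the merged sequence grows by only two tokens, which mirrors the observation that slowly changing series add little to the overall sequence length.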
-Numerical ('dense') features. Examples of such features include weighted and decayed counters, ratios, etc. For instance, one feature could represent user's past click through rate (CTR) on items matching a given topic. Compared to categorical features, these features change much more frequently, potentially with every single (user, item) interaction. It is therefore infeasible to fully sequentialize such features from computational and storage perspectives. However, an important observation is that the categorical features (e.g., item topics, locations) over which we perform these aggregations are already sequentialized and encoded in GRs. Hence, we can remove numerical features in GRs given a sufficiently expressive sequential transduction architecture coupled with a target-aware formulation (Zhou et al., 2018) can meaningfully capture numerical features as we increase
+Numerical ('dense') features. Examples of such features include weighted and decayed counters, ratios, etc. For instance, one feature could represent user's past click through rate (CTR) on items matching a given topic. Compared to categorical features, these features change much more frequently, potentially with every single (user, item) interaction. It is therefore infeasible to fully sequentialize such features from computational and storage perspectives. However, an important observation is that the categorical features (e.g., item topics, locations) over which we perform these aggregations are already sequentialized and encoded in GRs. Hence, we can remove numerical features in GRs given a sufficiently expressive sequential transduction architecture coupled with a target-aware formulation (Zhou et al., 2018) can meaningfully capture numerical features as we increase
+Cui, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. M6-rec: Generative pretrained language models are open-ended recommender systems, 2022.
+
+Dallmann, A., Zoller, D., and Hotho, A. A case study on sampling strategies for evaluating neural sequential item recommendation models. In Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, pp. 505–514, 2021. ISBN 9781450384582.
+
+Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
+
+Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
+
+Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi:10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
+
+Eksombatchai, C., Jindal, P., Liu, J. Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., and Leskovec, J. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pp. 1775–1784, 2018. ISBN 9781450356398.
+
+Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. CoRR, abs/1702.03118, 2017. URL http://arxiv.org/abs/1702.03118.
+
+Gao, W., Fan, X., Wang, C., Sun, J., Jia, K., Xiao, W., Ding, R., Bin, X., Yang, H., and Liu, X. Learning an end-to-end structure for retrieval in large-scale recommendations. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, pp. 524–533, 2021. ISBN 9781450384469.
+
+Gillenwater, J., Kulesza, A., Fox, E., and Taskar, B. Expectation-maximization for learning determinantal point processes. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 3149–3157, Cambridge, MA, USA, 2014. MIT Press.
+
+Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=uYLFOz1vlAC.
+
+Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: A factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1725–1731, 2017. ISBN 9780999241103.
+
+Gupta, M. R., Bengio, S., and Weston, J. Training highly multiclass classifiers. J. Mach. Learn. Res., 15(1):1461–1492, Jan 2014. ISSN 1532-4435.
+
+He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
+
+He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., and Candela, J. Q. Practical lessons from predicting clicks on ads at facebook. In ADKDD'14: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450329996.
+
+Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Session-based recommendations with recurrent neural networks. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06939.
+
+Hou, Y., Zhang, J., Lin, Z., Lu, H., Xie, R., McAuley, J., and Zhao, W. X. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval - 46th European Conference on IR Research, ECIR 2024, 2024.
+
+Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 9099–9117. PMLR, 2022. URL https://proceedings.mlr.press/v162/hua22a.html.
+
+Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Deep networks with stochastic depth, 2016.
+
+Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011. ISSN 0162-8828.
+
+Kang, W.-C. and McAuley, J. Self-attentive sequential recommendation. In 2018 International Conference on Data Mining (ICDM), pp. 197–206, 2018.
+
+Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
+
+Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
+
+Khudia, D., Huang, J., Basu, P., Deng, S., Liu, H., Park, J., and Smelyanskiy, M. Fbgemm: Enabling high-performance low-precision deep learning inference. arXiv preprint arXiv:2101.05615, 2021.
+
+Klenitskiy, A. and Vasilev, A. Turning dross into gold loss: is bert4rec really better than sasrec? In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, pp. 1120–1125, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702419. doi: 10.1145/3604915.3610644. URL https://doi.org/10.1145/3604915.3610644.
+
+Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., and Catanzaro, B. Reducing activation recomputation in large transformer models, 2022.
+
+Li, J., Wang, M., Li, J., Fu, J., Shen, X., Shang, J., and McAuley, J. Text is all you need: Learning language representations for sequential recommendation. In KDD, 2023.
+
+Li, C., Chang, E., Garcia-Molina, H., and Wiederhold, G. Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4):792–808, 2002.
+
+Liu, Z., Zou, L., Zou, X., Wang, C., Zhang, B., Tang, D., Zhu, B., Zhu, Y., Wu, P., Wang, K., and Cheng, Y. Monolith: Real time recommendation system with collisionless embedding table, 2022.
+
+Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., and Chi, E. H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. KDD '18, 2018.
+
+Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J., Luo, L., Yang, J. A., Gao, L., Ivchenko, D., Basant, A., Hu, Y., Yang, J., Ardestani, E. K., Wang, X., Komuravelli, R., Chu, C.-H., Yilmaz, S., Li, H., Qian, J., Feng, Z., Ma, Y., Yang, J., Wen, E., Li, H., Yang, L., Sun, C., Zhao, W., Melts, D., Dhulipala, K., Kishore, K., Graf, T., Eisenman, A., Matam, K. K., Gangidi, A., Chen, G. J., Krishnan, M., Nayak, A., Nair, K., Muthiah, B., Khorashadi, M., Bhattacharya, P., Lapukhov, P., Naumov, M., Mathews, A., Qiao, L., Smelyanskiy, M., Jia, B., and Rao, V. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA '22, pp. 993–1011, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450386104. doi: 10.1145/3470496.3533727. URL https://doi.org/10.1145/3470496.3533727.
+
+Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZulu.
+
+Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference, 2022.
+
+Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.
+
+Rabe, M. N. and Staats, C. Self-attention does not need $ o(n^{2}) $ memory, 2021.
+
+Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435.
+
+Rendle, S. Factorization machines. In 2010 IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. doi: 10.1109/ICDM.2010.127.
+
+Rendle, S., Krichene, W., Zhang, L., and Anderson, J. Neural collaborative filtering vs. matrix factorization revisited. In Fourteenth ACM Conference on Recommender Systems (RecSys'20), pp. 240–248, 2020. ISBN 9781450375832.
+
+Shazeer, N. Glu variants improve transformer, 2020.
+
+Shin, K., Kwak, H., Kim, S. Y., Ramström, M. N., Jeong, J., Ha, J.-W., and Kim, K.-M. Scaling law for recommendation models: towards general-purpose user representations. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23. AAAI Press, 2023. ISBN 978-1-57735-880-0. doi:10.1609/aaai.v37i4.25582. URL https://doi.org/10.1609/aaai.v37i4.25582.
+
+Shrivastava, A. and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, volume 27, 2014.
+
+Sileo, D., Vossen, W., and Raymaekers, R. Zero-shot recommendation as language modeling. In Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg, K., and Setty, V. (eds.), Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part II, volume 13186 of Lecture Notes in Computer Science, pp. 223–230. Springer, 2022. doi: 10.1007/978-3-030-99739-7\_26. URL https://doi.org/10.1007/978-3-030-99739-7\_26.
+
+Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding, 2023.
+
+Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., and Jiang, P. Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pp. 1441–1450, 2019. ISBN 9781450369763.
+
+Tang, H., Liu, J., Zhao, M., and Gong, X. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems, RecSys '20, pp. 269–278, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450375832. doi: 10.1145/3383313.3412236. URL https://doi.org/10.1145/3383313.3412236.
+
+Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023a.
+
+Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b.
+
+Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010, 2017. ISBN 9781510860964.
+
+Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., and Chi, E. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, WWW '21, pp. 1785–1797, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3450078. URL https://doi.org/10.1145/3442381.3450078.
+
+Wang, Z., Zhao, L., Jiang, B., Zhou, G., Zhu, X., and Gai, K. Cold: Towards the next generation of pre-ranking system, 2020.
+
+Xia, X., Eksombatchai, P., Pancha, N., Badani, D. D., Wang, P.-W., Gu, N., Joshi, S. V., Farahpour, N., Zhang, Z., and Zhai, A. Transact: Transformer-based real-time user action model for recommendation at pinterest. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, pp. 5249–5259, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599918. URL https://doi.org/10.1145/3580305.3599918.
+
+Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 3119–3125. AAAI Press, 2017. ISBN 9780999241103.
+
+Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
+
+Yang, J., Yi, X., Zhiyuan Cheng, D., Hong, L., Li, Y., Xiaoming Wang, S., Xu, T., and Chi, E. H. Mixed negative sampling for learning two-tower neural networks in recommendations. In Companion Proceedings of the Web Conference 2020, WWW '20, pp. 441–447, 2020. ISBN 9781450370240.
+
+Zhai, J., Lou, Y., and Gehrke, J. Atlas: A probabilistic algorithm for high dimensional similarity search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pp. 997–1008, 2011. ISBN 9781450306614.
+
+Zhai, J., Gong, Z., Wang, Y., Sun, X., Yan, Z., Li, F., and Liu, X. Revisiting neural retrieval on accelerators. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, pp. 5520–5531, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599897. URL https://doi.org/10.1145/3580305.3599897.
+
+Zhai, Y., Jiang, C., Wang, L., Jia, X., Zhang, S., Chen, Z., Liu, X., and Zhu, Y. Bytetransformer: A high-performance transformer boosted for variable-length inputs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 344–355, Los Alamitos, CA, USA, May 2023b. IEEE Computer Society. doi: 10.1109/IPDPS54959.2023.00042. URL https://doi.ieeecomputersociety.org/10.1109/IPDPS54959.2023.00042.
+
+Zhang, B., Luo, L., Liu, X., Li, J., Chen, Z., Zhang, W., Wei, X., Hao, Y., Tsang, M., Wang, W., Liu, Y., Li, H., Badr, Y., Park, J., Yang, J., Mudigere, D., and Wen, E. Dhen: A deep and hierarchical ensemble network for large-scale click-through rate prediction, 2022.
+
+Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang, J. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys '18, pp. 95–103, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450359016. doi: 10.1145/3240323.3240374. URL https://doi.org/10.1145/3240323.3240374.
+
+Zhao, Z., Yang, Y., Wang, W., Liu, C., Shi, Y., Hu, W., Zhang, H., and Yang, S. Breaking the curse of quality saturation with user-centric ranking, 2023.
+
+Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. KDD '18, 2018.
+
+Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., and Wen, J.-R. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, pp. 1893–1902, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368599. doi: 10.1145/3340531.3411954. URL https://doi.org/10.1145/3340531.3411954.
+
+Zhuo, J., Xu, Z., Dai, W., Zhu, H., Li, H., Xu, J., and Gai, K. Learning optimal tree models under beam search. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
+
+### A. Notations
+
+We summarize key notations used in this paper in Table 8 and Table 9.
+
+
+| Symbol | Description |
| $ \Psi_{k}(t_{j}) $ | The k-th training example (k is ordered globally) emitted by the feature logging system at time $ t_{j} $. In a typical DLRM recommendation system, after the user consumes some content $ \Phi_{i} $ (by responding with an action $ a_{i} $ such as skip, video completion, or share), the feature logging system joins the tuple $ (\Phi_{i}, a_{i}) $ with the features used to rank $ \Phi_{i} $, and emits $ (\Phi_{i}, a_{i}, \text{features for } \Phi_{i}) $ as a training example $ \Psi_{k}(t_{j}) $. As discussed in Section 2.3, DLRMs and GRs deal with different numbers of training examples, with the number of examples in GRs typically being 1-2 orders of magnitude smaller. |
| $ n_{c} $ ($ n_{c,i} $) | Number of contents that the user has interacted with (for user/sample i). |
| $ \Phi_{0}, \dots, \Phi_{n_{c}-1} $ | List of contents that a user has interacted with, in the context of a recommendation system. |
| $ a_{0}, \dots, a_{n_{c}-1} $ | List of user actions corresponding to the $ \Phi_{i} $s. When all predicted events are binary, each action can be considered a multi-hot vector over (atomic) events such as like, share, comment, image view, video initialization, video completion, hide, etc. |
+
+Table 8. Table of Notations (continued on the next page).
+
+
+### B. Generative Recommenders: Background and Formulations
+
+Many readers are likely more familiar with classical Deep Learning Recommendation Models (DLRMs) (Mudigere et al., 2022), given their popularity since the YouTube DNN days (Covington et al., 2016) and their widespread usage on every large online content and e-commerce platform (Cheng et al., 2016; Zhou et al., 2018; Wang et al., 2021; Chang et al., 2023; Xia et al., 2023; Zhai et al., 2023a). DLRMs operate on top of heterogeneous feature spaces using various neural
+Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
+
+
+
+| Symbol | Description |
| X | Input to an HSTU layer. In standard terminology (before batching), $ X \in \mathbb{R}^{N \times d} $ assuming we have an input sequence containing N tokens. |
| $ Q(X) $, $ K(X) $, $ V(X) $ | Query, key, value in HSTU obtained for a given input X based on Equation (1). The definition is similar to Q, K, and V in standard Transformers. $ Q(X) $, $ K(X) \in \mathbb{R}^{h \times N \times d_{qk}} $, and $ V(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ U(X) $ | HSTU uses $ U(X) $ to “gate” attention-pooled values ( $ V(X) $) in Equation (3), which together with $ f_2(\cdot) $, enables HSTU to avoid feedforward layers altogether. $ U(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ A(X) $ | Attention tensor obtained for input X. $ A(X) \in \mathbb{R}^{h \times N \times N} $. |
| $ Y(X) $ | Output of an HSTU layer obtained for the input X. $ Y(X) \in \mathbb{R}^{d} $. |
| Split( $ \cdot $) | The operation that splits a tensor into chunks. $ \phi_1(f_1(X)) \in \mathbb{R}^{N \times (2hd_{qk} + 2hd_v)} $ in Equation (1); we obtain $ U(X) $, $ V(X) $ (both of shape $ h \times N \times d_v $), $ Q(X) $, $ K(X) $ (both of shape $ h \times N \times d_{qk} $) by splitting the larger tensor (and permuting dimensions) with $ U(X) $, $ V(X) $, $ Q(X) $, $ K(X) = \text{Split}(\phi_1(f_1(X))) $. |
| $ \text{rab}^{p,t} $ | relative attention bias that incorporates both positional (Raffel et al., 2020) and temporal information (based on the time when the tokens are observed, $ t_0, \ldots, t_{n-1} $; one possible implementation is to apply some bucketization function to $ (t_j - t_i) $ for $ (i, j) $). In practice, we share $ \text{rab}^{p,t} $ across different attention heads within a layer, hence $ \text{rab}^{p,t} \in \mathbb{R}^{1 \times N \times N} $. |
| $ \alpha $ | Parameter controlling sparsity in the Stochastic Length algorithm used in HSTU (Section 3.2). |
| $ R $ | Register size on GPUs, in the context of the HSTU algorithm discussed in Section 3.2. |
| m | Number of candidates considered in a recommendation system's ranking stage. |
| $ b_m $ | Microbatch size, in the M-FALCON algorithm discussed in Section 3.4. |
+
+Table 9. Table of Notations (continued)
+
+
+networks including feature interaction modules (Guo et al., 2017; Xiao et al., 2017; Wang et al., 2021), sequential pooling or target-aware pairwise attention modules (Hidasi et al., 2016; Zhou et al., 2018; Chang et al., 2023) and advanced multi-expert multi-task modules (Ma et al., 2018; Tang et al., 2020). We hence provided an overview of Generative Recommenders (GRs) by contrasting them with classical DLRMs explicitly in Section 2 and Section 3. In this section, we give the readers an alternative perspective starting from the classical sequential recommender literature.
+
+#### B.1. Background: Sequential Recommendations in Academia and Industry
+
+##### B.1.1. ACADEMIC RESEARCH (TRADITIONAL SEQUENTIAL RECOMMENDER SETTINGS)
+
+Recurrent neural networks (RNNs) were first applied to recommendation scenarios in GRU4Rec (Hidasi et al., 2016). Hidasi et al. (2016) considered Gated Recurrent Units (GRUs) and applied them over two datasets, RecSys Challenge 2015 $ ^{2} $ and VIDEO (a proprietary dataset). In both cases, only positive events (clicked e-commerce items or videos where users spent at least a certain amount of time watching) were kept as part of the input sequence. We further observe that in a classical industrial-scale two-stage recommendation system setup consisting of retrieval and ranking stages (Covington et al., 2016), the task that Hidasi et al. (2016) solved primarily maps to the retrieval task.
+
+Transformers, sequential transduction architectures, and their variants. Advances in sequential transduction architectures in later years, in particular Transformers (Vaswani et al., 2017), have motivated similar advancements in recommendation systems. SASRec (Kang & McAuley, 2018) first applied Transformers in an autoregressive setting. They considered the presence of a review or rating as positive feedback, thereby converting classical datasets like Amazon Reviews $ ^3 $ and MovieLens $ ^4 $ to sequences of positive items, similar to GRU4Rec. A binary cross entropy loss was employed, where positive target is defined as the next “positive” item (recall this is in essence just presence of a review or rating), and negative target is randomly sampled from the item corpus $ \mathbb{X} = \mathbb{X}_c $.
+Most subsequent research was built upon settings similar to those of GRU4Rec (Hidasi et al., 2016) and SASRec (Kang & McAuley, 2018) discussed above, such as BERT4Rec (Sun et al., 2019), which applies the bidirectional encoder setting from BERT (Devlin et al., 2019), and S3Rec (Zhou et al., 2020), which introduces an explicit pre-training stage.
+
+##### B.1.2. INDUSTRIAL APPLICATIONS AS PART OF DEEP LEARNING RECOMMENDATION MODELS (DLRMS)
+
+Sequential approaches, including sequential encoders and pairwise attention modules, have been widely applied in industrial settings due to their ability to enhance user representations as part of DLRMs. DLRMs commonly use relatively small sequence lengths, such as 20 in BST (Chen et al., 2019), 1,000 in DIN (Zhou et al., 2018), and 100 in TransAct (Xia et al., 2023). We observe that these are 1-3 orders of magnitude smaller compared with 8,192 in this work (Section 4.3).
+
+Despite using short sequence lengths, most DLRMs can successfully capture long-term user preferences. This can be attributed to two key aspects. First, precomputed user profiles/embeddings (Xia et al., 2023) or external vector stores (Chang et al., 2023) are commonly used in modern DLRMs, both of which effectively extend lookback windows. Second, a significant number of contextual-, user-, and item-side features are generally employed (Zhou et al., 2018; Chen et al., 2019; Chang et al., 2023; Xia et al., 2023), and various heterogeneous networks, such as FMs (Xiao et al., 2017; Guo et al., 2017), DCNs (Wang et al., 2021), MoEs, etc., are used to transform representations and combine outputs.
+
+In contrast to the sequential settings discussed in Appendix B.1.1, all major industrial work defines the loss over (user/request, candidate item) pairs. In the ranking setting, a multi-task binary cross-entropy loss is commonly used. In the retrieval setting, the two-tower setting (Covington et al., 2016) remains the dominant approach. Recent work has investigated representing the next item to recommend as a probability distribution over a sequence of (sub-)tokens, such as OTM (Zhuo et al., 2020) and DR (Gao et al., 2021) (note that in other recent work, the same setting is sometimes denoted as “generative retrieval”). These approaches commonly utilize beam search to decode the item from sub-tokens. Advanced learned similarity functions, such as mixture-of-logits (Zhai et al., 2023a), have also been proposed and deployed as an alternative to the two-tower setting and beam search, given the proliferation of modern accelerators such as GPUs, custom ASICs, and TPUs.
+
+From a problem formulation perspective, we consider all work discussed above part of DLRMs (Mudigere et al., 2022) given the model architectures, features used, and losses used differ significantly from academic sequential recommender research discussed in Appendix B.1.1. It's also worth remarking that there have been no successful applications of fully sequential ranking settings in industry, especially not at billion daily active users (DAU) scale, prior to this work.
+
+#### B.2. Formulations: Ranking and Retrieval as Sequential Transduction Tasks in Generative Recommenders (GRs)
+
+We next discuss three limitations in the traditional sequential recommender settings and DLRM settings, and how Generative Recommenders (GRs) address them from a problem formulation perspective.
+
+Ignorance of features other than user-interacted items. Past sequential formulations only consider contents (items) users explicitly interacted with (Hidasi et al., 2016; Kang & McAuley, 2018; Sun et al., 2019; Zhou et al., 2020), while industry-scale recommendation systems prior to GRs are trained over a vast number of features to enhance the representation of users and contents (Covington et al., 2016; Cheng et al., 2016; Zhou et al., 2018; Chen et al., 2019; Chang et al., 2023; Xia et al., 2023; Zhai et al., 2023a). GR addresses this limitation by a) compressing other categorical features and merging them with the main time series, and b) capturing numerical features through cross-attention interaction utilizing a target-aware formulation, as discussed in Section 2.1 and Figure 2. We validate this by showing that the traditional “interaction-only” formulation that ignores such features degrades model quality significantly; experiment results can be found in the rows labeled “GR (interactions only)” in Table 7 and Table 6, where we show that utilizing only interaction history led to a 1.3% decrease in hit rate@100 for retrieval and a 2.6% NE increase in ranking (recall that a 0.1% change in NE is significant, as discussed in Sections 4.1.2 and 4.3.1).
+
+User representations are computed in a target-independent setting. A second issue is that most traditional sequential recommenders, including GRU4Rec (Hidasi et al., 2016), SASRec (Kang & McAuley, 2018), BERT4Rec (Sun et al., 2019), S3Rec (Zhou et al., 2020), etc., are formulated in a target-independent fashion: for a target item $ \Phi_i $, the preceding items $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ are used as encoder input to compute user representations, which are then used to make predictions. In contrast, most major DLRM approaches used in industrial settings formulate their sequential modules in a target-aware fashion, with the ability to incorporate “target” (ranking candidate) information into the user representations. These include DIN (Zhou et al., 2018) (Alibaba), BST (Chen et al., 2019) (Alibaba), TWIN (Chang et al., 2023) (Kwai), and TransAct (Xia et al., 2023) (Pinterest).
+Generative Recommenders (GRs) combine the best of both worlds by interleaving the content and action sequences (Section 2.2) to enable applying target-aware attention in causal, autoregressive settings. We categorize and contrast prior work and this work in Table 10 $ ^{5} $.
+
+
+
+| Method | Input for target item $ i $ | Expected output for target item $ i $ | Architecture | Training Procedure |
| GRs | $ \Phi_0, a_0, \Phi_1, a_1, \ldots, \Phi_i $ | $ a_i $ (target-aware) | Self-attention (HSTU) | Causal autoregressive (streaming/single-pass) |
| GRU4Rec / SASRec | $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ | $ \Phi_i $ | RNNs (GRUs) / Self-attention (Transformers) | Causal autoregressive (multi-pass) |
| BERT4Rec / S3Rec | $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ (at inference time) | $ \Phi_i $ | Self-attention (Transformers) | Sequential multi-pass $ ^6 $ |
| DIN / BST / TWIN / TransAct | $ \Phi_0, \Phi_1, \ldots, \Phi_i $ or $ (\Phi_0, a_0), \ldots, (\Phi_{i-1}, a_{i-1}), \Phi_i $ | $ a_i $ (target-aware, implicitly as part of DLRMs) | Pairwise attention / Self-attention (Transformers) / Two-stage pairwise attention / Self-attention (Transformers) | Pointwise (generally streaming/single-pass) |
+
+Table 10. Comparison of prior work on sequential recommenders and GRs, in the ranking setting, with DLRMs included for completeness.
+
+Discriminative formulations restrict applicability of prior sequential recommender work to pointwise settings. Finally, traditional sequential recommenders are discriminative by design. Existing sequential recommender literature, including seminal work such as GRU4Rec and SASRec, models $ p(\Phi_i|\Phi_0, a_0, \ldots, \Phi_{i-1}, a_{i-1}) $, i.e., the conditional distribution of the next item to recommend given the user's current state. On the other hand, we observe that there are two probabilistic processes in standard recommendation systems, namely the process of the recommendation system suggesting a content $ \Phi_i $ (e.g., some photo or video) to the user, and the process of the user reacting to the suggested content $ \Phi_i $ via some action $ a_i $ (which can be a combination of like, video completion, skip, etc.).
+
+A generative approach needs to model the joint distribution over the sequence of suggested contents and user actions, or $p(\Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-1}, a_{n_{c}-1})$, as discussed in Section 2.2. Our proposal of Generative Recommenders enables modeling of such distributions, as shown in Table 11 (Figure 8). Note that the next action token $(a_{i})$ prediction task is exactly the GR ranking setting discussed in Table 1, whereas the next content $(\Phi_{i})$ prediction task is similar to the retrieval setting adapted to the interleaved setting, with the target changed in order to learn the input data distribution.
+
+
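+| Task | Field | Specification |
| Next action token ( $ a_{i} $) prediction | $ x_{i} $s (inputs) | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| Next action token ( $ a_{i} $) prediction | $ y_{i} $s (outputs) | $ a_{0}, \varnothing, a_{1}, \varnothing, \ldots, a_{n_{c}-2}, \varnothing, a_{n_{c}-1}, \varnothing $ |
| Next action token ( $ a_{i} $) prediction | $ n $ (length) | $ 2n_{c} $ |
| Next content ( $ \Phi_{i} $) prediction | $ x_{i} $s (inputs) | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| Next content ( $ \Phi_{i} $) prediction | $ y_{i} $s (outputs) | $ \varnothing, \Phi_{1}, \varnothing, \Phi_{2}, \ldots, \varnothing, \Phi_{n_{c}-1}, \varnothing, \varnothing $ |
| Next content ( $ \Phi_{i} $) prediction | $ n $ (length) | $ 2n_{c} $ |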
+
+Table 11. Generative modeling over $ p(\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}) $. An illustration is provided in Figure 8.
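+
+The two target layouts in Table 11 can be written down mechanically. The following is a minimal sketch, not the authors' code: the `("phi", i)` / `("a", i)` token encoding and all function names are hypothetical, and `None` stands in for $ \varnothing $ (positions that carry no supervision).
+
```python
from typing import List, Optional, Tuple

# Hypothetical token encoding: ("phi", i) is content i, ("a", i) is the action on it.
Token = Tuple[str, int]


def interleave(n_c: int) -> List[Token]:
    """Inputs x: Phi_0, a_0, ..., Phi_{n_c-1}, a_{n_c-1} (length 2 * n_c)."""
    xs: List[Token] = []
    for i in range(n_c):
        xs.append(("phi", i))
        xs.append(("a", i))
    return xs


def action_targets(xs: List[Token]) -> List[Optional[Token]]:
    """Next-action task: each content position is supervised with its action."""
    return [("a", idx) if kind == "phi" else None for kind, idx in xs]


def content_targets(xs: List[Token], n_c: int) -> List[Optional[Token]]:
    """Next-content task: each action position is supervised with the next content."""
    ys: List[Optional[Token]] = []
    for kind, idx in xs:
        if kind == "a" and idx + 1 < n_c:
            ys.append(("phi", idx + 1))
        else:
            ys.append(None)  # content positions and the final action have no target
    return ys
```
+
+Both target lists have the same length $ 2n_c $ as the input, which is what allows a single causal pass over the interleaved sequence to serve either task.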
+
+
+Importantly, this formulation not only enables proper modeling of the data distribution but also enables sampling sequences of items to recommend to the user directly via, e.g., beam search. We hypothesize that this will lead to a superior approach compared with traditional listwise settings (e.g., DPP (Gillenwater et al., 2014) and RL (Zhao et al., 2018)), and we leave the full formulation and evaluation of such systems (briefly discussed in Section 6) as future work.
+
+### C. Evaluation: Synthetic Data
+
+Figure 8. Comparison of traditional sequential recommenders (left) and Generative Recommenders (right). We illustrate sequential recommenders in causal autoregressive settings and GRs without contextual features to facilitate comparison. On the left hand side, the action types $ a_{i} $s are either ignored or combined with item information $ \Phi_{i} $s using MLPs, before going into self-attention blocks.
+
+As previously discussed in Section 3.1, standard softmax attention, due to its normalization factor, makes it challenging to capture the intensity of user preferences, which is important for user representation learning. This aspect matters in recommendation scenarios because the system may need to predict the intensity of engagements (e.g., the number of future positive actions on a particular topic) in addition to the relative ordering of items.
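+
+The effect of the softmax normalization described above can be seen in a toy example. The sketch below is an illustration under simplifying assumptions, not the HSTU layer: scores and values are scalars, and a SiLU-weighted sum is used as a stand-in for pointwise aggregation. Two histories with the same mix of topics but a tenfold difference in engagement count are pooled both ways.
+
```python
import math
from typing import List, Tuple


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def softmax_pool(scores: List[float], values: List[float]) -> float:
    """Softmax attention: weights are normalized to sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * v for e, v in zip(exps, values))


def pointwise_pool(scores: List[float], values: List[float]) -> float:
    """Pointwise aggregation: each weight depends only on its own score."""
    return sum(silu(s) * v for s, v in zip(scores, values))


def history(num_matching: int, num_other: int) -> Tuple[List[float], List[float]]:
    """Toy history: matching items score 2.0 with value 1.0, others score 0.0 with value 0.0."""
    scores = [2.0] * num_matching + [0.0] * num_other
    values = [1.0] * num_matching + [0.0] * num_other
    return scores, values


light = history(2, 2)    # 2 engagements with the topic
heavy = history(20, 20)  # 20 engagements, same proportions
```
+
+Here `softmax_pool(*light)` and `softmax_pool(*heavy)` coincide, because only the proportions of the history survive normalization, whereas `pointwise_pool` grows with the number of matching engagements.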
+
+To understand this behavior, we construct synthetic data following a Dirichlet Process that generates streaming data over a dynamic set of vocabulary. Dirichlet Process captures the behavior that ‘rich gets richer’ in user engagement histories. We set up the synthetic experiment as follows:
+
+• We randomly assign each one of 20,000 item ids to exactly one of 100 categories.
+
+• We generate 1,000,000 records of length 128 each, with the first 90% being used for training and the final 10% used for testing. To simulate the streaming training setting, we make the initial 40% of item ids available initially and the rest available progressively at equal intervals; i.e., at record 500,000, the maximum id that can be sampled is $ (40\% + 60\% \times 0.5) \times 20,000 = 14,000 $.
+
+• We randomly select up to 5 categories out of 100 for each record and randomly sample a prior $ H_{c} $ over these 5 categories. We sequentially sample category for each position following a Dirichlet process over possible categories as follows:
+
+- for n > 1:
+
+  * with probability $ \alpha/(\alpha+n-1) $, draw category c from $ H_{c} $.
+
+  * with probability $ n_{c}/(\alpha + n - 1) $, draw category c, where $ n_{c} $ is the number of previous items with category c.
+
+  * randomly sample an assigned item matching category c subject to streaming constraints.
+
+where $ \alpha $ is uniformly sampled at random from (1.0, 500.0).
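+
+The generation procedure above can be sketched as follows. This is a minimal illustration rather than the authors' generator, and several details are assumptions: the number of categories per record is drawn uniformly from 1 to 5, the prior $ H_c $ uses random unnormalized weights, the first position is drawn from $ H_c $ (which is what the formula gives for n = 1), and a category with no available item falls back to any available id. All function names are hypothetical.
+
```python
import bisect
import random
from collections import Counter
from typing import Dict, List


def assign_categories(num_items: int, num_categories: int,
                      rng: random.Random) -> Dict[int, List[int]]:
    """Assign every item id to exactly one category; ids per category stay sorted."""
    by_cat: Dict[int, List[int]] = {c: [] for c in range(num_categories)}
    for item in range(num_items):
        by_cat[rng.randrange(num_categories)].append(item)
    return by_cat


def num_available_ids(record_idx: int, num_records: int, num_items: int,
                      initial_frac: float = 0.4) -> int:
    """Streaming constraint: only ids below the returned bound can be sampled."""
    initial = round(initial_frac * num_items)
    return initial + (num_items - initial) * record_idx // num_records


def sample_record(by_cat: Dict[int, List[int]], record_idx: int, num_records: int,
                  num_items: int, rng: random.Random,
                  length: int = 128, max_cats: int = 5) -> List[int]:
    bound = num_available_ids(record_idx, num_records, num_items)
    cats = rng.sample(sorted(by_cat), rng.randint(1, max_cats))
    prior = [1.0 - rng.random() for _ in cats]  # assumed H_c, weights in (0, 1]
    alpha = rng.uniform(1.0, 500.0)
    counts: Counter = Counter()
    items: List[int] = []
    for n in range(1, length + 1):
        if rng.random() < alpha / (alpha + n - 1):
            c = rng.choices(cats, weights=prior)[0]       # new draw from H_c
        else:
            seen = list(counts)                           # "rich get richer"
            c = rng.choices(seen, weights=[counts[k] for k in seen])[0]
        counts[c] += 1
        pool = by_cat[c]
        k = bisect.bisect_left(pool, bound)               # ids in pool[:k] are available
        items.append(pool[rng.randrange(k)] if k else rng.randrange(bound))
    return items
```
+
+With the paper's sizes, `num_available_ids(500_000, 1_000_000, 20_000)` reproduces the bound of 14,000 stated above.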
+
+The results can be found in Table 2. We always ablate $ \text{rab}^{p,t} $ for HSTU as this dataset does not have timestamps. We observe that HSTU increases Hit Rate@10 by more than 100% relative to standard Transformers. Importantly, replacing HSTU's pointwise attention mechanism with softmax ("HSTU w/ Softmax") also leads to a significant reduction in hit rate, verifying the importance of pointwise attention-like aggregation mechanisms.
+
+### D. Evaluation: Traditional Sequential Recommender Settings
+
+Our evaluations in Section 4.1.1 focused on comparing HSTU with a state-of-the-art Transformer baseline, SASRec, utilizing the latest training recipe. In this section, we further consider two alternative approaches.
+
+Recurrent neural networks (RNNs). We consider the classical work on sequential recommenders, GRU4Rec (Hidasi et al., 2016), to help readers understand how self-attention models, including Transformers and HSTU, compare to traditional RNNs when all the latest modeling and training improvements are fully incorporated.
+
+Self-supervised sequential approaches. We consider the most popular work, BERT4Rec (Sun et al., 2019), to understand how bidirectional self-supervision (leveraged in BERT4Rec via a Cloze objective) compares with unidirectional causal autoregressive settings, such as SASRec and HSTU.
+
+
+
+| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| ML-1M | BERT4Rec | .2843 (-0.4%) | - | - | .1537 (-4.1%) | - |
| ML-1M | GRU4Rec | .2811 (-1.5%) | - | - | .1648 (+2.8%) | - |
| ML-1M | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| ML-1M | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| ML-20M | BERT4Rec | .2816 (-3.4%) | - | - | .1703 (+5.1%) | - |
| ML-20M | GRU4Rec | .2813 (-3.2%) | - | - | .1730 (+6.7%) | - |
| ML-20M | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| ML-20M | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| Books | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| Books | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |
+
+Table 12. Evaluations of methods on public datasets in traditional sequential recommender settings (multi-pass, full-shuffle). Compared with Table 4, two other baselines (GRU4Rec and BERT4Rec) are included for completeness.
+
+
+Results are presented in Table 12. We reuse BERT4Rec results and GRU4Rec results on ML-1M and ML-20M as reported by Klenitskiy & Vasilev (2023). Given a sampled softmax loss is used, we hold the number of negatives used constant (128 for ML-1M, ML-20M and 512 for Amazon Books) to ensure a fair comparison between methods.
+
+The results confirm that SASRec remains one of the most competitive approaches in traditional sequential recommendation settings when sampled softmax loss is used (Zhai et al., 2023a; Klenitskiy & Vasilev, 2023), while HSTU significantly outperforms evaluated transformers, RNNs, and self-supervised bidirectional transformers.
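+
+For readers less familiar with the metrics in Table 12, the sketch below shows how HR@K and NDCG@K are commonly computed. It assumes the usual next-item protocol with a single held-out target per user, where `ranks` holds the 1-based rank of each target; the helper names are hypothetical. The `relative_change` helper assumes the percentages in parentheses are relative to the SASRec row of the same dataset, which is consistent with the reported numbers.
+
```python
import math
from typing import Sequence


def hit_rate_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of users whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks: Sequence[int], k: int) -> float:
    """With one relevant item per user, the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


def relative_change(value: float, baseline: float) -> float:
    """Relative difference of a metric against a baseline."""
    return (value - baseline) / baseline
```
+
+For example, `relative_change(0.3097, 0.2853)` is roughly 0.086, matching the +8.6% reported for HSTU on ML-1M HR@10.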
+
+### E. Evaluation: Traditional DLRM Baselines
+
+The DLRM baseline configurations used in Section 4 reflect continued iterations of hundreds of researchers and engineers over multiple years and a close approximation of production configurations on a large internet platform with billions of daily active users before HSTUs/GRs were deployed. We give a high level description of the models used below.
+
+Ranking Setting. The baseline ranking model, as described in (Mudigere et al., 2022), employs approximately one thousand dense features and fifty sparse features. We incorporated various modeling techniques such as Mixture of Experts (Ma et al., 2018), variants of Deep & Cross Network (Wang et al., 2021), various sequential recommendation modules including target-aware pairwise attention (one commonly used variant in industrial settings can be found in (Zhou et al., 2018)), and residual connection over special interaction layers (He et al., 2015; Zhang et al., 2022). For the low FLOPs regime in the scaling law section (Section 4.3.1), some modules with high computational costs were simplified and/or replaced with other state-of-the-art variants like DCNs to achieve desired FLOPs.
+
+While we cannot disclose the exact settings due to confidentiality considerations, to the best of our knowledge, our baseline represents one of the best known DLRM approaches when recent research is fully incorporated. To validate this claim and to facilitate readers' understanding, we report a typical setup based on identical features but only utilizing major published results including DIN (Zhou et al., 2018), DCN (Wang et al., 2021), and MMoE (Ma et al., 2018) ("DLRM (DIN+DCN)") in Table 7, with the combined architecture illustrated in Figure 9. This setup significantly underperformed our production DLRM setup by 0.71% in NE for the main E-Task and 0.57% in NE for the main C-Task (where 0.1% NE is significant).
+
+Retrieval Setting. The baseline retrieval model employs a standard two-tower neural retrieval setting (Covington et al., 2016) with mixed in-batch and out-of-batch sampling. The input feature set consists of both high cardinality sparse features (e.g., item ids, user ids) and low cardinality sparse features (e.g., languages, topics, interest entities). A stack of feed forward layers with residual connections (He et al., 2015) is used to compress the input features into user and item embeddings.
+
+Features and Sequence Length. The features used in both of the DLRM baselines, including main user interaction history that is utilized by various sequential encoder/pairwise attention modules, are strict supersets of the features used in all GR candidates. This applies to all studies conducted in this paper, including those used in the scaling studies (Section 4.3.1).
+
Figure 2. Comparison of features and training procedures: DLRMs vs GRs. $ E, F, G, H $ denote categorical features. $ \Phi_i $ represents the $ i $-th item in the merged main time series. $ \Psi_k(t_j) $ denotes training example $ k $ emitted at time $ t_j $. Full notations can be found in Appendix A.
@@ -74,7 +338,173 @@ Ranking. Ranking tasks in GRs pose unique challenges as industrial recommendatio
Industrial recommenders are commonly trained in a streaming setup, where each example is processed sequentially as they become available. In this setup, the total computational requirement for self-attention based sequential transduction architectures, such as Transformers (Vaswani et al., 2017), scales as $ \sum_{i} n_i (n_i^2 d + n_i d_{ff} d) $, where $ n_i $ is the number of tokens of user $ i $, and $ d $ is the embedding dimension. The first part in the parentheses comes from self-attention, with assumed $ O(n^2) $ scaling factor due to most subquadratic algorithms involving quality tradeoffs and underperforming quadratic algorithms in wall-clock time (Dao et al., 2022). The second part comes from pointwise MLP layers, with hidden layers of size $ O(d_{ff}) = O(d) $. Taking $ N = \max_i n_i $, the overall time complexity reduces to $ O(N^3 d + N^2 d^2) $, which is cost prohibitive for recommendation settings.
-To tackle the challenge of training sequential transduc-tion models over long sequences in a scalable manner, we move from traditional impression-level training to generative training, reducing the computational complexity by an $ O(N) $ factor, as shown at the top of Figure 2. By doing so, encoder costs are amortized across multiple targets. More specifically, when we sample the $ i $-th user at rate $ s_u(n_i) $, the total training cost now scales as $ \sum_i s_u(n_i) n_i (n_i^2 d + n_i d^2) $, which is reduced to $ O(N^2 d + N d^2) $ by setting $ s_u(n_i) $ to $ 1/n_i $. One way to implement this sampling in industrial-scale systems is to emit training examples at the end of a user's request or session, resulting in $ \hat{s}_u(n_i) \propto 1/n_i $.
+To tackle the challenge of training sequential transduc-
+
+
+
+Figure 9. A high level architecture of a baseline DLRM ranking model ("DLRM (DIN+DCN)" in Table 7) that utilizes major published work including DIN (Zhou et al., 2018), DCN (Wang et al., 2021), and MMoE (Ma et al., 2018).
+
+
+
+| Metric Name | Greedy selection | Weighted selection | Random selection |
| Main Engagement Metric (NE) | 0.495 | 0.494 | 0.495 |
| Main Consumption Metric (NE) | 0.792 | 0.789 | 0.791 |
+
+Table 13. Comparison of subsequence selection methods for Stochastic Length on model quality, measured by Normalized Entropy (NE).
+
+
+### F. Stochastic Length
+
+#### F.1. Subsequence Selection
+
+In Equation (4), we select a subsequence of length L from the full user history in order to increase sparsity. Our empirical results indicate that careful design of the subsequence selection technique can improve model quality. We compute a metric $ f_{i}=t_{n}-t_{i} $ which corresponds to the amount of time elapsed since the user interacted with item $ x_{i} $. We conduct offline experiments with the following subsequence selection methods:
+
+• Greedy Selection – Selects L items with smallest values of $ f_{i} $ from S
+
+• Random Selection – Selects L items from S randomly
+
+• Feature-Weighted Selection – Selects L items from S according to a weighted distribution $ 1 - f_{n,i}/(\sum_{j=1}^{L}f_{j,i}) $
+
+During our offline experiments, the feature-weighted subsequence selection method resulted in the best model quality, as shown in Table 13.
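+
+The three selection strategies can be sketched as below. This is an illustrative reading, not the production implementation: `f` holds the elapsed times $ f_i $, selected indices are returned in their original order, and the feature-weighted variant assumes the weight of item i is $ 1 - f_i / \sum_j f_j $, sampled without replacement. The small epsilon is an addition of this sketch so that the oldest item keeps a non-zero weight.
+
```python
import random
from typing import Dict, List, Sequence


def greedy_select(f: Sequence[float], L: int) -> List[int]:
    """Keep the L most recent items, i.e., those with the smallest f_i."""
    keep = sorted(range(len(f)), key=lambda i: f[i])[:L]
    return sorted(keep)


def random_select(f: Sequence[float], L: int, rng: random.Random) -> List[int]:
    """Keep L items chosen uniformly at random."""
    return sorted(rng.sample(range(len(f)), L))


def weighted_select(f: Sequence[float], L: int, rng: random.Random) -> List[int]:
    """Keep L distinct items, favouring recent ones (assumed weight 1 - f_i / sum_j f_j)."""
    total = sum(f)
    eps = 1e-9  # sketch-only safeguard against all-zero weights
    weights: Dict[int, float] = {
        i: (1.0 - f[i] / total if total > 0 else 1.0) + eps for i in range(len(f))
    }
    chosen: List[int] = []
    for _ in range(L):
        idx = list(weights)
        pick = rng.choices(idx, weights=[weights[i] for i in idx])[0]
        chosen.append(pick)
        del weights[pick]  # sample without replacement
    return sorted(chosen)
```
+
+Greedy selection is deterministic, whereas the weighted variant still lets older interactions through occasionally, which is one plausible reason it performs slightly better in Table 13.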
+
+#### F.2. Impact of Stochastic Length on Sequence Sparsity
+
+In Table 3, we show the impact of Stochastic Length on sequence sparsity for a representative industry-scale configuration with a 30-day user engagement history. Sequence sparsity is defined as one minus the ratio of the average sequence length over all samples to the maximum sequence length. To better characterize the computational cost of sparse attention, we also report $ s_{2} $, the sparsity of the attention matrix (i.e., one minus the fraction of attention entries that are computed). For reference, we present the results for 60-day and 90-day user engagement history in Table 14 and Table 15, respectively.
+
+| Alpha | Sparsity (max length 1,024) | s2 (max length 1,024) | Sparsity (max length 2,048) | s2 (max length 2,048) | Sparsity (max length 4,096) | s2 (max length 4,096) | Sparsity (max length 8,192) | s2 (max length 8,192) |
| 1.6 | 71.5% | 89.4% | 75.8% | 92.3% | 79.4% | 94.7% | 83.8% | 97.3% |
| 1.7 | 57.3% | 77.6% | 60.6% | 79.8% | 67.3% | 86.6% | 74.5% | 93.3% |
| 1.8 | 37.5% | 56.2% | 42.6% | 62.1% | 51.9% | 74.2% | 62.6% | 85.5% |
| 1.9 | 15.0% | 25.2% | 17.7% | 29.0% | 29.6% | 47.5% | 57.8% | 80.9% |
| 2.0 | 1.2% | 1.7% | 2.5% | 3.5% | 18.9% | 30.8% | 57.6% | 80.6% |
+
+Table 14. Impact of Stochastic Length (SL) on sequence sparsity, over a 60d user engagement history.
+
+
+
+| Alpha | Sparsity (max length 1,024) | s2 (max length 1,024) | Sparsity (max length 2,048) | s2 (max length 2,048) | Sparsity (max length 4,096) | s2 (max length 4,096) | Sparsity (max length 8,192) | s2 (max length 8,192) |
| 1.6 | 68.0% | 85.0% | 74.6% | 90.8% | 78.6% | 93.5% | 83.5% | 97.3% |
| 1.7 | 56.3% | 76.1% | 61.2% | 80.6% | 67.5% | 87.0% | 74.3% | 93.3% |
| 1.8 | 38.9% | 58.3% | 42.0% | 61.3% | 50.4% | 72.4% | 61.0% | 84.4% |
| 1.9 | 16.2% | 27.3% | 17.3% | 28.6% | 27.2% | 44.4% | 54.3% | 77.8% |
| 2.0 | 0.9% | 1.2% | 1.6% | 2.1% | 13.5% | 22.5% | 54.0% | 77.4% |
+
+Table 15. Impact of Stochastic Length (SL) on sequence sparsity, over a 90d user engagement history.
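+
+The two sparsity columns in Tables 14 and 15 can be related with a short calculation. The sketch below follows the definition of sequence sparsity given in F.2; the formula for $ s_2 $ is an assumption of this sketch (each sample of length n is taken to use an n × n block of a full N × N attention matrix), chosen because it is consistent with $ s_2 $ always exceeding the sequence sparsity in the tables.
+
```python
from typing import Sequence


def sequence_sparsity(lengths: Sequence[int], max_len: int) -> float:
    """One minus (average sequence length / maximum sequence length)."""
    return 1.0 - (sum(lengths) / len(lengths)) / max_len


def attention_sparsity(lengths: Sequence[int], max_len: int) -> float:
    """Assumed s2: one minus the average fraction of the max_len x max_len matrix in use."""
    used = sum(n * n for n in lengths) / len(lengths)
    return 1.0 - used / (max_len * max_len)
```
+
+Because attention cost grows quadratically in sequence length, shortening sequences removes a larger share of attention entries than of tokens, so `attention_sparsity` is never smaller than `sequence_sparsity` for the same batch.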
+
+
+#### F.3. Comparisons Against Sequence Length Extrapolation Techniques
+
+We conduct additional studies to verify that Stochastic Length is competitive against existing techniques for sequence length extrapolation used in language modeling. Many existing methods perform sequence length extrapolation through modifications of RoPE (Su et al., 2023). To compare against these methods, we train an HSTU variant (HSTU-RoPE) that uses rotary embeddings and no relative attention bias.
+
+We evaluate the following sequence length extrapolation methods on HSTU-RoPE:
+
+• Zero-Shot - Apply NTK-Aware RoPE (Peng et al., 2024) before directly evaluating the model with no finetuning;
+
+• Fine-tune - Finetune the model for 1000 steps after applying NTK-by-parts (Peng et al., 2024).
+
+We evaluate the following sequence length extrapolation methods on HSTU (includes relative attention bias, no rotary embeddings):
+
+• Zero-Shot - Clamp the relative position bias according to the maximum training sequence length, then directly evaluate the model (Raffel et al., 2020; Press et al., 2022);
+
+• Fine-tune - Clamp the relative position bias according to the maximum training sequence length, then fine-tune the model for 1000 steps before evaluating it.
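+
+Clamping the relative position bias can be illustrated with a small lookup. This is a sketch of the general idea only: it covers positional distances, ignores the temporal component of $ \text{rab}^{p,t} $, and the table layout and function name are hypothetical. Distances beyond those seen in training reuse the bias learned for the largest trained distance.
+
```python
from typing import List


def clamped_bias(bias: List[float], i: int, j: int, max_train_len: int) -> float:
    """Relative bias for query position i and key position j.

    `bias` is a hypothetical table with one entry per relative distance in
    [-(max_train_len - 1), max_train_len - 1], stored in that order.
    """
    bound = max_train_len - 1
    rel = max(-bound, min(bound, j - i))  # clamp unseen distances to the trained range
    return bias[rel + bound]
```
+
+At evaluation time this lets a model trained on length-1024 sequences process longer inputs without indexing outside the learned table, at the cost of treating all distant positions alike.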
+
+Figure 10. Impact of Stochastic Length (SL) on ranking model metrics. Left to right: n = [1024, 2048, 4096, 8192] (n is after interleaving algorithm as discussed in Section 2.2 to enable target-aware cross attention in causal-masked settings).
+
+
+
+
+| Evaluation Strategy | Model Type | Average NE Difference vs Full Sequence Baseline (2048 / 52% Sparsity) | Average NE Difference vs Full Sequence Baseline (4096 / 75% Sparsity) |
| Zero-shot | HSTU (Raffel et al., 2020) | 6.46% | 10.35% |
| Zero-shot | HSTU-RoPE (Peng et al., 2024) | 7.51% | 11.27% |
| Fine-tune | HSTU (Raffel et al., 2020) | 1.92% | 2.21% |
| Fine-tune | HSTU-RoPE (Peng et al., 2024) | 1.61% | 2.19% |
| Stochastic Length (SL) | HSTU | 0.098% | 0.64% |
+
+Table 16. Comparisons of Stochastic Length (SL) vs existing Length Extrapolation methods.
+
+
+In Table 16, we report the NE difference between models with induced data sparsity during training (Stochastic Length, zero-shot, fine-tuning) and models trained on the full data. We define the sparsity for zero-shot and fine-tuning techniques to be one minus the ratio of the average sequence length during training to the max sequence length during evaluation. All zero-shot and fine-tuned models are trained on 1024 sequence length data and are evaluated against 2048 and 4096 sequence length data. In order to find an appropriate Stochastic Length baseline for these techniques, we select Stochastic Length settings which result in the same data sparsity metrics.
+
+We believe that zero-shot and fine-tuning approaches to sequence length extrapolation are not well-suited for recommendation scenarios that deal with high cardinality ids. Empirically, we observe that Stochastic Length significantly outperforms fine-tuning and zero-shot approaches. We believe that this could be due to our large vocabulary size. Zero-shot and fine-tuning approaches fail to learn good representations for older ids, which could hurt their ability to fully leverage the information contained in longer sequences.
+
+### G. Sparse Grouped GEMMs and Fused Relative Attention Bias
+
+We provide additional information about the efficient HSTU attention kernel that was introduced in Section 3.2. Our approach builds upon Memory-efficient Attention (Rabe & Staats, 2021) and FlashAttention (Dao et al., 2022), and is a memory-efficient self-attention mechanism that divides the input into blocks and avoids materializing the large $ h \times N \times N $ intermediate attention tensors for the backward pass. By exploiting the sparsity of input sequences, we can reformulate the attention computation as a group of back-to-back GEMMs with different shapes. We implement efficient GPU kernels to accelerate this computation. The construction of the relative attention bias is also a bottleneck due to memory accesses. To address this issue, we have fused the relative bias construction and the grouped GEMMs into a single GPU kernel and managed to accumulate gradients using GPU's fast shared memory in the backward pass. Although our algorithm requires recomputing attention and relative bias in the backward pass, it is significantly faster and uses less memory than the standard approach used in Transformers.
+
+### H. Microbatched-Fast Attention Leveraging Cacheable OperatioNs (M-FALCON)
+
+In this section, we provide a detailed description of the M-FALCON algorithm discussed in Section 3.4. We give pseudocode for M-FALCON in Algorithm 1. M-FALCON introduces three key ideas.
+
+
+
+
+(a) GR's ranking model training (with $ n = 2n_{c} $ tokens), in causal autoregressive settings.
+
+
+
+
+
+(b) GR's ranking model inference utilizing the M-FALCON algorithm.
+
+
+Figure 11. Illustration of the M-FALCON algorithm. Top: model training in GR’s target-aware formulation. Bottom: model inference with $m$ candidates $\Phi'_0, \ldots, \Phi'_{m-1}$, divided into $\lceil m/b_m \rceil$ microbatches, where we show model inference for the first microbatch $\Phi'_0, \ldots, \Phi'_{b_m-1}$ (with $2n_c + b_m$ total tokens after $\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}$ are taken into account) above the dotted line. Note that the self-attention algorithm is modified such that $\Phi'_i$ cannot attend to $\Phi'_j$ when $i \neq j$; this is highlighted with “×” in the figure.
+
+
+Batched inference can be applied to causal autoregressive settings. The ranking task in GR is formulated in a target-aware fashion, as discussed in Section 2.2. Common wisdom suggests that in a target-aware setting, we need to perform inference for one item at a time, at a cost of $ O(mn^2d) $ for $m$ candidates and a sequence length of $n$. Here we show that this is not the optimal solution; even with vanilla Transformers, we can modify the attention mask used in self-attention to batch such operations (“batched inference”) and reduce the cost to $ O((n+m)^2d) $, which is $ O(n^2d) $ when $ m = O(n) $.
+
+An illustration is provided in Figure 11. Here, both Figure 11 (a) and (b) involve an attention mask matrix for causal autoregressive settings. The key difference is that Figure 11 (a) uses a standard lower triangular matrix of size $ 2n_{c} $ for causal autoregressive training, whereas Figure 11 (b) modifies a lower triangular matrix of size $ 2n_c + b_m $ by setting entries for $ (i, j) $s where $ i, j \geq 2n_c $, $ i \neq j $ to False or $ -\infty $ to prevent target positions $ \Phi'_0, \ldots, \Phi'_{b_m-1} $ from attending to each other. It is easy to see that by doing so, the output of the self-attention block for $ \Phi'_i $, $ a'_i $, depends only on $ \Phi'_i $ itself and on $ \Phi_0 $, $ a_0 $, $ \ldots $, $ \Phi_{n_c-1} $, $ a_{n_c-1} $, but not on $ \Phi'_j $ ( $ i \neq j $). In other words, by making a forward pass over $ (2n_c + b_m) $ tokens using the modified attention mask, we obtain the same results for the last $ b_m $ tokens as if we had made $ b_m $ separate forward passes over $ (2n_c + 1) $ tokens, with $ \Phi'_i $ placed at the $ 2n_c $-th (0-based) position during the $ i $-th forward pass utilizing a standard causal attention mask.
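
The equivalence described above can be checked numerically with a small sketch. It is not the HSTU kernel: it assumes a toy single-head softmax attention with identity Q/K/V projections, and the names `causal_mask`, `batched_target_mask`, and `attention` are hypothetical helpers introduced only for illustration.

```python
import math

NEG_INF = float("-inf")

def causal_mask(size):
    """Additive causal mask: 0.0 where j <= i, -inf elsewhere."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)] for i in range(size)]

def batched_target_mask(n, b_m):
    """Causal mask over n history tokens plus b_m candidates, where
    candidates (positions >= n) may not attend to one another."""
    mask = causal_mask(n + b_m)
    for i in range(n, n + b_m):
        for j in range(n, n + b_m):
            if i != j:
                mask[i][j] = NEG_INF
    return mask

def attention(x, mask):
    """Toy single-head softmax attention with identity Q/K/V (an assumption)."""
    outputs = []
    for i, query in enumerate(x):
        scores = [sum(a * b for a, b in zip(query, key)) + mask[i][j]
                  for j, key in enumerate(x)]
        top = max(scores)  # the diagonal is never masked, so this is finite
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * value[c] for w, value in zip(weights, x)) / total
                        for c in range(len(query))])
    return outputs

history = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.7], [0.2, 0.2]]  # n = 4 tokens
candidates = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.6]]            # b_m = 3 targets
n, b_m = len(history), len(candidates)

# One forward pass over n + b_m tokens with the modified mask.
batched = attention(history + candidates, batched_target_mask(n, b_m))[n:]

# b_m separate forward passes over n + 1 tokens with a standard causal mask.
separate = [attention(history + [c], causal_mask(n + 1))[n] for c in candidates]

matches = all(math.isclose(a, b, abs_tol=1e-12)
              for row_a, row_b in zip(batched, separate)
              for a, b in zip(row_a, row_b))
print("batched == separate:", matches)
```

Masked entries contribute a weight of exactly zero after the softmax, which is why each candidate row in the batched pass reduces to the same sum as its standalone pass.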
+
+Microbatching scales batched inference to large candidate sets. The ranking stage may need to handle a large number of candidates, up to tens of thousands (Wang et al., 2020). We can divide the overall $m$ candidates into $\lceil m/b_m \rceil$ microbatches of size $b_m$ such that $O(b_m) = O(n)$, which retains the $O((n + b_m)^2 d) = O(n^2 d)$ running time per microbatch for most practical recommender settings, up to tens of thousands of candidates.
+
+Encoder-level caching enables compute sharing within and across requests. Finally, KV caching (Pope et al., 2022) can be applied both within and across requests. For instance, for the HSTU model presented in this work (Section 3), $ K(X) $ and $ V(X) $ are fully cacheable across microbatches within and/or across requests. For a cached forward pass, we only need to compute $ U(X) $, $ Q(X) $, $ K(X) $, and $ V(X) $ for the last $ b_m $ tokens, while reusing cached $ K(X) $ and $ V(X) $ for the sequentialized user history containing $n$ tokens. $ f_2(\text{Norm}(A(X)V(X)) \odot U(X)) $ similarly only needs to be recomputed for the $ b_m $ candidates. This reduces the cached forward pass's computational complexity to $ O(b_m d^2 + b_m nd) $, which improves upon $ O((n + b_m)d^2 + (n + b_m)^2 d) $ by a factor of 2-4 even when $ b_m = n $.
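
The 2-4x figure can be sanity-checked by plugging numbers into the two complexity expressions. The sketch below treats the big-O terms as exact operation counts with all constants set to 1, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def uncached_cost(n, b_m, d):
    """(n + b_m) d^2 + (n + b_m)^2 d, with constants dropped (assumption)."""
    tokens = n + b_m
    return tokens * d**2 + tokens**2 * d

def cached_cost(n, b_m, d):
    """b_m d^2 + b_m n d, with constants dropped (assumption)."""
    return b_m * d**2 + b_m * n * d

def speedup(n, b_m, d):
    return uncached_cost(n, b_m, d) / cached_cost(n, b_m, d)

# With b_m = n the ratio simplifies to (2d + 4n) / (d + n), which lies
# strictly between 2 (d much larger than n) and 4 (n much larger than d).
for n, d in [(64, 4096), (1024, 512), (8192, 256)]:
    print(n, d, round(speedup(n, n, d), 3))
```

Under these simplified counts, the linear (`d^2`) terms account for the factor of 2 and the attention (`n d`) terms for the factor of 4, so longer histories move the ratio toward the upper end.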
+
+Algorithm 1 M-FALCON Algorithm.
+
+1: Input: Merged token series $ x_0, x_1, \ldots, x_{n-1} $ (can be e.g., $ (\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}) $ where $ n = 2n_c $); m ranking candidates $ \Phi'_0, \ldots, \Phi'_{m-1} $; a b-layer, h-head self-attention model trained in causal autoregressive settings (e.g., HSTU or Transformers) $ f(X, cacheStates, attnMask) \to (X', updatedCacheStates) $ where $ X, X' \in \mathbb{R}^{N \times d} $, attnMask $ \in \mathbb{R}^{N \times N} $, and cacheStates, updatedCacheStates $ \in \mathbb{R}^{b \times h \times N \times d_{qk}} \times \mathbb{R}^{b \times h \times N \times d_{qk}} $ (due to caching $ K(X) $s and $ V(X) $s across b layers); microbatch size $ b_m $, where we assume m is a multiple of $ b_m $ for simplicity.
+
+2: Output: Predictions for all m ranking candidates, $ (a'_0, \ldots, a'_{m-1}) $.
+
+3: numMicrobatches = $ (m + b_m - 1) // b_m $
+
+4: attnMask = $ L_{n+b_m} $ \{ $ L_{n+b_m} $ represents a lower triangular matrix. Lower triangular entries are 0s, the rest are $ -\infty $.\}
+
+5: attnMask[i, j] = $ -\infty $ for i, j $ \geq n $, i $ \neq $ j \{This prevents the last $ b_m $ entries from attending to each other.\}
+
+6: $ (a'_0, a'_1, \ldots, a'_{b_m-1}) $, $ kvCache \leftarrow f(embLayer((x_0, x_1, \ldots, x_{n-1}, \Phi'_0, \ldots, \Phi'_{b_m-1})), \varnothing, attnMask) $
+
+7: predictions = $ (a'_0, a'_1, \ldots, a'_{b_m-1}) $
+
+8: i = 1
+
+9: while i < numMicrobatches do
+
+10: $ (a'_{b_m i}, a'_{b_m i+1}, \ldots, a'_{b_m(i+1)-1}), \_ \leftarrow f(embLayer((x_0, x_1, \ldots, x_{n-1}, \Phi'_{b_m i}, \ldots, \Phi'_{b_m(i+1)-1})), kvCache, attnMask) $
+
+11: predictions $ \leftarrow $ predictions + $ (a'_{b_m i}, a'_{b_m i+1}, \ldots, a'_{b_m(i+1)-1}) $
+
+12: i $ \leftarrow $ i + 1
+
+13: end while
+
+14: return predictions
+
+Algorithm 1 is illustrated in Figure 11 to help with understanding. We remark that M-FALCON is not only applicable to HSTUs and GRs, but also broadly applicable as an inference optimization algorithm for other target-aware causal autoregressive settings based on self-attention architectures.
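
The control flow of Algorithm 1 (populate the cache once, then reuse it for each microbatch) can be sketched as follows. This is a minimal illustration, not the production implementation: `toy_model` is a hypothetical stand-in for $ f(X, cacheStates, attnMask) $ that summarizes the history with a plain sum instead of self-attention, and it scores candidates independently by construction, so the attention mask is omitted. The first microbatch is folded into the loop by starting with an empty cache.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Cache = Dict[str, Vector]

def toy_model(history: List[Vector], candidates: List[Vector],
              cache: Optional[Cache]) -> Tuple[List[float], Cache]:
    """Hypothetical stand-in for f(X, cacheStates, attnMask).

    Encodes the history only when no cache is supplied, then scores each
    candidate against the cached history summary.
    """
    if cache is None:
        dim = len(history[0])
        cache = {"history_summary": [sum(tok[c] for tok in history)
                                     for c in range(dim)]}
    summary = cache["history_summary"]
    scores = [sum(a * b for a, b in zip(cand, summary)) for cand in candidates]
    return scores, cache

def m_falcon(history: List[Vector], candidates: List[Vector],
             b_m: int) -> List[float]:
    """Score all candidates in microbatches of size b_m, sharing one cache."""
    num_microbatches = (len(candidates) + b_m - 1) // b_m
    predictions: List[float] = []
    cache: Optional[Cache] = None  # filled by the first microbatch
    for i in range(num_microbatches):
        batch = candidates[b_m * i: b_m * (i + 1)]
        scores, cache = toy_model(history, batch, cache)
        predictions.extend(scores)
    return predictions

history = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.7]]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.6], [0.2, 0.2], [-1.0, 1.0]]

print(m_falcon(history, candidates, b_m=2) == m_falcon(history, candidates, b_m=5))
```

Because every candidate is scored against the same cached history state, the predictions do not depend on the chosen microbatch size; only the amount of work per forward pass changes.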
+
+#### H.1. Evaluation of Inference Throughput: Generative Recommenders (GRs) w/ M-FALCON vs DLRMs
+
+As discussed in Section 3.4, M-FALCON handles $ b_{m} $ candidates in parallel to amortize computation costs across all m candidates at inference time. To understand our design, we compare the throughput (i.e., the number of candidates scored per second, QPS) of GRs and DLRMs based on the same hardware setups.
+
+As shown in Figure 12 and Figure 13, GRs' throughput scales in a sublinear way based on the number of ranking-stage candidates (m), up to a certain region (m = 2048 in our case study), due to batched inference enabling cost amortization. This confirms the criticality of batched inference in causal autoregressive settings. Due to attention complexity scaling as $ O((n + b_m)^2) $, leveraging multiple microbatches by itself improves throughput. Caching further eliminates redundant linear and attention computations on top of microbatching. The two combined resulted in up to 1.99x additional speedups relative to the $ b_{m} = m = 1024 $ baseline using a single microbatch, as shown in Figure 13. Overall, with the efficient HSTU encoder design and utilizing M-FALCON, HSTU-based Generative Recommenders outperform DLRMs in terms of throughput on a large-scale production setup by up to 2.99x, despite GRs being 285x more complex in terms of FLOPs.
+
+
+
+
+Figure 12. End-to-end inference throughput: DLRMs vs GRs (w/ M-FALCON) in large-scale industrial settings. Note that this figure is the same as Figure 6, and is reproduced here to facilitate reading.
+
+
+
+
+
+Figure 13. End-to-end inference throughput: M-FALCON throughput scaling, on top of the 285x FLOPs GR model, in large batch settings where m (total number of ranking candidates) ranges from 1024 to 16384, and $ b_{m} = 1024 $.
+
+tion models over long sequences in a scalable manner, we move from traditional impression-level training to generative training, reducing the computational complexity by an $ O(N) $ factor, as shown at the top of Figure 2. By doing so, encoder costs are amortized across multiple targets. More specifically, when we sample the $ i $-th user at rate $ s_u(n_i) $, the total training cost now scales as $ \sum_i s_u(n_i) n_i (n_i^2 d + n_i d^2) $, which is reduced to $ O(N^2 d + N d^2) $ by setting $ s_u(n_i) $ to $ 1/n_i $. One way to implement this sampling in industrial-scale systems is to emit training examples at the end of a user's request or session, resulting in $ \hat{s}_u(n_i) \propto 1/n_i $.
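
The effect of the sampling rate on total training cost can be illustrated numerically. The sketch below evaluates $ \sum_i s_u(n_i) n_i (n_i^2 d + n_i d^2) $ with constants dropped. Treating $ s_u(n_i) = 1 $ as the impression-level baseline is an assumption made here for comparison, and the sequence lengths are made-up values.

```python
def encoder_cost(n_i, d):
    """Leading-order encoder cost for one sequence of n_i tokens: n_i^2 d + n_i d^2."""
    return n_i**2 * d + n_i * d**2

def total_training_cost(lengths, d, sampling_rate):
    """sum_i s_u(n_i) * n_i * (n_i^2 d + n_i d^2), constants dropped (assumption)."""
    return sum(sampling_rate(n) * n * encoder_cost(n, d) for n in lengths)

lengths = [128, 1024, 8192]  # hypothetical per-user sequence lengths n_i
d = 512

# Assumed baseline: every position of every user is trained on (s_u = 1).
unsampled = total_training_cost(lengths, d, lambda n: 1.0)

# Generative training with s_u(n_i) = 1 / n_i: one encoder pass per user.
sampled = total_training_cost(lengths, d, lambda n: 1.0 / n)

print(f"cost ratio: {unsampled / sampled:.1f}x")
```

With $ s_u(n_i) = 1/n_i $ the per-user factor $ n_i $ cancels, so the total reduces to one encoder pass per user, and the saving is dominated by the longest sequences.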
### 3. A High Performance Self-Attention Encoder for Generative Recommendations
@@ -92,7 +522,7 @@ HSTU encoder design allows for the replacement of heterogeneous modules in DLRMs
Feature Interaction is the most critical part of DLRMs. Common approaches used include factorization machines and their neural network variants (Rendle, 2010; Guo et al., 2017; Xiao et al., 2017), higher order feature interactions (Wang et al., 2021), etc. HSTU replaces feature interactions by enabling attention pooled features to directly “interact” with other features via Norm $ (A(X)V(X)) \odot U(X) $.
-
+
Figure 3. Comparison of key model components: DLRMs vs GRs. The complete DLRM setup (Mudigere et al., 2022) is shown on the left side and a simplified HSTU is shown on the right.
@@ -106,7 +536,8 @@ Transformations of Representations is commonly done with Mixture of Experts (MoE
HSTU adopts a new pointwise aggregated (normalized) attention mechanism (in contrast, softmax attention computes normalization factor over the entire sequence). This is motivated by two factors. First, the number of prior data points related to target serves as a strong feature indicating the intensity of user preferences, which is hard to capture after softmax normalization. This is critical as we need to predict both the intensity of engagements, e.g., time spent on a given item, and the relative ordering of the items, e.g., predicting an ordering to maximize AUC. Second, while softmax activation is robust to noise by construction, it is less suited for non-stationary vocabularies in streaming settings.
-The proposed pointwise aggregated attention mechanism is depicted in Equation (2). Importantly, layer norm is needed after pointwise pooling to stabilize training. One way toActions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
+The proposed pointwise aggregated attention mechanism is depicted in Equation (2). Importantly, layer norm is needed after pointwise pooling to stabilize training. One way to
+Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
@@ -149,7 +580,8 @@ Compared to Transformers, HSTU employs a simplified and fully fused design that
For comparison, Transformers use a feedforward layer and dropout after attention (intermediate state of $ 3hd_v $), followed by a pointwise feedforward block consisting of layer norm, linear, activation, linear, and dropout, with intermediate states of $ 2d + 4d_{ff} + 2d + 1d = 4d + 4d_{ff} $. Here, we make standard assumptions that $ hd_v \geq d $ and that $ d_{ff} = 4d $ (Vaswani et al., 2017; Brown et al., 2020). Thus, after accounting for input and input layer norm (4d) and qkv projections, the total activation states is 33d. HSTU's design hence enables scaling to > 2x deeper layers.
-Additionally, large scale atomic ids used to represent vocabularies also require significant memory usage. With a 10b vocabulary, 512d embeddings, and Adam optimizer, storing embeddings and optimizer states in fp32 already requires 60TB memory. To alleviate memory pressure, we employ rowwise AdamW optimizers (Gupta et al., 2014; Khudiaet al., 2021) and place optimizer states on DRAM, which reduces HBM usage per float from 12 bytes to 2 bytes.
+Additionally, large scale atomic ids used to represent vocabularies also require significant memory usage. With a 10b vocabulary, 512d embeddings, and Adam optimizer, storing embeddings and optimizer states in fp32 already requires 60TB memory. To alleviate memory pressure, we employ rowwise AdamW optimizers (Gupta et al., 2014; Khudia
+et al., 2021) and place optimizer states on DRAM, which reduces HBM usage per float from 12 bytes to 2 bytes.
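
The memory figures above follow from simple arithmetic, sketched below. The 12 bytes per float are assumed to be an fp32 parameter plus the two fp32 Adam moments; the code uses decimal terabytes, and the variable names are illustrative only.

```python
VOCAB_SIZE = 10_000_000_000  # "10b vocabulary"
EMBEDDING_DIM = 512
FP32_BYTES = 4

# Assumption: Adam keeps the parameter plus first and second moments in fp32.
adam_bytes_per_float = 3 * FP32_BYTES  # 12 bytes
num_floats = VOCAB_SIZE * EMBEDDING_DIM

full_adam_tb = num_floats * adam_bytes_per_float / 1e12
hbm_after_tb = num_floats * 2 / 1e12  # 2 bytes per float kept on HBM

print(f"fp32 Adam: {full_adam_tb:.2f} TB, HBM at 2 bytes/float: {hbm_after_tb:.2f} TB")
```

The first figure comes out slightly above the quoted 60TB, which is consistent with the text rounding the estimate.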
#### 3.4. Scaling up inference via cost-amortization
@@ -183,7 +615,8 @@ We show results in Table 5. First, HSTU significantly outperforms Transformers,
#### 4.2. Encoder Efficiency
-Stochastic Length. Figure 4 and Figure 5 (a) show the impact of stochastic length (SL) on model metrics. At $ \alpha = 1.6 $, a sequence of length 4096 is turned into a sequence of length 776 the majority of the time, or removing more than 80% of the tokens. Even after sparsity ratio increases to 64%–84%, the NEs we obtained for main tasks did not degrade byActions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
+Stochastic Length. Figure 4 and Figure 5 (a) show the impact of stochastic length (SL) on model metrics. At $ \alpha = 1.6 $, a sequence of length 4096 is turned into a sequence of length 776 the majority of the time, or removing more than 80% of the tokens. Even after sparsity ratio increases to 64%–84%, the NEs we obtained for main tasks did not degrade by
+Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
Table 4. Evaluations of methods on public datasets in multi-pass, full-shuffle settings.
@@ -198,10 +631,10 @@ Stochastic Length. Figure 4 and Figure 5 (a) show the impact of stochastic lengt
| Architecture | Retrieval log pplx. | Ranking (NE) |
| E-Task | C-Task |
| Transformers | 4.069 | NaN | NaN |
| HSTU ( $ -rab^{{p,t}} $, Softmax) | 4.024 | .5067 | .7931 |
| HSTU ( $ -rab^{{p,t}} $) | 4.021 | .4980 | .7860 |
| Transformer++ | 4.015 | .4945 | .7822 |
| HSTU (original rab) | 4.029 | .4941 | .7817 |
| HSTU | 3.978 | .4937 | .7805 |
-
+
-
+
Figure 4. Impact of Stochastic Length (SL) on metrics. Left: n = 4096. Right: n = 8192. Full results can be found in Appendix F.
@@ -231,19 +664,20 @@ Lastly, we compare the end-to-end performance of GRs against state-of-the-art DL
As discussed in Section 2, GRs build upon raw categorical engagement features, while DLRMs are typically trained with a significantly larger number of features, the majority of which are handcrafted from raw signals. If we give the same set of features used in GRs to DLRMs (“DLRM (abl. features)”), the performance of DLRMs is significantly degraded, which suggests GRs can meaningfully capture those features via their architecture and unified feature space.
-We further validate the GR formulation in Section 2.2 by comparing it with a traditional sequential recommender setup that only considers items user interacted with (Kang
+We further validate the GR formulation in Section 2.2 by comparing it with a traditional sequential recommender setup that only considers items user interacted with (Kang
+
(a) Training NE.
-
+
(b) Training Speedup.
-
+
(c) Inference Speedup.
@@ -262,7 +696,7 @@ We finally compare the efficiency of GRs with our production DLRMs in Figure 6.
It is commonly known that in large-scale industrial settings, DLRMs saturate in quality at certain compute and params regimes (Zhao et al., 2023). We compare the scalability of GRs and DLRMs to better understand this phenomenon.
-
+
Figure 6. Comparison of inference throughput, in the most challenging ranking setup. Full results can be found in Appendix H.1.
@@ -272,13 +706,14 @@ Since feature interaction layers are crucial for DLRM's performance (Mudigere et
Results are shown in Figure 7. In the low compute regime, DLRMs might outperform GRs due to handcrafted features, corroborating the importance of feature engineering in traditional DLRMs. However, GRs demonstrate substantially better scalability with respect to FLOPs, whereas DLRM performance plateaus, consistent with findings in prior work. We also observe better scalability w.r.t. both embedding parameters and non-embedding parameters, with GRs leading to 1.5 trillion parameter models, whereas DLRMs performance saturates at about 200 billion parameters.
-Finally, all of our main metrics, including Hit Rate@100 and Hit Rate@500 for retrieval, and NE for ranking, empirically scale as a power law of compute used given appropriate hyperparameters. We observe this phenomenon across three orders of magnitude, up till the largest models we were able to test (8,192 sequence length, 1,024 embedding dimension, 24 layers of HSTU), at which point the total amount of compute we used (normalized over 365 days as we use a standard streaming training setting) is close to the total training compute used by GPT-3 (Brown et al., 2020) and LLaMA2 (Touvron et al., 2023b), as shown in Figure 1. Within a reasonable range, the exact model hyperparameters play less important roles compared to the total amount of training compute applied. In contrast to language modeling (Kaplan et al., 2020), sequence length play a significantly more important role in GRs, and it's important to scale up sequence
+Finally, all of our main metrics, including Hit Rate@100 and Hit Rate@500 for retrieval, and NE for ranking, empirically scale as a power law of compute used given appropriate hyperparameters. We observe this phenomenon across three orders of magnitude, up till the largest models we were able to test (8,192 sequence length, 1,024 embedding dimension, 24 layers of HSTU), at which point the total amount of compute we used (normalized over 365 days as we use a standard streaming training setting) is close to the total training compute used by GPT-3 (Brown et al., 2020) and LLaMA2 (Touvron et al., 2023b), as shown in Figure 1. Within a reasonable range, the exact model hyperparameters play less important roles compared to the total amount of training compute applied. In contrast to language modeling (Kaplan et al., 2020), sequence length play a significantly more important role in GRs, and it's important to scale up sequence
+
-
+
-
+
Figure 7. Scalability: DLRMs vs GRs in large-scale industrial settings across retrieval (top, middle) and ranking (bottom). +0.005 in HR and -0.001 in NE represent significant improvements.
@@ -300,7 +735,8 @@ Interests in large language models (LLMs) have motivated work to treat various r
We have proposed Generative Recommenders (GRs), a new paradigm that formulates ranking and retrieval as sequential transduction tasks, allowing them to be trained in a generative manner. This is made possible by the novel HSTU encoder design, which is 5.3x-15.2x faster than state-of-the-art Transformers on 8192 length sequences, and through the use of new training and inference algorithms such as M-FALCON. With GRs, we deployed models that are 285x more complex while using less inference compute. GRs and HSTU have led to 12.4% metric improvements in production and have shown superior scaling performance compared to traditional DLRMs. Our results corroborate that user actions represent an underexplored modality in generative modeling – to echo our title, “Actions speak louder than words”.
-The dramatic simplification of features in our work paves the way for the first foundation models for recommendations, search, and ads by enabling a unified feature space to be used across domains. The fully sequential setup of GRs also enables recommendation to be formulated in an end-to-end, generative setting. Both of these enable recommendation systems to better assist users holistically.## IMPACT STATEMENTS
+The dramatic simplification of features in our work paves the way for the first foundation models for recommendations, search, and ads by enabling a unified feature space to be used across domains. The fully sequential setup of GRs also enables recommendation to be formulated in an end-to-end, generative setting. Both of these enable recommendation systems to better assist users holistically.
+## IMPACT STATEMENTS
We believe that our work has broad positive implications. Reducing reliance of recommendation, search, and ads systems on the large number of heterogeneous features can make these systems much more privacy-friendly while improving user experiences. Enabling recommendation systems to attribute users' long-term outcomes to short-term decisions via fully sequential formulations could reduce the prevalence of content that do not serve users' long-term goals (including clickbaits and fake news) across the web, and better align incentives of platforms with user values. Finally, applications of foundation models and scaling law can help reduce carbon footprints incurred with model research and developments needed for recommendations, search, and related use cases.
@@ -324,414 +760,4 @@ Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anders
Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509.
-Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pp. 191–198, 2016. ISBN 9781450340359.Cui, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. M6-rec: Generative pretrained language models are open-ended recommender systems, 2022.
-
-Dallmann, A., Zoller, D., and Hotho, A. A case study on sampling strategies for evaluating neural sequential item recommendation models. In Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, pp. 505–514, 2021. ISBN 9781450384582.
-
-Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
-
-Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
-
-Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi:10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
-
-Eksombatchai, C., Jindal, P., Liu, J. Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., and Leskovec, J. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pp. 1775–1784, 2018. ISBN 9781450356398.
-
-Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. CoRR, abs/1702.03118, 2017. URL http://arxiv.org/abs/1702.03118.
-
-Gao, W., Fan, X., Wang, C., Sun, J., Jia, K., Xiao, W., Ding, R., Bin, X., Yang, H., and Liu, X. Learning an end-to-end structure for retrieval in large-scale recommendations. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, pp. 524–533, 2021. ISBN 9781450384469.
-
-Gillenwater, J., Kulesza, A., Fox, E., and Taskar, B. Expectation-maximization for learning determinantal point processes. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 3149–3157, Cambridge, MA, USA, 2014. MIT Press.
-
-Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=uYLFOz1vlAC.
-
-Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: A factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1725–1731, 2017. ISBN 9780999241103.
-
-Gupta, M. R., Bengio, S., and Weston, J. Training highly multiclass classifiers. J. Mach. Learn. Res., 15(1):1461–1492, Jan 2014. ISSN 1532-4435.
-
-He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-
-He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., and Candela, J. Q. Practical lessons from predicting clicks on ads at facebook. In ADKDD'14: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450329996.
-
-Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Session-based recommendations with recurrent neural networks. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06939.
-
-Hou, Y., Zhang, J., Lin, Z., Lu, H., Xie, R., McAuley, J., and Zhao, W. X. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval - 46th European Conference on IR Research, ECIR 2024, 2024.
-
-Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 9099–9117. PMLR, 2022. URL https://proceedings.mlr.press/v162/hua22a.html.
-
-Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Deep networks with stochastic depth, 2016.
-
-Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal.Mach. Intell. 321. ISSN 0162-8828.
-
-doi ://doi.
-
-Kang, W.-C. and McAuley, J. Self-attentive sequential recommendation. In 2018 International Conference on Data Mining (ICDM), pp. 197–206, 2018.
-
-Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
-
-Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
-
-Khudia, D., Huang, J., Basu, P., Deng, S., Liu, H., Park, J., and Smelyanskiy, M. Fbgemm: Enabling high-performance low-precision deep learning inference. arXiv preprint arXiv:2101.05615, 2021.
-
-Klenitskiy, A. and Vasilev, A. Turning dross into gold loss: is bert4rec really better than sasrec? In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, pp. 1120–1125, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702419. doi: 10.1145/3604915.3610644. URL https://doi.org/10.1145/3604915.3610644.
-
-Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., and Catanzaro, B. Reducing activation recomputation in large transformer models, 2022.
-
-Li, J., Wang, M., Li, J., Fu, J., Shen, X., Shang, J., and McAuley, J. Text is all you need: Learning language representations for sequential recommendation. In KDD, 2023.
-
-Li, C., Chang, E., Garcia-Molina, H., and Wiederhold, G. Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4):792–808, 2002.
-
-Liu, Z., Zou, L., Zou, X., Wang, C., Zhang, B., Tang, D., Zhu, B., Zhu, Y., Wu, P., Wang, K., and Cheng, Y. Monolith: Real time recommendation system with collisionless embedding table, 2022.
-
-Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., and Chi, E. H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. KDD '18, 2018.
-
-Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J., Luo,
-
-L., Yang, J. A., Gao, L., Ivchenko, D., Basant, A., Hu, Y., Yang, J., Ardestani, E. K., Wang, X., Komuravelli, R., Chu, C.-H., Yilmaz, S., Li, H., Qian, J., Feng, Z., Ma, Y., Yang, J., Wen, E., Li, H., Yang, L., Sun, C., Zhao, W., Melts, D., Dhulipala, K., Kishore, K., Graf, T., Eisenman, A., Matam, K. K., Gangidi, A., Chen, G. J., Krishnan, M., Nayak, A., Nair, K., Muthiah, B., Khorashadi, M., Bhattacharya, P., Lapukhov, P., Naumov, M., Mathews, A., Qiao, L., Smelyanskiy, M., Jia, B., and Rao, V. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA '22, pp. 993–1011, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450386104. doi: 10.1145/3470496.3533727. URL https://doi.org/10.1145/3470496.3533727.
-
-Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZulu.
-
-Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference, 2022.
-
-Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.
-
-Rabe, M. N. and Staats, C. Self-attention does not need $ o(n^{2}) $ memory, 2021.
-
-Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435.
-
-Rendle, S. Factorization machines. In 2010 IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. doi: 10.1109/ICDM.2010.127.
-
-Rendle, S., Krichene, W., Zhang, L., and Anderson, J. Neural collaborative filtering vs. matrix factorization revisited. In Fourteenth ACM Conference on Recommender Systems (RecSys'20), pp. 240–248, 2020. ISBN 9781450375832.
-
-Shazeer, N. Glu variants improve transformer, 2020.
-
-Shin, K., Kwak, H., Kim, S. Y., Ramström, M. N., Jeong, J., Ha, J.-W., and Kim, K.-M. Scaling law for recommendation models: towards general-purpose user representations. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23. AAAI Press, 2023. ISBN 978-1-57735-880-0. doi: 10.1609/aaai.v37i4.25582. URL https://doi.org/10.1609/aaai.v37i4.25582.
-
-Shrivastava, A. and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, volume 27, 2014.
-
-Sileo, D., Vossen, W., and Raymaekers, R. Zero-shot recommendation as language modeling. In Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg, K., and Setty, V. (eds.), Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part II, volume 13186 of Lecture Notes in Computer Science, pp. 223–230. Springer, 2022. doi: 10.1007/978-3-030-99739-7\_26. URL https://doi.org/10.1007/978-3-030-99739-7\_26.
-
-Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding, 2023.
-
-Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., and Jiang, P. Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pp. 1441–1450, 2019. ISBN 9781450369763.
-
-Tang, H., Liu, J., Zhao, M., and Gong, X. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems, RecSys '20, pp. 269–278, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450375832. doi: 10.1145/3383313.3412236. URL https://doi.org/10.1145/3383313.3412236.
-
-Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023a.
-
-Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b.
-
-Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010, 2017. ISBN 9781510860964.
-
-Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., and Chi, E. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, WWW '21, pp. 1785–1797, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3450078. URL https://doi.org/10.1145/3442381.3450078.
-
-Wang, Z., Zhao, L., Jiang, B., Zhou, G., Zhu, X., and Gai, K. Cold: Towards the next generation of pre-ranking system, 2020.
-
-Xia, X., Eksombatchai, P., Pancha, N., Badani, D. D., Wang, P.-W., Gu, N., Joshi, S. V., Farahpour, N., Zhang, Z., and Zhai, A. Transact: Transformer-based real-time user action model for recommendation at pinterest. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, pp. 5249–5259, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599918. URL https://doi.org/10.1145/3580305.3599918.
-
-Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 3119–3125. AAAI Press, 2017. ISBN 9780999241103.
-
-Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
-
-Yang, J., Yi, X., Zhiyuan Cheng, D., Hong, L., Li, Y., Xiaoming Wang, S., Xu, T., and Chi, E. H. Mixed negative sampling for learning two-tower neural networks in recommendations. In Companion Proceedings of the Web Conference 2020, WWW '20, pp. 441–447, 2020. ISBN 9781450370240.
-
-Zhai, J., Lou, Y., and Gehrke, J. Atlas: A probabilistic algorithm for high dimensional similarity search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pp. 997–1008, 2011. ISBN 9781450306614.
-
-Zhai, J., Gong, Z., Wang, Y., Sun, X., Yan, Z., Li, F., and Liu, X. Revisiting neural retrieval on accelerators. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, pp. 5520–5531, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599897. URL https://doi.org/10.1145/3580305.3599897.
-
-Zhai, Y., Jiang, C., Wang, L., Jia, X., Zhang, S., Chen, Z., Liu, X., and Zhu, Y. Bytetransformer: A high-performance transformer boosted for variable-length inputs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 344–355, Los Alamitos, CA, USA, May 2023b. IEEE Computer Society. doi: 10.1109/IPDPS54959.2023.00042. URL https://doi.ieeecomputersociety.org/10.1109/IPDPS54959.2023.00042.
-
-Zhang, B., Luo, L., Liu, X., Li, J., Chen, Z., Zhang, W., Wei, X., Hao, Y., Tsang, M., Wang, W., Liu, Y., Li, H., Badr, Y., Park, J., Yang, J., Mudigere, D., and Wen, E. Dhen: A deep and hierarchical ensemble network for large-scale click-through rate prediction, 2022.
-
-Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang, J. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys '18, pp. 95–103, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450359016. doi: 10.1145/3240323.3240374. URL https://doi.org/10.1145/3240323.3240374.
-
-Zhao, Z., Yang, Y., Wang, W., Liu, C., Shi, Y., Hu, W., Zhang, H., and Yang, S. Breaking the curse of quality saturation with user-centric ranking, 2023.
-
-Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. KDD '18, 2018.
-
-Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., and Wen, J.-R. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, pp. 1893–1902, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368599. doi: 10.1145/3340531.3411954. URL https://doi.org/10.1145/3340531.3411954.
-
-Zhuo, J., Xu, Z., Dai, W., Zhu, H., Li, H., Xu, J., and Gai, K. Learning optimal tree models under beam search. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.
-
-### A. Notations
-
-We summarize key notations used in this paper in Table 8 and Table 9.
-
-
-| Symbol | Description |
| $ \Psi_{k}(t_{j}) $ | The k-th training example (k is ordered globally) emitted by the feature logging system at time $ t_{j} $. In a typical DLRM recommendation system, after the user consumes some content $ \Phi_{i} $ (by responding with an action $ a_{i} $ such as skip, video completion and share), the feature logging system joins the tuple $ (\Phi_{i}, a_{i}) $ with the features used to rank $ \Phi_{i} $, and emits $ (\Phi_{i}, a_{i}, \text{features for } \Phi_{i}) $ as a training example $ \Psi_{k}(t_{j}) $. As discussed in Section 2.3, DLRMs and GRs deal with different numbers of training examples, with the number of examples in GRs typically being 1-2 orders of magnitude smaller. |
| $ n_{c} $ ($ n_{c,i} $) | Number of contents that a user has interacted with ($ n_{c,i} $ for user/sample $ i $). |
| $ \Phi_{0}, \dots, \Phi_{n_{c}-1} $ | List of contents that a user has interacted with, in the context of a recommendation system. |
| $ a_{0}, \dots, a_{n_{c}-1} $ | List of user actions corresponding to $ \Phi_{i} $s. When all predicted events are binary, each action can be considered a multi-hot vector over (atomic) events such as like, share, comment, image view, video initialization, video completion, hide, etc. |
-
-Table 8. Table of Notations (continued on the next page).
-
-
-### B. Generative Recommenders: Background and Formulations
-
-Many readers are likely more familiar with classical Deep Learning Recommendation Models (DLRMs) (Mudigere et al., 2022) given its popularity from YouTube DNN days (Covington et al., 2016) and its widespread usage in every single large online content and e-commerce platform (Cheng et al., 2016; Zhou et al., 2018; Wang et al., 2021; Chang et al., 2023; Xia et al., 2023; Zhai et al., 2023a). DLRMs operate on top of heterogeneous feature spaces using various neural
-
-
-
-| Symbol | Description |
| X | Input to an HSTU layer. In standard terminology (before batching), $ X \in \mathbb{R}^{N \times d} $ assuming we have an input sequence containing N tokens. |
| $ Q(X) $, $ K(X) $, $ V(X) $ | Query, key, value in HSTU obtained for a given input X based on Equation (1). The definition is similar to Q, K, and V in standard Transformers. $ Q(X) $, $ K(X) \in \mathbb{R}^{h \times N \times d_{qk}} $, and $ V(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ U(X) $ | HSTU uses $ U(X) $ to “gate” attention-pooled values ( $ V(X) $) in Equation (3), which together with $ f_2(\cdot) $, enables HSTU to avoid feedforward layers altogether. $ U(X) \in \mathbb{R}^{h \times N \times d_v} $. |
| $ A(X) $ | Attention tensor obtained for input X. $ A(X) \in \mathbb{R}^{h \times N \times N} $. |
| $ Y(X) $ | Output of an HSTU layer obtained for the input X. $ Y(X) \in \mathbb{R}^{d} $. |
| Split( $ \cdot $) | The operation that splits a tensor into chunks. $ \phi_1(f_1(X)) \in \mathbb{R}^{N \times (2hd_{qk} + 2hd_v)} $ in Equation (1); we obtain $ U(X) $, $ V(X) $ (both of shape $ h \times N \times d_v $), $ Q(X) $, $ K(X) $ (both of shape $ h \times N \times d_{qk} $) by splitting the larger tensor (and permuting dimensions) with $ U(X) $, $ V(X) $, $ Q(X) $, $ K(X) = \text{Split}(\phi_1(f_1(X))) $. |
| $ \text{rab}^{p,t} $ | Relative attention bias that incorporates both positional (Raffel et al., 2020) and temporal information (based on the time when the tokens are observed, $ t_0, \ldots, t_{n-1} $; one possible implementation is to apply some bucketization function to $ (t_j - t_i) $ for $ (i, j) $). In practice, we share $ \text{rab}^{p,t} $ across different attention heads within a layer, hence $ \text{rab}^{p,t} \in \mathbb{R}^{1 \times N \times N} $. |
| $ \alpha $ | Parameter controlling sparsity in the Stochastic Length algorithm used in HSTU (Section 3.2). |
| $ R $ | Register size on GPUs, in the context of the HSTU algorithm discussed in Section 3.2. |
| m | Number of candidates considered in a recommendation system's ranking stage. |
| $ b_m $ | Microbatch size, in the M-FALCON algorithm discussed in Section 3.4. |
-
-Table 9. Table of Notations (continued)
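
The Split operation in Table 9 can be illustrated with a small NumPy sketch. It starts from an already projected tensor of shape $ N \times (2hd_{qk} + 2hd_v) $ and produces per-head tensors with the shapes listed in the table. The function name `split_uvqk`, the contiguous column layout in the order U, V, Q, K, and the head-major layout inside each chunk are assumptions made for illustration; the paper's actual memory layout may differ.

```python
import numpy as np

def split_uvqk(z, h, d_qk, d_v):
    """Split z of shape (N, 2*h*d_qk + 2*h*d_v) into U, V, Q, K.

    Returns U, V with shape (h, N, d_v) and Q, K with shape (h, N, d_qk).
    The chunk order U, V, Q, K along the last axis is an assumed layout.
    """
    n = z.shape[0]
    assert z.shape[1] == 2 * h * d_qk + 2 * h * d_v
    # Column boundaries that separate the four chunks.
    bounds = np.cumsum([h * d_v, h * d_v, h * d_qk])
    u, v, q, k = np.split(z, bounds, axis=1)

    def to_heads(x, d):
        # (N, h*d) -> (N, h, d) -> (h, N, d): the "permuting dimensions" step.
        return x.reshape(n, h, d).transpose(1, 0, 2)

    return (to_heads(u, d_v), to_heads(v, d_v),
            to_heads(q, d_qk), to_heads(k, d_qk))

# Toy sizes chosen only for the example.
N, h, d_qk, d_v = 5, 2, 3, 4
z = np.random.default_rng(0).normal(size=(N, 2 * h * d_qk + 2 * h * d_v))
U, V, Q, K = split_uvqk(z, h, d_qk, d_v)
print(U.shape, V.shape, Q.shape, K.shape)
```

With these toy sizes, `U` and `V` have shape `(h, N, d_v)` and `Q` and `K` have shape `(h, N, d_qk)`, matching the table.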
-
-
-networks including feature interaction modules (Guo et al., 2017; Xiao et al., 2017; Wang et al., 2021), sequential pooling or target-aware pairwise attention modules (Hidasi et al., 2016; Zhou et al., 2018; Chang et al., 2023) and advanced multi-expert multi-task modules (Ma et al., 2018; Tang et al., 2020). We hence provided an overview of Generative Recommenders (GRs) by contrasting them with classical DLRMs explicitly in Section 2 and Section 3. In this section, we give the readers an alternative perspective starting from the classical sequential recommender literature.
-
-#### B.1. Background: Sequential Recommendations in Academia and Industry
-
-##### B.1.1. ACADEMIC RESEARCH (TRADITIONAL SEQUENTIAL RECOMMENDER SETTINGS)
-
-Recurrent neural networks (RNNs) were first applied to recommendation scenarios in GRU4Rec (Hidasi et al., 2016). Hidasi et al. (2016) considered Gated Recurrent Units (GRUs) and applied them over two datasets, RecSys Challenge 2015 $ ^{2} $ and VIDEO (a proprietary dataset). In both cases, only positive events (clicked e-commerce items or videos where users spent at least a certain amount of time watching) were kept as part of the input sequence. We further observe that in a classical industrial-scale two-stage recommendation system setup consisting of retrieval and ranking stages (Covington et al., 2016), the task that Hidasi et al. (2016) solved primarily maps to the retrieval task.
-
-Transformers, sequential transduction architectures, and their variants. Advances in sequential transduction architectures in later years, in particular Transformers (Vaswani et al., 2017), have motivated similar advancements in recommendation systems. SASRec (Kang & McAuley, 2018) first applied Transformers in an autoregressive setting. They considered the presence of a review or rating as positive feedback, thereby converting classical datasets like Amazon Reviews $ ^3 $ and MovieLens $ ^4 $ to sequences of positive items, similar to GRU4Rec. A binary cross entropy loss was employed, where the positive target is defined as the next “positive” item (recall this is in essence just the presence of a review or rating), and the negative target is randomly sampled from the item corpus $ \mathbb{X} = \mathbb{X}_c $.
-
-Most subsequent research was built upon settings similar to those of GRU4Rec (Hidasi et al., 2016) and SASRec (Kang & McAuley, 2018) discussed above, such as BERT4Rec (Sun et al., 2019), which applies the bidirectional encoder setting from BERT (Devlin et al., 2019), and S3Rec (Zhou et al., 2020), which introduces an explicit pre-training stage.
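
The binary cross-entropy objective with a randomly sampled negative, as described for SASRec above, can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names, the dot-product scoring between a per-position user state and an item embedding, and uniform negative sampling without collision handling are all assumptions.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sampled_bce_loss(user_states, item_emb, pos_items, rng):
    """Binary cross-entropy with one sampled negative per position.

    user_states: (T, d) hypothetical encoder output at each position.
    item_emb:    (num_items, d) item embedding table.
    pos_items:   (T,) id of the next "positive" item at each position.
    """
    num_items = item_emb.shape[0]
    # Uniform negatives from the whole corpus; a sampled id may occasionally
    # coincide with the positive, which this sketch ignores.
    neg_items = rng.integers(0, num_items, size=pos_items.shape)
    pos_logits = np.sum(user_states * item_emb[pos_items], axis=-1)
    neg_logits = np.sum(user_states * item_emb[neg_items], axis=-1)
    # Positive target has label 1, sampled negative has label 0.
    loss = -(log_sigmoid(pos_logits) + log_sigmoid(-neg_logits))
    return loss.mean()

rng = np.random.default_rng(0)
T, d, num_items = 6, 8, 50
states = rng.normal(size=(T, d))
items = rng.normal(size=(num_items, d))
positives = rng.integers(0, num_items, size=T)
print(sampled_bce_loss(states, items, positives, rng))
```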
-
-##### B.1.2. INDUSTRIAL APPLICATIONS AS PART OF DEEP LEARNING RECOMMENDATION MODELS (DLRMS)
-
-Sequential approaches, including sequential encoders and pairwise attention modules, have been widely applied in industrial settings due to their ability to enhance user representations as part of DLRMs. DLRMs commonly use relatively small sequence lengths, such as 20 in BST (Chen et al., 2019), 1,000 in DIN (Zhou et al., 2018), and 100 in TransAct (Xia et al., 2023). We observe that these are 1-3 orders of magnitude smaller compared with 8,192 in this work (Section 4.3).
-
-Despite using short sequence lengths, most DLRMs can successfully capture long-term user preferences. This can be attributed to two key aspects. First, precomputed user profiles/embeddings (Xia et al., 2023) or external vector stores (Chang et al., 2023) are commonly used in modern DLRMs, both of which effectively extend lookback windows. Second, a significant number of contextual-, user-, and item-side features were generally employed (Zhou et al., 2018; Chen et al., 2019; Chang et al., 2023; Xia et al., 2023) and various heterogeneous networks, such as FMs (Xiao et al., 2017; Guo et al., 2017), DCNs (Wang et al., 2021), MoEs, etc. are used to transform representations and combine outputs.
-
-In contrast to the sequential settings discussed in Appendix B.1.1, all major industrial work defines loss over (user/request, candidate item) pairs. In the ranking setting, a multi-task binary cross-entropy loss is commonly used. In the retrieval setting, the two-tower setting (Covington et al., 2016) remains the dominant approach. Recent work has investigated representing the next item to recommend as a probability distribution over a sequence of (sub-)tokens, such as OTM (Zhuo et al., 2020) and DR (Gao et al., 2021) (note that in other recent work, the same setting is sometimes denoted as “generative retrieval”). They commonly utilize beam search to decode the item from sub-tokens. Advanced learned similarity functions, such as mixture-of-logits (Zhai et al., 2023a), have also been proposed and deployed as an alternative to the two-tower setting and beam search given the proliferation of modern accelerators such as GPUs, custom ASICs, and TPUs.
-
-From a problem formulation perspective, we consider all work discussed above part of DLRMs (Mudigere et al., 2022) given the model architectures, features used, and losses used differ significantly from academic sequential recommender research discussed in Appendix B.1.1. It's also worth remarking that there have been no successful applications of fully sequential ranking settings in industry, especially not at billion daily active users (DAU) scale, prior to this work.
-
-#### B.2. Formulations: Ranking and Retrieval as Sequential Transduction Tasks in Generative Recommenders (GRs)
-
-We next discuss three limitations in the traditional sequential recommender settings and DLRM settings, and how Generative Recommenders (GRs) address them from a problem formulation perspective.
-
-Ignorance of features other than user-interacted items. Past sequential formulations only consider contents (items) users explicitly interacted with (Hidasi et al., 2016; Kang & McAuley, 2018; Sun et al., 2019; Zhou et al., 2020), while industry-scale recommendation systems prior to GRs are trained over a vast number of features to enhance the representation of users and contents (Covington et al., 2016; Cheng et al., 2016; Zhou et al., 2018; Chen et al., 2019; Chang et al., 2023; Xia et al., 2023; Zhai et al., 2023a). GRs address this limitation by a) compressing other categorical features and merging them with the main time series, and b) capturing numerical features through cross-attention interaction utilizing a target-aware formulation as discussed in Section 2.1 and Figure 2. We validate this by showing that the traditional “interaction-only” formulation that ignores such features degrades model quality significantly; experiment results can be found in the rows labeled “GR (interactions only)” in Table 7 and Table 6, where we show that utilizing only interaction history led to a 1.3% decrease in hit rate@100 for retrieval and a 2.6% NE increase in ranking (lower NE is better; recall that a 0.1% change in NE is significant, as discussed in Sections 4.1.2 and 4.3.1).
-
-User representations are computed in a target-independent setting. A second issue is that most traditional sequential recommenders, including GRU4Rec (Hidasi et al., 2016), SASRec (Kang & McAuley, 2018), BERT4Rec (Sun et al., 2019), S3Rec (Zhou et al., 2020), etc., are formulated in a target-independent fashion: for a target item $ \Phi_i $, the items $ \Phi_0, \Phi_1, \ldots, \Phi_{i-1} $ are used as encoder input to compute user representations, which are then used to provide predictions. In contrast, most major DLRM approaches used in industrial settings formulate their sequential modules in a target-aware fashion, with the ability to incorporate “target” (ranking candidate) information into the user representations. These include DIN (Zhou et al., 2018) (Alibaba), BST (Chen et al., 2019) (Alibaba), TWIN (Chang et al., 2023) (Kwai), and TransAct (Xia et al., 2023) (Pinterest).
-
-Generative Recommenders (GRs) combine the best of both worlds by interleaving the content and action sequences (Section 2.2) to enable applying target-aware attention in causal, autoregressive settings. We categorize and contrast prior work and this work in Table 10 $ ^{5} $.
-
-
-
- | Input for target item $ i $ | Expected output for target item $ i $ | Architecture | Training Procedure |
| GRs | $ \Phi_0, a_0, \Phi_1, a_1, ..., \Phi_i $ | $ a_i $ (target-aware) | Self-attention (HSTU) | Causal autoregressive (streaming/single-pass) |
| GRU4Rec<br>SASRec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $ | $ \Phi_i $ | RNNs (GRUs)<br>Self-attention (Transformers) | Causal autoregressive (multi-pass) |
| BERT4Rec<br>S3Rec | $ \Phi_0, \Phi_1, ..., \Phi_{i-1} $<br>(at inference time) | $ \Phi_i $ | Self-attention (Transformers) | Sequential multi-pass $ ^6 $ |
| DIN<br>BST<br>TWIN<br>TransAct | $ \Phi_0, \Phi_1, ..., \Phi_i $<br>$ (\Phi_0, a_0), ..., (\Phi_{i-1}, a_{i-1}), \Phi_i $ | $ a_i $ (target aware, implicitly as part of DLRMs) | Pairwise attention<br>Self-attention (Transformers)<br>Two-stage pairwise attention<br>Self-attention (Transformers) | Pointwise (generally streaming/single pass) |
-
-Table 10. Comparison of prior work on sequential recommenders and GRs, in the ranking setting, with DLRMs included for completeness.
-
-Discriminative formulations restrict applicability of prior sequential recommender work to pointwise settings. Finally, traditional sequential recommenders are discriminative by design. Existing sequential recommender literature, including seminal work such as GRU4Rec and SASRec, models $ p(\Phi_i|\Phi_0, a_0, \ldots, \Phi_{i-1}, a_{i-1}) $, the conditional distribution of the next item to recommend given the user's current state. On the other hand, we observe that there are two probabilistic processes in standard recommendation systems, namely the process of the recommendation system suggesting a content $ \Phi_i $ (e.g., some photo or video) to the user, and the process of the user reacting to the suggested content $ \Phi_i $ via some action $ a_i $ (which can be a combination of like, video completion, skip, etc.).
-
-A generative approach needs to model the joint distribution over the sequence of suggested contents and user actions, or $p(\Phi_{0}, a_{0}, \Phi_{1}, a_{1}, \ldots, \Phi_{n_{c}-1}, a_{n_{c}-1})$, as discussed in Section 2.2. Our proposal of Generative Recommenders enables modeling of such distributions, as shown in Table 11 (Figure 8). Note that the next action token $(a_{i})$ prediction task is exactly the GR ranking setting discussed in Table 1, whereas the next content $(\Phi_{i})$ prediction task is similar to the retrieval setting adapted to the interleaved setting, with the target changed in order to learn the input data distribution.
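
To make the link between the joint distribution and the two prediction tasks explicit, the chain rule of probability gives the following factorization. It is a standard identity rather than a result taken from the paper, and it uses only the symbols defined above:

$$
p(\Phi_{0}, a_{0}, \ldots, \Phi_{n_{c}-1}, a_{n_{c}-1}) = \prod_{i=0}^{n_{c}-1} p(\Phi_{i} \mid \Phi_{0}, a_{0}, \ldots, \Phi_{i-1}, a_{i-1}) \; p(a_{i} \mid \Phi_{0}, a_{0}, \ldots, \Phi_{i-1}, a_{i-1}, \Phi_{i})
$$

The first factor in each term corresponds to the next content $(\Phi_{i})$ prediction task and the second to the next action token $(a_{i})$ prediction task; for $i = 0$ the conditioning set of the first factor is empty.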
-
-
-| Task | Field | Specification |
| --- | --- | --- |
| Next action token ($ a_{i} $) prediction | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ a_{0}, \varnothing, a_{1}, \varnothing, ..., a_{n_{c}-2}, \varnothing, a_{n_{c}-1}, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
| Next content ($ \Phi_{i} $) prediction | $ x_{i} $s | $ \Phi_{0}, a_{0}, \Phi_{1}, a_{1}, ..., \Phi_{n_{c}-2}, a_{n_{c}-2}, \Phi_{n_{c}-1}, a_{n_{c}-1} $ |
| | $ y_{i} $s | $ \varnothing, \Phi_{1}, \varnothing, \Phi_{2}, ..., \varnothing, \Phi_{n_{c}-1}, \varnothing, \varnothing $ |
| | $ n $ | $ 2n_{c} $ |
-
-Table 11. Generative modeling over $ p(\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}) $. An illustration is provided in Figure 8.
-
-
-Importantly, this formulation not only enables proper modeling of the data distribution but also enables sampling sequences of items to recommend to the user directly via, e.g., beam search. We hypothesize that this will lead to a superior approach compared with traditional listwise settings (e.g., DPP (Gillenwater et al., 2014) and RL (Zhao et al., 2018)), and we leave the full formulation and evaluation of such systems (briefly discussed in Section 6) as future work.
-
-### C. Evaluation: Synthetic Data
-
-As previously discussed in Section 3.1, standard softmax attention, due to its normalization factor, makes it challenging to capture the intensity of user preferences, which is important for user representation learning. This aspect is important in recommendation scenarios as the system may need to predict the intensity of engagements (e.g., number of future positive actions on a particular topic) in addition to the relative ordering of items.
-
-
-Figure 8. Comparison of traditional sequential recommenders (left) and Generative Recommenders (right). We illustrate sequential recommenders in causal autoregressive settings and GRs without contextual features to facilitate comparison. On the left hand side, the action types $ a_{i} $s are either ignored or combined with item information $ \Phi_{i} $s using MLPs, before going into self-attention blocks.
-
-To understand this behavior, we construct synthetic data following a Dirichlet Process that generates streaming data over a dynamic set of vocabulary. Dirichlet Process captures the behavior that ‘rich gets richer’ in user engagement histories. We set up the synthetic experiment as follows:
-
-• We randomly assign each one of 20,000 item ids to exactly one of 100 categories.
-
-• We generate 1,000,000 records of length 128 each, with the first 90% being used for training and the final 10% used for testing. To simulate the streaming training setting, we make the initial 40% of item ids available initially and the rest available progressively at equal intervals; i.e., at record 500,000, the maximum id that can be sampled is $ (40\% + 60\% \times 0.5) \times 20,000 = 14,000 $.
-
-• We randomly select up to 5 categories out of 100 for each record and randomly sample a prior $ H_{c} $ over these 5 categories. We sequentially sample category for each position following a Dirichlet process over possible categories as follows:
-
-  - for $ n > 1 $:
-
-    * with probability $ \alpha/(\alpha+n-1) $, draw category $ c $ from $ H_{c} $.
-
-    * with probability $ n_{c}/(\alpha + n - 1) $, draw category $ c $, where $ n_{c} $ is the number of previous items with category $ c $.
-
-    * randomly sample an assigned item matching category $ c $ subject to streaming constraints.
-
-where $ \alpha $ is uniformly sampled at random from (1.0, 500.0).
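-
The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' generator. The function names, the form of the random prior $ H_{c} $ (normalized uniform draws), the restriction to categories that already have an available item, and the use of the same rule at position $ n = 1 $ (where it reduces to a draw from $ H_{c} $) are all assumptions. The two probabilistic branches are folded into a single draw with weight $ \alpha H_{c}(c) + n_{c} $ for category $ c $, which is equivalent because these weights sum to $ \alpha + n - 1 $.

```python
import random

NUM_ITEMS, NUM_CATEGORIES, NUM_RECORDS, RECORD_LEN = 20_000, 100, 1_000_000, 128

def max_available_id(record_idx, num_items=NUM_ITEMS, num_records=NUM_RECORDS):
    """Ids below this bound may be sampled: 40% up front, the rest released linearly."""
    initial = num_items * 4 // 10
    return initial + (num_items - initial) * record_idx // num_records

def sample_record(rng, items_by_category, max_id, length=RECORD_LEN, max_cats=5):
    """Sample one record of item ids (hypothetical helper, see assumptions above)."""
    # Streaming constraint: only ids below max_id are visible.
    pools = {c: [i for i in ids if i < max_id] for c, ids in items_by_category.items()}
    candidates = [c for c, pool in pools.items() if pool]
    cats = rng.sample(candidates, rng.randint(1, min(max_cats, len(candidates))))
    raw = [rng.random() for _ in cats]
    prior = [w / sum(raw) for w in raw]          # random prior H_c (assumed form)
    alpha = rng.uniform(1.0, 500.0)
    counts = [0] * len(cats)                     # n_c for each selected category
    record = []
    for _ in range(length):
        # P(c) = (alpha * H_c(c) + n_c) / (alpha + n - 1); choices() normalizes.
        weights = [alpha * p + k for p, k in zip(prior, counts)]
        j = rng.choices(range(len(cats)), weights=weights)[0]
        counts[j] += 1
        record.append(rng.choice(pools[cats[j]]))
    return record, cats

rng = random.Random(0)
item_category = [rng.randrange(NUM_CATEGORIES) for _ in range(NUM_ITEMS)]
items_by_category = {}
for item_id, category in enumerate(item_category):
    items_by_category.setdefault(category, []).append(item_id)

record, cats = sample_record(rng, items_by_category, max_available_id(500_000))
```

Larger values of $ \alpha $ keep the draws close to the prior, while smaller values let early categories dominate the rest of the record, which is the "rich gets richer" behavior.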
-
-The results can be found in Table 2. We always ablate $ rab^{p,t} $ for HSTU as this dataset does not have timestamps. We observe HSTU increasing Hit Rate@10 by more than 100% relative to standard Transformers. Importantly, replacing HSTU's pointwise attention mechanism with softmax ("HSTU w/ Softmax") also leads to a significant reduction in hit rate, verifying the importance of pointwise attention-like aggregation mechanisms.
-
-### D. Evaluation: Traditional Sequential Recommender Settings
-
-Our evaluations in Section 4.1.1 focused on comparing HSTU with a state-of-the-art Transformer baseline, SASRec, using the latest training recipe. In this section, we further consider two alternative approaches.
-
-Recurrent neural networks (RNNs). We consider the classical work on sequential recommender, GRU4Rec (Hidasi et al., 2016), to help readers understand how self-attention models, including Transformers and HSTU, compare to traditional RNNs, when all the latest modeling and training improvements are fully incorporated.
-
-Self-supervised sequential approaches. We consider the most popular work, BERT4Rec (Sun et al., 2019), to understand how bidirectional self-supervision (leveraged in BERT4Rec via a Cloze objective) compares with unidirectional causal autoregressive settings, such as SASRec and HSTU.
-
-
-
-| Dataset | Method | HR@10 | HR@50 | HR@200 | NDCG@10 | NDCG@200 |
| --- | --- | --- | --- | --- | --- | --- |
| ML-1M | SASRec (2023) | .2853 | .5474 | .7528 | .1603 | .2498 |
| | BERT4Rec | .2843 (-0.4%) | - | - | .1537 (-4.1%) | - |
| | GRU4Rec | .2811 (-1.5%) | - | - | .1648 (+2.8%) | - |
| | HSTU | .3097 (+8.6%) | .5754 (+5.1%) | .7716 (+2.5%) | .1720 (+7.3%) | .2606 (+4.3%) |
| | HSTU-large | .3294 (+15.5%) | .5935 (+8.4%) | .7839 (+4.1%) | .1893 (+18.1%) | .2771 (+10.9%) |
| ML-20M | SASRec (2023) | .2906 | .5499 | .7655 | .1621 | .2521 |
| | BERT4Rec | .2816 (-3.4%) | - | - | .1703 (+5.1%) | - |
| | GRU4Rec | .2813 (-3.2%) | - | - | .1730 (+6.7%) | - |
| | HSTU | .3252 (+11.9%) | .5885 (+7.0%) | .7943 (+3.8%) | .1878 (+15.9%) | .2774 (+10.0%) |
| | HSTU-large | .3567 (+22.8%) | .6149 (+11.8%) | .8076 (+5.5%) | .2106 (+30.0%) | .2971 (+17.9%) |
| Books | SASRec (2023) | .0292 | .0729 | .1400 | .0156 | .0350 |
| | HSTU | .0404 (+38.4%) | .0943 (+29.5%) | .1710 (+22.1%) | .0219 (+40.6%) | .0450 (+28.6%) |
| | HSTU-large | .0469 (+60.6%) | .1066 (+46.2%) | .1876 (+33.9%) | .0257 (+65.8%) | .0508 (+45.1%) |
-
-Table 12. Evaluations of methods on public datasets in traditional sequential recommender settings (multi-pass, full-shuffle). Compared with Table 4, two other baselines (GRU4Rec and BERT4Rec) are included for completeness.
-
-
-Results are presented in Table 12. We reuse BERT4Rec results and GRU4Rec results on ML-1M and ML-20M as reported by Klenitskiy & Vasilev (2023). Given a sampled softmax loss is used, we hold the number of negatives used constant (128 for ML-1M, ML-20M and 512 for Amazon Books) to ensure a fair comparison between methods.
-
-The results confirm that SASRec remains one of the most competitive approaches in traditional sequential recommendation settings when sampled softmax loss is used (Zhai et al., 2023a; Klenitskiy & Vasilev, 2023), while HSTU significantly outperforms evaluated transformers, RNNs, and self-supervised bidirectional transformers.
-
-### E. Evaluation: Traditional DLRM Baselines
-
-The DLRM baseline configurations used in Section 4 reflect continued iterations of hundreds of researchers and engineers over multiple years and a close approximation of production configurations on a large internet platform with billions of daily active users before HSTUs/GRs were deployed. We give a high level description of the models used below.
-
-Ranking Setting. The baseline ranking model, as described in (Mudigere et al., 2022), employs approximately one thousand dense features and fifty sparse features. We incorporated various modeling techniques such as Mixture of Experts (Ma et al., 2018), variants of Deep & Cross Network (Wang et al., 2021), various sequential recommendation modules including target-aware pairwise attention (one commonly used variant in industrial settings can be found in (Zhou et al., 2018)), and residual connection over special interaction layers (He et al., 2015; Zhang et al., 2022). For the low FLOPs regime in the scaling law section (Section 4.3.1), some modules with high computational costs were simplified and/or replaced with other state-of-the-art variants like DCNs to achieve desired FLOPs.
-
-While we cannot disclose the exact settings due to confidentiality considerations, to the best of our knowledge, our baseline represents one of the best known DLRM approaches when recent research is fully incorporated. To validate this claim and to facilitate readers' understanding, we report a typical setup based on identical features but only utilizing major published results including DIN (Zhou et al., 2018), DCN (Wang et al., 2021), and MMoE (Ma et al., 2018) ("DLRM (DIN+DCN)") in Table 7, with the combined architecture illustrated in Figure 9. This setup significantly underperformed our production DLRM setup by 0.71% in NE for the main E-Task and 0.57% in NE for the main C-Task (where 0.1% NE is significant).
-
-Retrieval Setting. The baseline retrieval model employs a standard two-tower neural retrieval setting (Covington et al., 2016) with mixed in-batch and out-of-batch sampling. The input feature set consists of both high cardinality sparse features (e.g., item ids, user ids) and low cardinality sparse features (e.g., languages, topics, interest entities). A stack of feed forward layers with residual connections (He et al., 2015) is used to compress the input features into user and item embeddings.
-
-Features and Sequence Length. The features used in both of the DLRM baselines, including main user interaction history that is utilized by various sequential encoder/pairwise attention modules, are strict supersets of the features used in all GR candidates. This applies to all studies conducted in this paper, including those used in the scaling studies (Section 4.3.1).
-
-
-Figure 9. A high level architecture of a baseline DLRM ranking model ("DLRM (DIN+DCN)" in Table 7) that utilizes major published work including DIN (Zhou et al., 2018), DCN (Wang et al., 2021), and MMoE (Ma et al., 2018).
-
-
-
-| Metric Name | Greedy Selection | Feature-Weighted Selection | Random Selection |
| --- | --- | --- | --- |
| Main Engagement Metric (NE) | 0.495 | 0.494 | 0.495 |
| Main Consumption Metric (NE) | 0.792 | 0.789 | 0.791 |
-
-Table 13. Comparison of subsequence selection methods for Stochastic Length on model quality, measured by Normalized Entropy (NE).
-
-
-### F. Stochastic Length
-
-#### F.1. Subsequence Selection
-
-In Equation (4), we select a subsequence of length L from the full user history in order to increase sparsity. Our empirical results indicate that careful design of the subsequence selection technique can improve model quality. We compute a metric $ f_{i}=t_{n}-t_{i} $ which corresponds to the amount of time elapsed since the user interacted with item $ x_{i} $. We conduct offline experiments with the following subsequence selection methods:
-
-• Greedy Selection – Selects L items with smallest values of $ f_{i} $ from S
-
-• Random Selection – Selects L items from S randomly
-
-• Feature-Weighted Selection – Selects L items from S according to a weighted distribution $ 1 - f_{n,i}/(\sum_{j=1}^{L}f_{j,i}) $
-
-During our offline experiments, the feature-weighted subsequence selection method resulted in the best model quality, as shown in Table 13.
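-
The three selection methods can be sketched as follows. This is an illustrative sketch rather than the production implementation. It assumes timestamps are given in chronological order, reads the feature-weighted distribution as an unnormalized weight $ 1 - f_{i}/\sum_{j} f_{j} $ per item, and draws without replacement one item at a time; the function names are hypothetical.

```python
import random

def elapsed(timestamps):
    """f_i = t_n - t_i: time elapsed since interaction i, with t_n the latest timestamp."""
    t_n = max(timestamps)
    return [t_n - t for t in timestamps]

def greedy_select(timestamps, L):
    """Keep the L most recent interactions (smallest f_i)."""
    f = elapsed(timestamps)
    keep = sorted(range(len(f)), key=lambda i: f[i])[:L]
    return sorted(keep)  # indices back in chronological order

def random_select(timestamps, L, rng):
    """Keep L interactions chosen uniformly at random."""
    return sorted(rng.sample(range(len(timestamps)), min(L, len(timestamps))))

def weighted_select(timestamps, L, rng):
    """Keep L interactions, favoring recent ones via weight 1 - f_i / sum_j f_j (assumed reading)."""
    f = elapsed(timestamps)
    total = sum(f) or 1.0  # avoid dividing by zero when every f_i is 0
    weights = [1.0 - fi / total for fi in f]
    remaining = list(range(len(f)))
    chosen = []
    while remaining and len(chosen) < L:
        # Clamp to a tiny positive value so choices() never sees all-zero weights.
        w = [max(weights[i], 1e-12) for i in remaining]
        pick = rng.choices(remaining, weights=w)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return sorted(chosen)

ts = [0, 10, 20, 30, 40, 50]
recent = greedy_select(ts, 3)                         # the three latest positions
sampled = weighted_select(ts, 3, random.Random(0))    # recency-biased sample
```

Greedy selection is deterministic and discards all older history, whereas the weighted variant still occasionally keeps older interactions.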
-
-#### F.2. Impact of Stochastic Length on Sequence Sparsity
-
-In Table 3, we show the impact of Stochastic Length on sequence sparsity for a representative industry-scale configuration with 30-day user engagement history. The sequence sparsity is defined as one minus the ratio of the average sequence length of all samples to the maximum sequence length. To better characterize the computational cost of sparse attentions, we also define $ s_{2} $ as the sparsity of the attention matrix, i.e., one minus the fraction of its entries that are computed. For reference, we present the results for 60-day and 90-day user engagement history in Table 14 and Table 15, respectively.
-
-| Alpha | Max length 1,024: sparsity | 1,024: $ s_2 $ | Max length 2,048: sparsity | 2,048: $ s_2 $ | Max length 4,096: sparsity | 4,096: $ s_2 $ | Max length 8,192: sparsity | 8,192: $ s_2 $ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 71.5% | 89.4% | 75.8% | 92.3% | 79.4% | 94.7% | 83.8% | 97.3% |
| 1.7 | 57.3% | 77.6% | 60.6% | 79.8% | 67.3% | 86.6% | 74.5% | 93.3% |
| 1.8 | 37.5% | 56.2% | 42.6% | 62.1% | 51.9% | 74.2% | 62.6% | 85.5% |
| 1.9 | 15.0% | 25.2% | 17.7% | 29.0% | 29.6% | 47.5% | 57.8% | 80.9% |
| 2.0 | 1.2% | 1.7% | 2.5% | 3.5% | 18.9% | 30.8% | 57.6% | 80.6% |
-
-Table 14. Impact of Stochastic Length (SL) on sequence sparsity, over a 60d user engagement history.
-
-
-
-| Alpha | Max length 1,024: sparsity | 1,024: $ s_2 $ | Max length 2,048: sparsity | 2,048: $ s_2 $ | Max length 4,096: sparsity | 4,096: $ s_2 $ | Max length 8,192: sparsity | 8,192: $ s_2 $ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.6 | 68.0% | 85.0% | 74.6% | 90.8% | 78.6% | 93.5% | 83.5% | 97.3% |
| 1.7 | 56.3% | 76.1% | 61.2% | 80.6% | 67.5% | 87.0% | 74.3% | 93.3% |
| 1.8 | 38.9% | 58.3% | 42.0% | 61.3% | 50.4% | 72.4% | 61.0% | 84.4% |
| 1.9 | 16.2% | 27.3% | 17.3% | 28.6% | 27.2% | 44.4% | 54.3% | 77.8% |
| 2.0 | 0.9% | 1.2% | 1.6% | 2.1% | 13.5% | 22.5% | 54.0% | 77.4% |
-
-Table 15. Impact of Stochastic Length (SL) on sequence sparsity, over a 90d user engagement history.
-
-
-#### F.3. Comparisons Against Sequence Length Extrapolation Techniques
-
-We conduct additional studies to verify that Stochastic Length is competitive against existing techniques for sequence length extrapolation used in language modeling. Many existing methods perform sequence length extrapolation through modifications of RoPE (Su et al., 2023). To compare against existing methods, we train an HSTU variant (HSTU-RoPE) that uses rotary embeddings and no relative attention bias.
-
-We evaluate the following sequence length extrapolation methods on HSTU-RoPE:
-
-• Zero-Shot - Apply NTK-Aware RoPE (Peng et al., 2024) before directly evaluating the model with no finetuning;
-
-• Fine-tune - Finetune the model for 1000 steps after applying NTK-by-parts (Peng et al., 2024).
-
-We evaluate the following sequence length extrapolation methods on HSTU (includes relative attention bias, no rotary embeddings):
-
-• Zero-Shot - Clamp the relative position bias according to the maximum training sequence length, directly evaluate the model (Raffel et al., 2020; Press et al., 2022);
-
-• Fine-tune - Clamp the relative position bias according to the maximum training sequence length, fine-tune the model for 1000 steps before evaluating the model.
-
-
-Figure 10. Impact of Stochastic Length (SL) on ranking model metrics. Left to right: n = [1024, 2048, 4096, 8192] (n is after interleaving algorithm as discussed in Section 2.2 to enable target-aware cross attention in causal-masked settings).
-
-
-
-| Evaluation Strategy | Model Type | Average NE Difference vs Full Sequence Baseline (2048 / 52% Sparsity) | Average NE Difference vs Full Sequence Baseline (4096 / 75% Sparsity) |
| --- | --- | --- | --- |
| Zero-shot | HSTU (Raffel et al., 2020) | 6.46% | 10.35% |
| | HSTU-RoPE (Peng et al., 2024) | 7.51% | 11.27% |
| Fine-tune | HSTU (Raffel et al., 2020) | 1.92% | 2.21% |
| | HSTU-RoPE (Peng et al., 2024) | 1.61% | 2.19% |
| Stochastic Length (SL) | HSTU | 0.098% | 0.64% |
-
-Table 16. Comparisons of Stochastic Length (SL) vs existing Length Extrapolation methods.
-
-
-In Table 16, we report the NE difference between models with induced data sparsity during training (Stochastic Length, zero-shot, fine-tuning) and models trained on the full data. We define the sparsity for zero-shot and fine-tuning techniques to be one minus the average sequence length during training divided by the max sequence length during evaluation. All zero-shot and fine-tuned models are trained on 1024 sequence length data and are evaluated against 2048 and 4096 sequence length data. In order to find an appropriate Stochastic Length baseline for these techniques, we select Stochastic Length settings which result in the same data sparsity metrics.
-
-We believe that zero-shot and fine-tuning approaches to sequence length extrapolation are not well-suited for recommendation scenarios that deal with high cardinality ids. Empirically, we observe that Stochastic Length significantly outperforms fine-tuning and zero-shot approaches. We believe that this could be due to our large vocabulary size. Zero-shot and fine-tuning approaches fail to learn good representations for older ids, which could hurt their ability to fully leverage the information contained in longer sequences.
-
-### G. Sparse Grouped GEMMs and Fused Relative Attention Bias
-
-We provide additional information about the efficient HSTU attention kernel that was introduced in Section 3.2. Our approach builds upon Memory-efficient Attention (Rabe & Staats, 2021) and FlashAttention (Dao et al., 2022), and is a memory-efficient self-attention mechanism that divides the input into blocks and avoids materializing the large $ h \times N \times N $ intermediate attention tensors for the backward pass. By exploiting the sparsity of input sequences, we can reformulate the attention computation as a group of back-to-back GEMMs with different shapes. We implement efficient GPU kernels to accelerate this computation. The construction of the relative attention bias is also a bottleneck due to memory accesses. To address this issue, we have fused the relative bias construction and the grouped GEMMs into a single GPU kernel and managed to accumulate gradients using GPU's fast shared memory in the backward pass. Although our algorithm requires recomputing attention and relative bias in the backward pass, it is significantly faster and uses less memory than the standard approach used in Transformers.
-
-### H. Microbatched-Fast Attention Leveraging Cacheable OperatioNs (M-FALCON)
-
-In this section, we provide a detailed description of the M-FALCON algorithm discussed in Section 3.4. We give pseudocode for M-FALCON in Algorithm 1. M-FALCON introduces three key ideas.
-
-
-
-
-(a) GR's ranking model training (with $ n = 2n_{c} $ tokens), in causal autoregressive settings.
-
-
-
-
-
-(b) GR's ranking model inference utilizing the M-FALCON algorithm.
-
-
-Figure 11. Illustration of the M-FALCON algorithm. Top: model training in GR’s target-aware formulation. Bottom: model inference with $m$ candidates $\Phi'_0, \ldots, \Phi'_{m-1}$, divided into $\lceil m/b_m \rceil$ microbatches, where we show model inference for the first microbatch $\Phi'_0, \ldots, \Phi'_{b_m-1}$ (with $2n_c + b_m$ total tokens after $\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}$ are taken into account) above the dotted line. Note that the self-attention algorithm is modified such that $\Phi'_i$ cannot attend to $\Phi'_j$ when $i \neq j$ - this is highlighted with “×” in the figure.
-
-
-Batched inference can be applied to causal autoregressive settings. The ranking task in GR is formulated in a target-aware fashion as discussed in Section 2.2. Common wisdom suggests that in a target-aware setting, we need to perform inference for one item at a time, with a cost of $ O(mn^2d) $ for m candidates and a sequence length of n. Here we show that this is not the optimal solution; even with vanilla Transformers, we can modify the attention mask used in self-attention to batch such operations (“batched inference”) and reduce cost to $ O((n+m)^2d) = O(n^2d) $.
-
-An illustration is provided in Figure 11. Here, both Figure 11 (a) and (b) involve an attention mask matrix for causal autoregressive settings. The key difference is that Figure 11 (a) uses a standard lower triangular matrix of size $ 2n_{c} $ for causal autoregressive training, whereas Figure 11 (b) modifies a lower triangular matrix of size $ 2n_c + b_m $ by setting entries for $ (i, j) $s where $ i, j \geq 2n_c $, $ i \neq j $ to False or $ -\infty $ to prevent target positions $ \Phi'_0, \ldots, \Phi'_{b_m-1} $ from attending to each other. It is easy to see that by doing so, the output of the self-attention block for $ \Phi'_i $, $ a'_i $, only depends on $ \Phi_0 $, $ a_0 $, $ \ldots $, $ \Phi_{n_c-1} $, $ a_{n_c-1} $ (and on $ \Phi'_i $ itself), but not on $ \Phi'_j $ ( $ i \neq j $). In other words, by making a forward pass over $ (2n_c + b_m) $ tokens using the modified attention mask, we can now obtain the same results for the last $ b_m $ tokens as if we had made $ b_m $ separate forward passes over $ (2n_c + 1) $ tokens, with $ \Phi'_i $ placed at the $ 2n_c $-th (0-based) position during the $ i $-th forward pass utilizing a standard causal attention mask.
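-
The equivalence claimed above can be checked numerically on a toy model. The sketch below is not the HSTU kernel: it uses a single softmax attention head with identity projections and no positional information, and the helper names `make_mask` and `attention` are hypothetical. It builds the modified mask for $ n $ history tokens plus $ b_m $ candidates and compares the batched outputs with $ b_m $ separate causal passes.

```python
import math
import random

def make_mask(n, b_m):
    """allowed[i][j] is True when position i may attend to position j."""
    size = n + b_m
    allowed = [[j <= i for j in range(size)] for i in range(size)]  # causal mask
    for i in range(n, size):
        for j in range(n, size):
            if i != j:
                allowed[i][j] = False  # candidates must not see each other
    return allowed

def attention(x, allowed):
    """Toy single-head softmax self-attention with identity Q/K/V projections."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        idx = [j for j in range(len(x)) if allowed[i][j]]
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d) for j in idx]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wk * x[j][c] for wk, j in zip(w, idx)) / z for c in range(d)])
    return out

rng = random.Random(0)
n, b_m, d = 6, 3, 4
history = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
candidates = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(b_m)]

# One pass over n + b_m tokens with the modified mask ...
batched = attention(history + candidates, make_mask(n, b_m))[n:]
# ... versus b_m passes over n + 1 tokens with a standard causal mask.
one_by_one = [attention(history + [c], make_mask(n, 1))[n] for c in candidates]
max_diff = max(abs(a - b) for r1, r2 in zip(batched, one_by_one) for a, b in zip(r1, r2))
```

In this toy setting `max_diff` is zero up to floating-point error. The same argument carries over to stacked layers because, under the causal mask, history positions never depend on candidate positions.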
-
-Microbatching scales batched inference to large candidate sets. The ranking stage may need to deal with a large number of ranking candidates, up to tens of thousands (Wang et al., 2020). We can divide the overall $m$ candidates into $\lceil m/b_m \rceil$ microbatches of size $b_m$ such that $O(b_m) = O(n)$, which retains the $O((n + m)^2 d) = O(n^2 d)$ running time previously discussed for most practical recommender settings, up to tens of thousands of candidates.
-
-Encoder-level caching enables compute sharing within and across requests. Finally, KV caching (Pope et al., 2022) can be applied both within and across requests. For instance, for the HSTU model presented in this work (Section 3), $ K(X) $ and $ V(X) $ are fully cacheable across microbatches within and/or across requests. For a cached forward pass, we only need to compute $ U(X) $, $ Q(X) $, $ K(X) $, and $ V(X) $ for the last $ b_m $ tokens, while reusing cached $ K(X) $ and $ V(X) $ for the sequentialized user history containing n tokens. $ f_2(\text{Norm}(A(X)V(X)) \odot U(X)) $ similarly only needs to be recomputed for the $ b_m $ candidates. This reduces the cached forward pass's computational complexity to $ O(b_m d^2 + b_m nd) $, which significantly improves upon $ O((n + b_m)d^2 + (n + b_m)^2 d) $ by a factor of 2-4 even when $ b_m = n $.
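-
The factor of 2-4 can be checked by substituting $ b_m = n $ into the two cost expressions. This is a worked check that ignores constant factors, as the asymptotic expressions in the text do:

$$
\left.\frac{(n + b_m)d^2 + (n + b_m)^2 d}{b_m d^2 + b_m n d}\right|_{b_m = n} = \frac{2nd^2 + 4n^2 d}{nd^2 + n^2 d} = \frac{2d + 4n}{d + n} = 2 + \frac{2n}{d + n}
$$

The ratio approaches 2 when $ d \gg n $ and approaches 4 when $ n \gg d $, which is consistent with the stated range.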
-
-Algorithm 1 M-FALCON Algorithm.
-
-1: Input: Merged token series $ x_0, x_1, \ldots, x_{n-1} $ (can be e.g., $ (\Phi_0, a_0, \ldots, \Phi_{n_c-1}, a_{n_c-1}) $ where $ n = 2n_c $); m ranking candidates $ \Phi'_0, \ldots, \Phi'_{m-1} $; a b-layer h-heads self-attention model trained in causal autoregressive settings (e.g., HSTU or Transformers) $ f(X, cacheStates, attnMask) \to (X', updatedCacheStates) $ where $ X, X' \in \mathbb{R}^{N \times d} $, attnMask $ \in \mathbb{R}^{N \times N} $, and cachedStates, updatedCacheStates $ \in \mathbb{R}^{b \times h \times N \times d_{qk}} \times \mathbb{R}^{b \times h \times N \times d_{qk}} $ (due to caching $ K(X) $s and $ V(X) $s across b layers); microbatch size $ b_m $, where we assume m is a multiple of $ b_m $ for simplicity.
-
-2: Output: Predictions for all m ranking candidates, $ (a'_0, \ldots, a'_{m-1}) $.
-
-3: numMicrobatches = $ (m + b_m - 1) // b_m $
-
-4: attnMask = $ L_{n+b_m} $ { $ L_{n+b_m} $ represents a lower triangular matrix. Lower triangular entries are 0s, the rest are $ -\infty $. }
-
-5: attnMask[i, j] = $ -\infty $ for i, j $ \geq n $, i $ \neq $ j { This prevents the last $ b_m $ entries from attending to each other. }
-
-6: $ (a'_0, a'_1, \ldots, a'_{b_m-1}) $, $ kvCache \leftarrow f(embLayer((x_0, x_1, \ldots, x_{n-1}, \Phi'_0, \ldots, \Phi'_{b_m-1})), \varnothing, attnMask) $
-
-7: predictions = $ (a'_0, a'_1, \ldots, a'_{b_m-1}) $
-
-8: i = 1
-
-9: while i < numMicrobatches do
-
-10: $ (a'_{b_m i}, a'_{b_m i+1}, \ldots, a'_{b_m(i+1)-1}), \_ \leftarrow f(embLayer((x_0, x_1, \ldots, x_{n-1}, \Phi'_{b_m i}, \ldots, \Phi'_{b_m(i+1)-1})), kvCache, attnMask) $
-
-11: predictions $ \leftarrow $ predictions + $ (a'_{b_m i}, a'_{b_m i+1}, \ldots, a'_{b_m(i+1)-1}) $
-
-12: i $ \leftarrow $ i + 1
-
-13: end while
-
-14: return predictions
-
-Algorithm 1 is illustrated in Figure 11 to help with understanding. We remark that M-FALCON is not only applicable to HSTUs and GRs, but also broadly applicable as an inference optimization algorithm for other target-aware causal autoregressive settings based on self-attention architectures.
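-
The control flow of Algorithm 1 can be sketched in plain Python. This is a simplified illustration under several assumptions. It uses one attention layer with identity key/value "projections" in place of a trained $ b $-layer model, computes the history cache up front instead of taking it from the first microbatch's forward pass, and returns attention outputs rather than action predictions. The names `kv`, `attend`, and `m_falcon` are hypothetical.

```python
import math
import random

def kv(tokens):
    """Toy stand-in for K(X) and V(X): identity projections."""
    return [list(t) for t in tokens], [list(t) for t in tokens]

def attend(q, keys, vals):
    """Softmax attention output for a single query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, vals)) / z for c in range(d)]

def m_falcon(history, candidates, b_m):
    """Score all candidates in microbatches of size b_m, reusing cached history K/V."""
    hist_k, hist_v = kv(history)  # computed once, shared by every microbatch
    predictions = []
    num_microbatches = (len(candidates) + b_m - 1) // b_m
    for i in range(num_microbatches):
        batch = candidates[i * b_m:(i + 1) * b_m]
        cand_k, cand_v = kv(batch)  # only the b_m new tokens are projected
        for q, k, v in zip(batch, cand_k, cand_v):
            # Modified mask: a candidate sees the history and itself only.
            predictions.append(attend(q, hist_k + [k], hist_v + [v]))
    return predictions

rng = random.Random(0)
history = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
candidates = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
reference = m_falcon(history, candidates, 1)  # one candidate per microbatch
```

Because candidates never attend to each other, the outputs do not depend on the microbatch size, so $ b_m $ can be tuned purely for throughput.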
-
-#### H.1. Evaluation of Inference Throughput: Generative Recommenders (GRs) w/ M-FALCON vs DLRMs
-
-As discussed in Section 3.4, M-FALCON handles $ b_{m} $ candidates in parallel to amortize computation costs across all m candidates at inference time. To understand our design, we compare the throughput (i.e., the number of candidates scored per second, QPS) of GRs and DLRMs based on the same hardware setups.
-
-As shown in Figure 12 and Figure 13, GRs' throughput scales in a sublinear way based on the number of ranking-stage candidates (m), up to a certain region - m = 2048 in our case study - due to batched inference enabling cost amortization. This confirms the criticality of batched inference in causal autoregressive settings. Due to attention complexity scaling as $ O((n + b_m)^2) $, leveraging multiple microbatches by itself improves throughput. Caching further eliminates redundant linear and attention computations on top of microbatching. The two combined resulted in up to 1.99x additional speedups relative to the $ b_{m} = m = 1024 $ baseline using a single microbatch, as shown in Figure 13. Overall, with the efficient HSTU encoder design and utilizing M-FALCON, HSTU-based Generative Recommenders outperform DLRMs in terms of throughput on a large-scale production setup by up to 2.99x, despite GRs being 285x more complex in terms of FLOPs.
-
-
-
-
-Figure 12. End-to-end inference throughput: DLRMs vs GRs (w/ M-FALCON) in large-scale industrial settings. Note that this figure is the same as Figure 6, and is reproduced here to facilitate reading.
-
-
-
-
-
-Figure 13. End-to-end inference throughput: M-FALCON throughput scaling, on top of the 285x FLOPs GR model, in large batch settings where m (total number of ranking candidates) ranges from 1024 to 16384, and $ b_{m} = 1024 $.
+Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pp. 191–198, 2016. ISBN 9781450340359.
diff --git a/论文/imgs/grab/img_in_chart_box_110_146_498_401.jpg b/论文/imgs/grab/img_in_chart_box_110_146_498_401.jpg
new file mode 100644
index 0000000..644802c
Binary files /dev/null and b/论文/imgs/grab/img_in_chart_box_110_146_498_401.jpg differ
diff --git a/论文/imgs/grab/img_in_chart_box_172_627_504_881.jpg b/论文/imgs/grab/img_in_chart_box_172_627_504_881.jpg
new file mode 100644
index 0000000..10cb074
Binary files /dev/null and b/论文/imgs/grab/img_in_chart_box_172_627_504_881.jpg differ
diff --git a/论文/imgs/grab/img_in_chart_box_694_136_1079_398.jpg b/论文/imgs/grab/img_in_chart_box_694_136_1079_398.jpg
new file mode 100644
index 0000000..2f7f6fd
Binary files /dev/null and b/论文/imgs/grab/img_in_chart_box_694_136_1079_398.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_115_151_689_476.jpg b/论文/imgs/grab/img_in_image_box_115_151_689_476.jpg
new file mode 100644
index 0000000..d034e85
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_115_151_689_476.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_117_147_1083_497.jpg b/论文/imgs/grab/img_in_image_box_117_147_1083_497.jpg
new file mode 100644
index 0000000..3c2c0c0
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_117_147_1083_497.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_137_140_560_318.jpg b/论文/imgs/grab/img_in_image_box_137_140_560_318.jpg
new file mode 100644
index 0000000..820c5b5
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_137_140_560_318.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_190_187_498_411.jpg b/论文/imgs/grab/img_in_image_box_190_187_498_411.jpg
new file mode 100644
index 0000000..7ba3934
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_190_187_498_411.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_218_138_976_460.jpg b/论文/imgs/grab/img_in_image_box_218_138_976_460.jpg
new file mode 100644
index 0000000..362acc2
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_218_138_976_460.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_221_139_977_474.jpg b/论文/imgs/grab/img_in_image_box_221_139_977_474.jpg
new file mode 100644
index 0000000..75ae8dd
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_221_139_977_474.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_246_907_444_1138.jpg b/论文/imgs/grab/img_in_image_box_246_907_444_1138.jpg
new file mode 100644
index 0000000..ce03154
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_246_907_444_1138.jpg differ
diff --git a/论文/imgs/grab/img_in_image_box_738_151_1080_498.jpg b/论文/imgs/grab/img_in_image_box_738_151_1080_498.jpg
new file mode 100644
index 0000000..bbe1871
Binary files /dev/null and b/论文/imgs/grab/img_in_image_box_738_151_1080_498.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_111_654_336_834.jpg b/论文/imgs/hstu/img_in_chart_box_111_654_336_834.jpg
new file mode 100644
index 0000000..f03fdd2
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_111_654_336_834.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_131_1212_330_1373.jpg b/论文/imgs/hstu/img_in_chart_box_131_1212_330_1373.jpg
new file mode 100644
index 0000000..58a708c
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_131_1212_330_1373.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_132_124_555_383.jpg b/论文/imgs/hstu/img_in_chart_box_132_124_555_383.jpg
new file mode 100644
index 0000000..77ca53f
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_132_124_555_383.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_135_394_553_647.jpg b/论文/imgs/hstu/img_in_chart_box_135_394_553_647.jpg
new file mode 100644
index 0000000..0a9d900
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_135_394_553_647.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_136_657_552_901.jpg b/论文/imgs/hstu/img_in_chart_box_136_657_552_901.jpg
new file mode 100644
index 0000000..7f9139e
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_136_657_552_901.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_175_710_511_907.jpg b/论文/imgs/hstu/img_in_chart_box_175_710_511_907.jpg
new file mode 100644
index 0000000..35b0cf9
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_175_710_511_907.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_179_423_503_668.jpg b/论文/imgs/hstu/img_in_chart_box_179_423_503_668.jpg
new file mode 100644
index 0000000..3379266
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_179_423_503_668.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_180_131_507_377.jpg b/论文/imgs/hstu/img_in_chart_box_180_131_507_377.jpg
new file mode 100644
index 0000000..160da86
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_180_131_507_377.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_234_650_969_970.jpg b/论文/imgs/hstu/img_in_chart_box_234_650_969_970.jpg
new file mode 100644
index 0000000..cca0329
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_234_650_969_970.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_308_220_881_559.jpg b/论文/imgs/hstu/img_in_chart_box_308_220_881_559.jpg
new file mode 100644
index 0000000..7e112e7
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_308_220_881_559.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_346_660_570_836.jpg b/论文/imgs/hstu/img_in_chart_box_346_660_570_836.jpg
new file mode 100644
index 0000000..9d1f086
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_346_660_570_836.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_368_1212_568_1372.jpg b/论文/imgs/hstu/img_in_chart_box_368_1212_568_1372.jpg
new file mode 100644
index 0000000..10e64b1
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_368_1212_568_1372.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_607_1212_806_1371.jpg b/论文/imgs/hstu/img_in_chart_box_607_1212_806_1371.jpg
new file mode 100644
index 0000000..dda4f79
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_607_1212_806_1371.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_669_386_1026_685.jpg b/论文/imgs/hstu/img_in_chart_box_669_386_1026_685.jpg
new file mode 100644
index 0000000..5b38f7d
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_669_386_1026_685.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_671_123_1023_335.jpg b/论文/imgs/hstu/img_in_chart_box_671_123_1023_335.jpg
new file mode 100644
index 0000000..c941f5c
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_671_123_1023_335.jpg differ
diff --git a/论文/imgs/hstu/img_in_chart_box_843_1212_1043_1371.jpg b/论文/imgs/hstu/img_in_chart_box_843_1212_1043_1371.jpg
new file mode 100644
index 0000000..64c55a0
Binary files /dev/null and b/论文/imgs/hstu/img_in_chart_box_843_1212_1043_1371.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_116_139_1077_399.jpg b/论文/imgs/hstu/img_in_image_box_116_139_1077_399.jpg
new file mode 100644
index 0000000..bd1f7d6
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_116_139_1077_399.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_205_252_758_582.jpg b/论文/imgs/hstu/img_in_image_box_205_252_758_582.jpg
new file mode 100644
index 0000000..132c111
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_205_252_758_582.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_208_629_983_1103.jpg b/论文/imgs/hstu/img_in_image_box_208_629_983_1103.jpg
new file mode 100644
index 0000000..6ace2dd
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_208_629_983_1103.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_267_124_916_513.jpg b/论文/imgs/hstu/img_in_image_box_267_124_916_513.jpg
new file mode 100644
index 0000000..c67595d
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_267_124_916_513.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_307_135_886_606.jpg b/论文/imgs/hstu/img_in_image_box_307_135_886_606.jpg
new file mode 100644
index 0000000..8b01017
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_307_135_886_606.jpg differ
diff --git a/论文/imgs/hstu/img_in_image_box_643_129_1062_604.jpg b/论文/imgs/hstu/img_in_image_box_643_129_1062_604.jpg
new file mode 100644
index 0000000..0cd4d3b
Binary files /dev/null and b/论文/imgs/hstu/img_in_image_box_643_129_1062_604.jpg differ