REGEN: Empowering personalized recommendations with natural language

June 27, 2025

Krishna Sayana, Software Engineer, and Hubert Pham, Research Scientist, Google Research

We present a new benchmark dataset to help LLMs provide more contextualized recommendations through natural language interactions.

Large language models (LLMs) are reshaping how recommender systems interact with users. Traditional recommendation pipelines focus on predicting the next item a user might like — books, shoes, office supplies, etc. — based on past interactions. But the real goal goes further: we want systems that interact with users, understand their needs, adapt through natural language feedback, and explain why a recommendation makes sense. However, no datasets currently exist to explore these new capabilities.

To address this gap, we developed Reviews Enhanced with GEnerative Narratives (REGEN), a new benchmark dataset that incorporates item recommendations, natural language features composed of synthetic user critiques, and personalized narratives comprising purchase reasons and product endorsements. Rather than start from scratch, we augmented the widely-used Amazon Product Reviews dataset by synthesizing missing conversational elements with the help of Gemini 1.5 Flash. This dataset allows us to explore and benchmark new recommender architectures, both those that incorporate user feedback (e.g., FLARE) and those that output natural language consistent with the recommendations (e.g., LUMEN). Our results show that LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

Building the REGEN dataset

Existing datasets for training conversational recommenders often fall short in capturing the nuances of real-world conversations. They may focus only on sequential item prediction or short dialog snippets, or lack explicit user feedback. We chose the Amazon Product Reviews dataset because its large item vocabulary, much of which is potentially unfamiliar to an LLM, makes it a particularly useful testbed.

REGEN enriches the Amazon Reviews dataset with two key components:

Critiques

Critiques are a crucial aspect of conversational recommendation, allowing users to express their preferences and guide the system. In REGEN, critiques are generated to steer the recommender from a current item to a similar, desired item. For example, a user might critique a "red ball-point pen" by saying, "I'd prefer a black one".

To ensure the relevance of critiques, we generate them only for adjacent item pairs that are sufficiently similar, using the Amazon Reviews–provided hierarchical item categories as a proxy for similarity. The Gemini 1.5 Flash model generates several critique options for each pair, from which we select one at random to include in the dataset.
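
A minimal sketch of this synthesis step, assuming hypothetical item records with a `categories` path and an `llm_generate` helper that wraps the Gemini 1.5 Flash call and returns a list of candidate critiques (the names are illustrative, not the actual REGEN pipeline code):

```python
import random

def share_category_prefix(item_a, item_b, depth=2):
    """Use the hierarchical category path as a similarity proxy: two items
    count as similar if their category paths share a prefix of this depth."""
    return item_a["categories"][:depth] == item_b["categories"][:depth]

def synthesize_critique(prev_item, next_item, llm_generate):
    """Generate candidate critiques steering from prev_item to next_item
    with an LLM, then keep one at random (None for dissimilar pairs)."""
    if not share_category_prefix(prev_item, next_item):
        return None  # only sufficiently similar adjacent pairs get critiques
    prompt = (
        f"A user just bought: {prev_item['title']}.\n"
        f"The next item they want is: {next_item['title']}.\n"
        "Write 3 short, natural-language critiques the user might say to "
        "steer the recommendation from the first item toward the second."
    )
    candidates = llm_generate(prompt)  # e.g., a wrapper around Gemini 1.5 Flash
    return random.choice(candidates)
```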

Narratives

Narratives provide rich contextual information about recommended items, enhancing the user experience. REGEN includes diverse narratives, such as:

  • Purchase reasons: Explanations for why an item might be suitable for a user.
  • Product endorsements: Descriptions highlighting the benefits and features of an item.
  • User summaries: Concise profiles of user preferences and purchase history.

These narratives vary in contextualization and length, providing a rich dataset for training conversational recommenders.
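
To make the structure concrete, here is an illustrative sketch of how a single REGEN example can be viewed; the field names are chosen for exposition and are not the released schema.

```python
# Illustrative shape of a single REGEN training example; the field names are
# assumptions for exposition, not the released schema.
regen_example = {
    "user_id": "A1B2C3",
    "history": [                       # ordered purchase / review history
        {"item_id": "B001", "title": "Spiral notebook, college ruled", "rating": 5},
        {"item_id": "B002", "title": "Red ball-point pen", "rating": 4},
    ],
    "critique": "I'd prefer a black one",  # steers from the last item toward the target
    "next_item": {"item_id": "B003", "title": "Black ball-point pen"},
    "narratives": {
        "purchase_reason": "A black pen matches the user's preference for everyday note-taking.",
        "product_endorsement": "Writes smoothly and dries quickly without smudging.",
        "user_summary": "Prefers reliable, inexpensive office supplies for daily use.",
    },
}
```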

Experiments

To evaluate REGEN effectively, we didn’t just want to test whether models could recommend the right item; we wanted to see whether they could communicate their reasoning, adapt to feedback, and generate language that feels tailored to the user. So we framed a new kind of task: conversational recommendation that’s jointly generative. The idea is simple but powerful: given a purchase history and, optionally, a natural language critique (e.g., “I need something with more storage”), a model must recommend the next item and generate a contextual narrative about it.

This task reflects how users naturally interact with recommendation systems when given the opportunity to express preferences in their own words. It also moves away from disjointed modeling, where recommendation and language generation are handled separately. Instead, we treat both as part of a unified, end-to-end objective.
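
As a rough illustration, under the same assumed record shape as the sketch above, the joint task can be framed as a single input/target pair; the formatting is hypothetical rather than the exact templates used in the paper.

```python
def format_joint_task(example):
    """Render a REGEN-style example as an (input, target) pair for the joint
    recommend-and-narrate task (illustrative formatting, not the paper's)."""
    history = "; ".join(item["title"] for item in example["history"])
    prompt = f"Purchase history: {history}."
    if example.get("critique"):
        prompt += f" User critique: {example['critique']}."
    prompt += " Recommend the next item and explain why it fits."
    target = (
        f"<item>{example['next_item']['item_id']}</item> "
        f"{example['narratives']['purchase_reason']}"
    )
    return prompt, target
```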

To explore different modeling approaches, we developed and implemented two baseline architectures. The first is a hybrid system, where a sequential recommender (FLARE) predicts the next item based on collaborative filtering and content signals. That output is then fed into a lightweight LLM (Gemma 2B), which is responsible for generating the narrative. This setup reflects a common architecture in production systems, where different components specialize in different stages of the pipeline.
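
A minimal sketch of this two-stage flow, with `flare_model` and `gemma_generate` standing in as placeholders for the actual FLARE and Gemma 2B calls:

```python
def hybrid_recommend_and_narrate(history, critique, flare_model, gemma_generate):
    """Two-stage hybrid baseline: a sequential recommender picks the item,
    then a lightweight LLM writes the narrative for it."""
    # Stage 1: the recommender ranks items from collaborative-filtering
    # and content signals, optionally conditioned on the critique.
    next_item = flare_model.predict_next(history, critique=critique)

    # Stage 2: the predicted item is handed to the LLM as part of the prompt,
    # so narrative quality benefits from knowing the recommended item.
    titles = [item["title"] for item in history]
    prompt = (
        f"User history: {titles}\n"
        f"Critique: {critique or 'none'}\n"
        f"Recommended item: {next_item['title']}\n"
        "Write a short, personalized reason this item fits the user."
    )
    return next_item, gemma_generate(prompt)
```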


The second architecture is LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives). LUMEN does everything inside a single LLM. It’s trained end-to-end to handle critiques, generate recommendations, and produce narratives in a coherent way. During decoding, the model decides when to emit an item ID and when to continue generating natural language. We modified the vocabulary and embedding layers to support both types of outputs — item tokens and text tokens — which allowed the model to treat item recommendation as just another part of the generative process.
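
As a rough sketch of the vocabulary-extension idea (not the LUMEN implementation), one can append a row per catalog item to the token embedding table so that item IDs become ordinary tokens the model can emit during decoding:

```python
import torch
import torch.nn as nn

def extend_vocab_with_items(embedding: nn.Embedding, num_items: int) -> nn.Embedding:
    """Append one embedding row per catalog item so the model can emit item
    IDs as ordinary tokens alongside text (a sketch of the idea only)."""
    old_vocab, dim = embedding.weight.shape
    extended = nn.Embedding(old_vocab + num_items, dim)
    with torch.no_grad():
        extended.weight[:old_vocab] = embedding.weight          # keep text tokens
        nn.init.normal_(extended.weight[old_vocab:], std=0.02)  # new item tokens
    return extended

# An item's token ID is then simply old_vocab + its catalog index, so decoding
# can interleave item tokens with regular text tokens in a single sequence.
```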


This dual approach — hybrid versus fully generative — lets us benchmark the trade-offs between modularity and integration, and provides a solid foundation for measuring how well models can tackle this more holistic conversational task.

Results

Our experiments show that REGEN can meaningfully challenge and differentiate models across both recommendation and generation tasks. In the Amazon Product Reviews dataset's Office domain, we observed that incorporating user critiques into the input consistently improved recommendation metrics across both architectures. For example, on Recall@10, a metric that measures how often the desired item appears in the top 10 predicted results, the FLARE hybrid model’s already state-of-the-art score of 0.124 increased to 0.1402 when critiques were included, a notable bump that underscores the value of language-guided refinement.
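
For reference, Recall@10 reduces to a simple per-example check, averaged over the evaluation set; a minimal sketch assuming ranked prediction lists:

```python
def recall_at_k(ranked_item_ids, target_item_id, k=10):
    """1.0 if the desired item appears in the top-k predictions, else 0.0."""
    return float(target_item_id in ranked_item_ids[:k])

def mean_recall_at_k(examples, k=10):
    """Average per-example recall over (ranked_predictions, target) pairs,
    giving Recall@10 figures like the 0.124 -> 0.1402 improvement above."""
    return sum(recall_at_k(preds, target, k) for preds, target in examples) / len(examples)
```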

LUMEN’s performance was competitive, albeit slightly lower on traditional recommendation metrics. That’s not surprising, given the increased difficulty of generating the item and narrative jointly in a single pass. However, its real strength lies in its ability to maintain coherence between the item and the text it produces. Unlike modular pipelines, where disconnects between components can lead to awkward or generic explanations, LUMEN’s narratives tend to align more naturally with the user’s history and critique context.

On the generation side, we evaluated outputs using BLEU, ROUGE, and semantic similarity. The hybrid model generally scored higher on BLEU and ROUGE, especially for product endorsements and purchase reasons, likely because the LLM was given the correct item as a prompt. LUMEN, by contrast, had slightly lower n-gram overlap but maintained strong semantic alignment, particularly for user summaries that relied more on long-term user behavior than on the specific item (see paper for details).
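
A sketch of how such generation metrics can be computed with common open-source libraries (sacrebleu, rouge-score, and sentence-transformers here; the paper's exact metric implementations and embedding model may differ):

```python
from sacrebleu import sentence_bleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rougeL"])
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def score_narrative(prediction: str, reference: str) -> dict:
    """Score one generated narrative with n-gram overlap (BLEU, ROUGE-L)
    and embedding-based semantic similarity."""
    bleu = sentence_bleu(prediction, [reference]).score
    rouge_l = _rouge.score(reference, prediction)["rougeL"].fmeasure
    sim = util.cos_sim(_encoder.encode(prediction), _encoder.encode(reference)).item()
    return {"bleu": bleu, "rougeL": rouge_l, "semantic_similarity": sim}
```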


These results highlight a few interesting dynamics. Narratives that depend primarily on the user, like summaries of their preferences, are easier for both models to generate consistently. But when the narrative is tightly coupled with the item context, like a product endorsement, performance hinges more on recommendation accuracy. If the model recommends the wrong item, it can throw off the entire narrative. This effect is more pronounced in LUMEN, where both the item and the narrative are co-generated, making it a stricter test of end-to-end alignment.

We also evaluated performance on a much larger item space using the Clothing domain, which has over 370,000 unique items (5x–60x larger than any other product category). To our knowledge, no other work evaluates on this much larger Clothing dataset, a key distinction of FLARE and REGEN. Even in this more complex setting, the hybrid system held up well, and again we saw clear gains in Recall@10, from 0.1264 to 0.1355, when critiques were included, validating the design of REGEN as a benchmark that rewards nuanced, user-guided reasoning.


Conclusion

REGEN provides a dataset with consistent user preferences, recommendations, and generated narratives, enabling the study of LLM capabilities in conversational recommendation. We evaluated REGEN using both LUMEN, an LLM-based model for joint recommendation and narrative generation, and sequential recommender models, demonstrating the dataset's utility. We believe REGEN serves as a fundamental resource for studying the capabilities of conversational recommender models, a crucial step towards personalized multi-turn systems.

REGEN advances conversational recommendation by integrating language as a fundamental element, enhancing how recommenders interpret and respond to user preferences. This approach fosters research into multi-turn interactions, where systems can engage in extended dialogues to refine recommendations based on evolving user feedback.

The dataset also encourages the development of more sophisticated models and training methodologies. It supports exploration into scaling model capacity, utilizing advanced training techniques, and adapting the methodology across different domains beyond Amazon reviews, such as travel, education, and music.

Ultimately, REGEN sets a new direction for recommender systems, emphasizing comprehension and interaction, which paves the way for more intuitive, supportive, and human-like recommendation experiences.

Acknowledgements

We would like to thank our co-authors of the FLARE and REGEN papers, without whom this work would not be possible: Liam Hebert from the University of Waterloo, and Kun Su, James Pine, Marialena Kyriakidi, Yuri Vasilevski, Raghavendra Vasudeva, Ambarish Jash, Sukhdeep Sodhi, and Anushya Subbiah from Google Research. Additionally, we are grateful for the support and guidance of our leadership: Vikram Aggarwal, John Anderson, Dima Kuzmin, Emil Praun, and Sarvjeet Singh. We are also grateful to Kimberly Schwede, Mark Simborg, and the Google Research Blog editorial staff for helping us present our work to a larger audience. Finally, we appreciate the authors of “Justifying Recommendations Using Distantly-Labeled Reviews and Fine-Grained Aspects” for releasing the Amazon Product Reviews dataset used in our work.