More than a month after launching AI-generated review summaries in iOS 18.4, Apple is offering a detailed look at how the technology works behind the scenes. In a recent post on its Machine Learning Research blog, Apple shed light on the multi-step large language model (LLM) system that powers these summaries—revealing how it filters, analyzes, and condenses thousands of user reviews into concise, helpful snapshots.
Designed to help users quickly grasp the general sentiment around an app or game, the feature relies on a custom-built, multi-phase AI pipeline. While the summaries have been live since the iOS 18.4 update, Apple’s new explanation offers valuable insight into the thought process and engineering that went into their development.
At the core of the system is a structured LLM framework that prioritizes four main principles: safety, fairness, truthfulness, and helpfulness. This ensures the summaries not only reflect real user sentiment, but also do so in a balanced and responsible way.

The process starts by filtering out spam, profanity, and fraudulent content. Reviews that pass this initial screen are then sent through several LLM-powered stages. First, the system extracts individual “insights”—specific, focused statements that capture one aspect of a user’s experience. These insights are standardized, helping the system maintain consistency across varied review styles and tones.
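To make the idea concrete, here is a rough Python sketch of what an insight-extraction stage could look like. The prompt wording, the `llm` callable, and the `Insight` structure are illustrative assumptions for this article, not Apple's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Insight:
    text: str            # one standardized statement, e.g. "Checkout flow is slow"
    source_review: str   # the review it came from

def extract_insights(review: str, llm: Callable[[str], str]) -> List[Insight]:
    """Ask a language model to break one review into focused, standardized insights.

    `llm` is a stand-in for whatever model endpoint is available: it takes a
    prompt string and returns the model's text response.
    """
    prompt = (
        "Split the following app review into short, self-contained statements, "
        "one per line, each describing a single aspect of the user's experience:\n"
        f"{review}"
    )
    response = llm(prompt)
    return [
        Insight(text=line.strip(), source_review=review)
        for line in response.splitlines()
        if line.strip()
    ]

# Example with a stubbed model that returns two canned insights.
if __name__ == "__main__":
    stub = lambda _: "Great design, easy to navigate\nCrashes when uploading photos"
    for insight in extract_insights("Love the look but it crashes on upload.", stub):
        print(insight.text)
```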
Next comes dynamic topic modeling, which identifies and organizes common themes without relying on a fixed list of categories. Apple’s models use semantic analysis to group similar topics, while deprioritizing irrelevant or off-topic content—like a food review in the context of a delivery app. Priority is given to app-related feedback such as usability, performance, and design.
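In practice, "dynamic" topic modeling of this kind is often done by embedding each insight and clustering the vectors, so themes emerge from the data rather than from a fixed taxonomy. The greedy cosine-similarity grouping below is a simplified illustration of that idea, with toy vectors standing in for a real embedding model; it is not Apple's published method.

```python
import numpy as np

def group_by_similarity(embeddings: np.ndarray, threshold: float = 0.8) -> list[list[int]]:
    """Greedily assign each insight to the first existing topic whose centroid
    is similar enough; otherwise start a new topic. No fixed category list."""
    topics: list[list[int]] = []          # each topic is a list of insight indices
    centroids: list[np.ndarray] = []
    for i, vec in enumerate(embeddings):
        vec = vec / np.linalg.norm(vec)
        best, best_sim = None, threshold
        for t, centroid in enumerate(centroids):
            sim = float(vec @ (centroid / np.linalg.norm(centroid)))
            if sim >= best_sim:
                best, best_sim = t, sim
        if best is None:
            topics.append([i])
            centroids.append(vec.copy())
        else:
            topics[best].append(i)
            centroids[best] += vec         # running (unnormalized) centroid
    return topics

# Toy example: three 2-D "embeddings", the first two nearly parallel.
print(group_by_similarity(np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])))
# -> [[0, 1], [2]]: two similar insights share a topic, the third gets its own
```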
From there, the most relevant and representative insights are selected to shape the final summary. This ensures the AI captures a full range of opinions—positive, negative, and neutral—while maintaining alignment with the app’s overall user rating.
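One way to picture this balancing step: weight how many positive versus negative insights make it into the summary by the app's average rating, so a 4.5-star app isn't summarized mostly by complaints. The proportions and field names in the sketch below are assumptions made for illustration.

```python
def select_insights(insights, avg_rating, k=6):
    """Pick up to k insights whose positive/negative mix roughly tracks the
    app's average rating. Each insight is a dict with 'text', 'sentiment'
    ('positive'/'negative'/'neutral') and a 'support' count (how many reviews
    it represents). Field names are illustrative, not Apple's schema."""
    positive_share = (avg_rating - 1) / 4            # map 1-5 stars onto 0-1
    quota = {
        "positive": round(k * positive_share),
        "negative": round(k * (1 - positive_share)),
    }
    picked = []
    for sentiment in ("positive", "negative", "neutral"):
        pool = sorted(
            (i for i in insights if i["sentiment"] == sentiment),
            key=lambda i: i["support"],
            reverse=True,                             # most representative first
        )
        picked.extend(pool[: quota.get(sentiment, k - len(picked))])
    return picked[:k]
```

The design choice here is simply that selection is anchored to the overall rating rather than to raw insight counts, which is one plausible way to keep the summary from drifting away from how users actually score the app.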
The final step involves a third LLM trained to generate the summary itself. Using techniques like LoRA fine-tuning and Direct Preference Optimization (DPO), Apple refined the model’s ability to produce clear, natural summaries in just 100–300 characters. These summaries aim to be informative and readable, matching the tone and style expected in the App Store.
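Apple has not published its training code, but the DPO objective itself is standard. The sketch below shows the per-pair DPO loss alongside the 100-to-300-character constraint expressed as a simple filter; the log-probability values are stand-ins for what a fine-tuned (for example, LoRA-adapted) model and its frozen reference model would produce.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair Direct Preference Optimization loss:
    -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))), where the inputs
    are summed log-probabilities of the preferred (chosen) and dispreferred
    (rejected) summaries under the policy and the frozen reference model."""
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

def within_length_budget(summary: str, lo: int = 100, hi: int = 300) -> bool:
    """The 100-300 character target mentioned for the final summaries."""
    return lo <= len(summary) <= hi

# Toy numbers: the policy prefers the chosen summary more than the reference does,
# so the loss falls below log(2), the value at zero margin.
print(dpo_loss(-42.0, -55.0, -48.0, -53.0))
```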
To evaluate summary quality, Apple conducted extensive testing. Thousands of summaries were reviewed by human raters based on four criteria: safety, groundedness (accuracy), composition, and helpfulness. Summaries had to meet strict guidelines before being approved, with safety reviews requiring unanimous agreement from all evaluators.
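The approval rule described above can be read as a simple gate: every safety rater must pass the summary, and the remaining criteria must clear whatever quality bar Apple sets. The threshold and data shapes below are assumptions for illustration, not Apple's published rubric.

```python
def approve_summary(ratings: dict[str, list[bool]], quality_bar: float = 0.75) -> bool:
    """Approve only if *all* safety raters pass the summary (unanimity) and each
    remaining criterion (groundedness, composition, helpfulness) is passed by
    at least `quality_bar` of its raters. The 0.75 bar is a made-up example."""
    if not all(ratings["safety"]):
        return False
    for criterion in ("groundedness", "composition", "helpfulness"):
        votes = ratings[criterion]
        if sum(votes) / len(votes) < quality_bar:
            return False
    return True

print(approve_summary({
    "safety": [True, True, True],          # unanimous
    "groundedness": [True, True, True],
    "composition": [True, True, False],    # 2/3 of raters, below the 0.75 bar
    "helpfulness": [True, True, True],
}))  # -> False: composition falls short, so the summary is not approved
```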