Reasoning + RL: A new recipe for AI apps

I explore why forward-looking AI builders have moved from RAG to reasoning + RL and share the roadmap they’re charting for others.

Jul 26, 2025

At ICML (one of the year’s most important AI research conferences) earlier this month, reinforcement learning (RL) for LLMs dominated the agenda. This confirmed something I’ve observed in my conversations with builders over the past several months: the center of gravity for AI-focused founders has moved from retrieval-augmented generation (RAG) to RL, from knowledge retrieval to reasoning and decision-making.

This came home for me again on Monday when both Google and OpenAI achieved gold-medal performance on the International Mathematical Olympiad (IMO) using LLMs powered by advanced reasoning and RL techniques. It’s the first time AI systems have ever hit this milestone, and it’s a clear demonstration of the power of reasoning + RL.

RL is not new. So why the sudden resurgence? And, more importantly, what does it mean for AI-focused founders?

To get a clear signal, I spoke to three founders in FC’s portfolio who are at the forefront of this new reasoning + RL paradigm for building AI apps: Animesh Koratana, CEO of PlayerZero (AI for engineering quality); Ishan Chhabra, CEO of Oliv AI (AI agents for revenue teams); and Kabir Nagrecha, CEO of Tessera Labs (transforming enterprise business workflows with advanced reasoning + RL techniques).

This month, drawing on my conversations with Animesh, Ishan, and Kabir, I explore the shift from RAG to reasoning + RL, why it’s happening now, and what it means for AI builders.

Reasoning + RL lets software eat services

The short answer: RL is back in the spotlight because AI systems are being asked to think, act, and adapt in pursuit of business goals. Massive pretraining runs have given modern LLMs a broad map of how language works. The next frontier is getting them to do actual work. This means teaching them how to reason and take multi-step actions towards a goal.

This is where RL comes in. RL offers a way to train AI models via feedback and rewards so they can learn from outcomes and improve their decision-making over time. In a classic RL setup, an AI agent takes an action, and its environment returns a reward signal indicating the success of that action. Over time, the agent adjusts its strategy to increase desirable outcomes.
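
To make that loop concrete, here is a minimal sketch of the classic setup on a toy problem: an agent chooses among three actions, the environment returns a reward of 1 or 0, and the agent keeps a running estimate of each action’s value. The function name `run_bandit`, the success probabilities, and the epsilon-greedy strategy are all illustrative assumptions, and this is far simpler than how LLMs are actually trained with RL.

```python
import random

def run_bandit(success_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy multi-armed bandit (illustrative only)."""
    rng = random.Random(seed)
    n_actions = len(success_probs)
    counts = [0] * n_actions      # how often each action was taken
    values = [0.0] * n_actions    # running estimate of each action's reward

    for _ in range(steps):
        # Explore occasionally; otherwise exploit the best-looking action.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])

        # The environment returns a reward signal for the chosen action.
        reward = 1.0 if rng.random() < success_probs[action] else 0.0

        # Incremental mean: nudge the estimate toward the observed reward.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts

# Hypothetical environment: the third action succeeds most often.
values, counts = run_bandit([0.1, 0.5, 0.9])
```

Over many steps the agent ends up choosing the highest-reward action most of the time, which is the “adjusts its strategy” part of the loop in miniature.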

RL was key to creating the new class of reasoning models released publicly over the past 9 months, from OpenAI’s “o” series to DeepSeek’s R1 and Google’s Gemini 2.0. Alongside a new generation of base models, this reasoning + RL recipe is also powering a new class of AI apps built around agents, tool use, and long-horizon decision-making.

As Kabir explained, real-world enterprise processes span dozens of systems and thousands of conditional steps. Small decisions at the right moment can swing outcomes by orders of magnitude. You need models that can plan, adapt, and act, not just think. The combination of reasoning and RL helps teach them to do that effectively.

The “why now” for RL

Why are we seeing RL resurge as a central vector of AI advancement now? Why RL and not, say, synthetic data or new model architectures? There isn’t a single answer, but a few factors stand out:

Back to OpenAI’s roots: Long before ChatGPT, OpenAI was building game-playing agents, teaching robots to solve Rubik’s Cubes, and developing RL algorithms. The pendulum has swung back in that direction. If another company were leading the charge (say, Meta), we might be hearing more about leveraging social interaction data or personalization.
Limits of pre-training and fine-tuning: Over the past year, the gains from ever-larger pre-training runs have started to plateau. Simply predicting the next token doesn’t directly optimize for solving complex problems. Models like GPT-4 have vast implicit knowledge, yet they still make basic reasoning errors and lose coherence on long tasks. Fine-tuning on domain-specific data can make a model more knowledgeable in a niche, but it doesn’t inherently teach the model how to solve new problems and carry out extended tasks. RL pushes those capabilities further.
The need for new data: By 2023, models had ingested most of the publicly available text on the internet, so there isn’t much unseen knowledge left for pre-training corpora. By optimizing for an objective, RL generates its own training data through real interactions. You can view RL as an extreme case of a data-sparse regime: the agent learns from each attempt’s outcome, creating new data as it goes. From a builder’s standpoint, it answers the question: “How do I get the most relevant data for my application?” The answer is: don’t try to fabricate data in a vacuum; train on your application. In other words, have the model learn from the real decisions and scenarios that occur within your software.
Better RL techniques: The AI community has discovered that giving models more time to think yields better results - but to capitalize on this, we need ways to manage that thinking. This realization has inspired new research on blending LLMs with novel RL algorithms (like the DeepSeek team’s GRPO) so that the model can try out different paths through a space of possible steps and learn which strategies are most likely to succeed. The maturation of RL methods has made it feasible to push the envelope on reasoning even further.
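
As a rough illustration of the idea behind GRPO, the sketch below shows only the group-relative scoring step: several answers are sampled for the same prompt, each gets a reward, and each answer is scored by how it compares with the rest of its group. The function name, the example rewards, and the use of the population standard deviation are assumptions for illustration; the full algorithm also includes a policy update that is not shown here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    rewards: one reward per answer sampled for the same prompt.
    Returns one advantage per answer: positive if better than the group
    average, negative if worse. eps avoids division by zero when every
    answer received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat their group’s average get a positive score and are reinforced; answers below it are discouraged. Because the baseline comes from the group itself, this step needs no separate value model.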

How AI apps are evolving

Many early AI apps were LLM wrappers around a search engine or vector database. The developer’s job was to engineer good retrieval (using keywords or embeddings) and then have the LLM synthesize an answer from the retrieved text.

Today, a reasoning agent can break a task into multiple steps, plan its approach, gather information as needed in each step, and assemble the final result. Crucially, this whole workflow can be learned or handled by the model itself.

Let’s illustrate this with a concrete example. Imagine a user asks: “What are the differences between Gong and Clari?”

The RAG approach (the old way): A RAG-based system might take this query and search the web for pages that mention “Gong vs Clari.” Suppose it finds 10 relevant articles. It then feeds snippets from those into the LLM prompt, and the LLM generates a summary or answer. If the answer isn’t great, a developer might improve it by refining the search query, adding more context, or doing query expansion (e.g. also searching for related terms like “Gong pricing” or “Clari features”). Essentially, the developer is figuring out what additional queries or data might lead to a better answer, and adjusting the retrieval component accordingly. This approach is similar to what search engines like Google have focused on for years: given a user query, find and rank the most relevant results. In this setup, the LLM then summarizes those results in one pass.
The reasoning agent approach (the new way): A reasoning agent tackles the same query by figuring out how to answer it. For example, a reasoning agent might reason to itself: “The user is asking for a comparison between Gong and Clari. I should find information on features, pricing, and user opinions for each. Maybe search for reviews of Gong, reviews of Clari, and any direct comparison articles.” The agent then generates a series of actions: it performs multiple searches (like “Gong product features,” “Clari vs Gong case study,” etc.), each time reading the results and deciding the next step. It could branch out to explore specific aspects (maybe it finds something about integration capabilities and decides to look deeper there). Finally, it compiles the findings into an answer for the user. All of this (the planning of search queries, deciding when it has enough info, and synthesizing) is driven by the model’s own reasoning policy.
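
The structural difference between the two approaches can be sketched in a few lines. Everything here is a stand-in: `FAKE_INDEX` is a hypothetical in-memory “search tool” with placeholder snippets, `stub_llm` just joins text, and `scripted_policy` is a hand-written substitute for the decision the model would make itself in a real reasoning agent.

```python
# Hypothetical in-memory search tool; snippets are placeholders, not real data.
FAKE_INDEX = {
    "Gong vs Clari": "(placeholder) comparison article snippet",
    "Gong features": "(placeholder) Gong features snippet",
    "Clari features": "(placeholder) Clari features snippet",
}

def search(query):
    return FAKE_INDEX.get(query, "")

def stub_llm(question, context):
    # Stand-in for a model call: simply joins the non-empty context.
    return " | ".join(c for c in context if c)

def rag_answer(question, llm, query="Gong vs Clari"):
    # Old way: the developer picks the query; one retrieval, one synthesis pass.
    return llm(question, [search(query)])

def agent_answer(question, next_action, llm, max_steps=5):
    # New way: a policy decides each next search, and when to stop.
    notes = []
    for _ in range(max_steps):
        query = next_action(question, notes)
        if query is None:  # the policy judges it has enough information
            break
        notes.append((query, search(query)))
    return llm(question, [text for _, text in notes])

def scripted_policy(question, notes):
    # Hand-written stand-in; in a reasoning agent the model makes this choice.
    plan = ["Gong features", "Clari features", "Gong vs Clari"]
    done = {query for query, _ in notes}
    return next((q for q in plan if q not in done), None)

question = "What are the differences between Gong and Clari?"
single_pass = rag_answer(question, stub_llm)
multi_step = agent_answer(question, scripted_policy, stub_llm)
```

In `rag_answer` the control logic (which query to run) lives in the developer’s code. In `agent_answer` it lives in whatever is passed as `next_action`, which is the piece that reasoning + RL aims to learn.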

Put simply, the control logic is shifting from humans to AI. The most advanced builders are spending less time hand-tuning prompts and retrieval heuristics, and more time training AI systems (with reasoning + RL) to handle that logic themselves.

What this means for AI builders

What does this all mean for founders? Here are a few early best practices we’re seeing:

🧠 1. Treat reasoning as your product’s superpower

Reasoning models started off as cutting-edge but slow research prototypes. But we know how this story goes: today’s expensive model is a commodity in a few months. Optimization work is underway to make reasoning models more efficient, distill them into smaller versions, and run them on specialized hardware. In the coming months, tasks that were too latency-sensitive or too expensive for a reasoning agent approach will open up.
That means you should be ready for on-demand reasoning to become viable for your product. Identify moments where your users hit walls because the AI gives shallow, single-shot responses. Where would your product experience be 10x better if the AI could break down complex requests, pursue sub-goals, and synthesize information across multiple steps before responding?

🛠️ 2. Focus on domain-specific understanding and evals

In the 2023 paradigm, if you had a mountain of data (say all the claims assessment guidelines for an insurance company, or a unique dataset of legal cases), you could fine-tune an LLM or use RAG to make a domain-expert bot. Your moat was that data: others didn’t have it, so their models would be less knowledgeable in that niche.
In the new paradigm, static data is not enough. We foresee that the hardest part of building high-performing reasoning agents will be coming up with excellent evals and reward functions for your specific domain. Your IP becomes the sophisticated eval framework that captures what success truly means for your users, and the dynamic data that comes from users interacting with your product.
In domains like coding or math, evaluating success on many tasks is straightforward (Did the program run? Is the answer correct?). In fuzzier domains (shopping recommendations, essay writing, customer support, etc.) you need to get creative. Consider rating systems, human preference comparisons (“Which output is better?”), and proxy metrics (clicks, task completion rates). The key is turning your product intuition into clear, optimizable signals. There’s no universal template here; eval design starts with deeply understanding what success means in your domain.
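
One possible way to turn fuzzy signals into a single optimizable number is sketched below: a win rate computed from pairwise human preference comparisons, blended with a proxy metric such as task completion. The function names, the tuple layout, and the 0.7 weighting are all assumptions for illustration; the right signals and weights depend entirely on your domain.

```python
def win_rate(comparisons, candidate):
    """Fraction of head-to-head comparisons that `candidate` won.

    comparisons: list of (option_a, option_b, winner) tuples collected from
    human raters answering "Which output is better?".
    """
    relevant = [c for c in comparisons if candidate in (c[0], c[1])]
    if not relevant:
        return 0.0
    wins = sum(1 for c in relevant if c[2] == candidate)
    return wins / len(relevant)

def blended_score(preference_win_rate, task_completion_rate, w_pref=0.7):
    """Weighted mix of a preference signal and a proxy metric (weights assumed)."""
    return w_pref * preference_win_rate + (1 - w_pref) * task_completion_rate

# Hypothetical ratings comparing three versions of an agent.
ratings = [
    ("v1", "v2", "v2"),
    ("v1", "v2", "v2"),
    ("v1", "v2", "v1"),
    ("v2", "v3", "v2"),
]
score_v2 = blended_score(win_rate(ratings, "v2"), task_completion_rate=0.5)
```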

🔁 3. Build for continuous improvement from day one

With RL-based systems, every user interaction spins your data flywheel faster. Suppose you’re building an AI coding agent. Each debugging session generates rich feedback signals: Did the fix resolve the bug? Did tests pass? Did the developer accept the suggestion? These outcomes immediately become training data for your RL loop.
This creates a compounding advantage: better reasoning → happier users → more usage → more feedback data → even better reasoning. A competitor starting with the same base model but lacking the stream of interaction data you’re continuously generating and learning from will fall further behind each day.
To capitalize on this, design your product from the beginning to capture outcomes and feed them back into your models. Identify the key moments of truth in your user experience (Was the answer helpful or not? Did the user correct the AI’s action? Did the agent’s decision lead to a successful result or an error?) and make sure you log them. Even if you’re not ready to fully train a model on those signals yet, start gathering the data now - you’ll thank yourself later.
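
A minimal sketch of what capturing those moments of truth might look like, assuming an append-only JSON Lines log. The `OutcomeEvent` fields, the outcome labels, and the reward values are hypothetical; the point is only that each outcome is recorded in a structured form you can train on later.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class OutcomeEvent:
    """One moment of truth in the product (all field names are assumptions)."""
    session_id: str
    agent_action: str   # what the agent did or suggested
    outcome: str        # e.g. "accepted", "corrected", "error"
    reward: float       # your own mapping from outcome to a number
    timestamp: float = field(default_factory=time.time)

def encode_outcome(event):
    """Serialize an event as one JSON line."""
    return json.dumps(asdict(event))

def append_outcome(path, event):
    """Append the event to a JSON Lines file for later analysis or training."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(encode_outcome(event) + "\n")

# Hypothetical event from a coding agent: the developer accepted a suggested fix.
event = OutcomeEvent("session-123", "suggest_fix", "accepted", reward=1.0)
line = encode_outcome(event)
```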

🧩 4. Prime your reasoning agent with process data

In the RAG paradigm, you might drop a knowledge base into a vector store. In the reasoning + RL paradigm, you want to build a process knowledge base. For instance, if you’re automating warehouse inventory management, you’d want to collect full walkthroughs of how the best human operations manager would handle a tricky inventory balancing problem. That could be in the form of annotated step-by-step examples, transcripts, or workflows.
The bar for domain expertise is higher here than it was for RAG (where you could sometimes get by with just dumping documents). Now you need to distill what an expert would actually do. But once you have that, your agent can learn much more effectively. Remember: an AI system is only as good as the feedback and guidance it gets. Supplying some “golden path” examples of reasoning is one of the best forms of guidance you can give it.

🧯 5. Prepare for new failure modes

RL-based systems can introduce new kinds of failures, so builders need to be vigilant. One issue is reward hacking, where the AI finds a shortcut to get a high reward that isn’t actually what you intended (a classic RL problem). To mitigate this, you need to design reward functions carefully and often include multiple objectives or constraints (e.g. maximizing profits is good but only within the bounds of valid accounting principles). Domain expertise is crucial here: you need to anticipate ways the agent might go astray and guard against them.
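
As a rough sketch of the profit example, the function below pays out the objective only when every constraint holds and returns a penalty otherwise. The constraint names, the outcome fields, and the penalty size are invented for illustration; real accounting checks would be far more involved.

```python
def constrained_reward(outcome, constraints, penalty=100.0):
    """Return (reward, violated_constraint_names).

    outcome: dict describing what the agent did (fields are assumptions).
    constraints: dict mapping a constraint name to a check function that
    returns True when the outcome is acceptable.
    """
    violated = [name for name, check in constraints.items() if not check(outcome)]
    if violated:
        # Breaking the rules can never be worth it, however large the profit.
        return -penalty * len(violated), violated
    return outcome["profit"], violated

# Hypothetical constraints standing in for "valid accounting principles".
constraints = {
    "books_balance": lambda o: o["debits"] == o["credits"],
    "no_backdated_entries": lambda o: not o["backdated"],
}

honest = {"profit": 12.0, "debits": 50, "credits": 50, "backdated": False}
hacked = {"profit": 90.0, "debits": 50, "credits": 20, "backdated": True}
honest_reward, _ = constrained_reward(honest, constraints)
hacked_reward, hacked_violations = constrained_reward(hacked, constraints)
```

Returning the names of the violated constraints alongside the reward also gives you a log of how the agent tried to cut corners, which helps when deciding which guardrails to add next.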
Another issue is unpredictability. An RL-trained agent might come up with an unexpected way to solve a problem - sometimes brilliant, sometimes nonsense. Failures can cascade over multi-step workflows. This can make debugging trickier than with a single-turn QA system.
In high-stakes domains, you’ll want to keep a human in the loop - e.g. an AI agent drafts an analysis, but a human signs off before any funds are moved or emails sent. Build in circuit breakers and escalation paths. And take advantage of the transparency that reasoning traces offer: unlike a conventional LLM that just outputs an answer, a reasoning agent that “thinks out loud” lets you catch issues and continually refine its decision-making.
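
One way such gating could look in code is sketched below, assuming a simple rule set: high-stakes actions always wait for human approval, and a circuit breaker escalates everything after a run of consecutive failures. The action names, the threshold of three, and the routing labels are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of actions that must never run without sign-off.
HIGH_STAKES_ACTIONS = {"move_funds", "send_email"}

@dataclass
class CircuitBreaker:
    """Trips after too many consecutive failed steps (threshold is assumed)."""
    max_consecutive_failures: int = 3
    consecutive_failures: int = 0

    def record(self, succeeded):
        # A success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1

    @property
    def tripped(self):
        return self.consecutive_failures >= self.max_consecutive_failures

def route(action, breaker):
    """Decide how an agent-proposed action should be handled."""
    if breaker.tripped:
        return "escalate_to_human"      # stop cascading failures
    if action in HIGH_STAKES_ACTIONS:
        return "await_human_approval"   # human signs off first
    return "auto_execute"

breaker = CircuitBreaker()
before = [route("draft_analysis", breaker), route("move_funds", breaker)]
for _ in range(3):
    breaker.record(succeeded=False)
after = route("draft_analysis", breaker)
```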

What’s next for reasoning + RL?

We’re still in the early days of this new paradigm, and it’s evolving rapidly. Here are a few trends we anticipate in the coming year as the industry embraces the reasoning + RL approach:

More efficient reasoners: Right now, the frontier reasoning models are large and computationally heavy. Future research and engineering efforts will focus on making these capabilities more efficient. This will involve developing smarter algorithms for managing the “thinking” process in LLMs to streamline exploration and concentrate on the most promising reasoning paths. In effect, models will get smarter about how they use their thinking time.
Better RL techniques for open-ended tasks: We expect a lot more work on RL algorithms that are sample-efficient and robust in language domains. Think of things like human preference modeling but on steroids: methods that can squeeze the most learning out of every example of human feedback or every trial the agent runs in its environment. We may also see model-based RL techniques, where the AI uses a learned model of its environment to simulate outcomes before taking an action.
Tooling for feedback and evals: We may see a new crop of products and tools to support the reasoning + RL product development process. This could be a platform that helps you define custom evals for your agent and automatically measure its performance on them. Or services that plug into your app to capture user feedback at key moments and help improve your model based on those signals. Another angle is tools for domain experts to input their knowledge more directly: maybe a UI where a non-programmer can outline the correct process for a task, which the system can then translate into reward functions or fine-tuning data. Anything that lowers the barrier to implementing a feedback loop in a new domain will be valuable.

The age of reasoning is here

We’re now creating software that doesn’t just know things but can figure them out on its own. Builders who embrace the reasoning + RL paradigm will create the next generation of standout AI products. Those who stick with RAG may wake up to find their product feeling outdated, like a know-it-all who can’t solve a puzzle.

As services become software, the ability to solve the puzzle is where the true value lies.