Deep Research API: Moving Beyond Web Search to Internal Data
Deep research is the best thing to happen to search in years. Ask a question, get a cited report that synthesizes dozens of sources - in minutes, not hours. It’s genuinely powerful.
A deep research API thus lets you programmatically send a question and get back a cited, synthesized report—while behind the scenes, an AI agent iteratively searches, reasons over, and synthesizes information from a large corpus.
But today, it only works on the web.
If you’re building enterprise applications over internal data—contracts, claims files, earnings calls, regulatory filings—you probably realize that you need the same comprehensive and multi-hop retrieval pointed at your own documents. But the gap between “deep research on the web” and “deep research on your data” is wider than it looks.
How deep research APIs work on the web
While not widely publicized, the major providers (OpenAI, Perplexity, Exa, Parallel) share roughly the same training and architecture setup (see Search-R1, Jin et al., 2025).
We’ll take a closer look at how this works under the hood, because understanding the web implementation is key to seeing what needs to change for internal data.
The think-search-answer loop
At a high level, Search-R1 trains LLMs to interleave reasoning with search engine calls using RL. The architecture uses structured tokens to control the loop:
- <think>: the model reasons about what it knows and what it still needs
- <search>: the model generates a search query and triggers retrieval
- <information>: retrieved passages get injected back into the context
- <answer>: the model produces its final response
A trace for a trained search agent might look something like this:
Query: "What are the leading theories on dark matter and what experimental evidence supports each?"

<think>I need to identify the main dark matter theories, then find experimental evidence for each. Start broad.</think>
<search>dark matter candidate theories overview</search>
<information>[3 retrieved passages about WIMPs, axions, MACHOs, modified gravity...]</information>

<think>I have the main candidates. Now I need experimental evidence. WIMPs first — LUX-ZEPLIN and XENON are the major experiments.</think>
<search>WIMP dark matter experimental evidence LUX-ZEPLIN XENONnT results</search>
<information>[2 retrieved passages...]</information>

<think>Good coverage on WIMPs. Now axions — ADMX is the main experiment.</think>
<search>axion detection experiments ADMX CASPEr</search>
<information>[2 retrieved passages...]</information>

<think>I have enough to answer comprehensively.</think>
<answer>[Structured report with citations]</answer>

Critical note: the model here isn’t following a script. It learned when to emit <search> vs. <answer> tokens. Unlike standard RAG pipelines—which retrieve once and generate—deep research agents retrieve iteratively, refining their queries based on what they find. This is sometimes called agentic RAG, though deep research takes it further with RL-trained search policies. The model generates search tokens when it decides it needs more information and answer tokens when it decides it has enough.
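The loop above can be sketched as plain control flow. This is a minimal illustration, not any provider's implementation: `generate` stands in for the trained model deciding what to do next, and `retrieve` stands in for the search index.

```typescript
// The model's next action: either a <search> query or a final <answer>.
type Step =
  | { kind: "search"; query: string }
  | { kind: "answer"; text: string };

// Outer think-search-answer loop. The model decides when to stop;
// maxSearches is only a safety budget.
function runAgent(
  generate: (context: string) => Step,
  retrieve: (query: string) => string[],
  question: string,
  maxSearches = 8
): string {
  let context = `<question>${question}</question>`;
  for (let i = 0; i < maxSearches; i++) {
    const step = generate(context);
    if (step.kind === "answer") return step.text; // model emitted <answer>
    // Model emitted <search>: run retrieval, inject <information> back in.
    context +=
      `\n<search>${step.query}</search>` +
      `\n<information>${retrieve(step.query).join("\n")}</information>`;
  }
  return "[search budget exhausted]";
}
```

The key structural point: retrieval results land back in the context, and the next `generate` call sees everything retrieved so far.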
Why RL, not prompt engineering
You might be thinking, “why do we need RL when we can just give agents a search tool?”
Models, by themselves, are not great at searching. While they’re general-purpose beasts, their baseline ability to know what to query, when to search again, and when to stop isn’t sufficient for production use cases (more on that later). This is where reinforcement learning fills the gap.
Now, what does this look like under the hood?
- The reward is purely outcome-based. RL cares only about the outcome: in Search-R1, that’s exact match on the final answer. It’s binary: did the model get the right answer? After many loops, the model learns a search-strategy policy entirely from whether its final answers were correct.
- The training loop: for each training question, the model generates multiple complete trajectories (think-search-answer sequences). Each trajectory then gets scored on just the final answer. The policy gradient pushes the model toward trajectories that produced correct answers and away from ones that didn’t.
- Retrieved token masking is essential. During back-propagation, gradients flow only through tokens the model generated—its reasoning and search queries—not through the retrieved passages. Without this, the model learns spurious correlations with the content of specific retrieved documents instead of learning how to search effectively.
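A minimal sketch of that masking, assuming each token in a trajectory is tagged with who produced it. `Token`, `lossMask`, and `tokenWeights` are illustrative names, not Search-R1's actual implementation.

```typescript
// A token is either generated by the model (reasoning, queries) or
// injected from retrieval (<information> passages).
type Token = { text: string; source: "model" | "retrieved" };

// 1 where gradients should flow, 0 where they are masked out.
function lossMask(tokens: Token[]): number[] {
  return tokens.map((t) => (t.source === "model" ? 1 : 0));
}

// Per-token policy-gradient weight: the trajectory-level reward
// (e.g. exact match, 0 or 1) applied only to model-generated tokens.
function tokenWeights(tokens: Token[], reward: number): number[] {
  return lossMask(tokens).map((m) => m * reward);
}
```

Retrieved tokens get weight 0 regardless of reward, so the model is only ever credited for what it chose to generate.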
What the model learns in practice:
- When to continue vs. when to stop. There are typically diminishing returns: a 4th search on the same subtopic rarely adds information the first 2 didn’t cover.
- How to reformulate. If a query returns irrelevant results, the model learns to broaden, narrow, use synonyms, or try different phrasings. Note: this depends on the search index, but it’s an emergent behavior of the RL training, not a hand-coded rule.
- How to decompose complex questions. Multi-hop questions get broken into sub-queries: the model learns decomposition strategies that produce better final answers for different question types. Again - emergent behavior.
- Explore vs. exploit. Should it keep digging into a productive line of inquiry or pivot to cover a different facet? Classic exploration-exploitation, and RL handles it naturally.
Conceptually, searching and reasoning across documents involves a ton of heuristics. Without RL, you’d be trying to encode those heuristics into prompts—which is brittle—rather than baking them into model weights, where they’re much more natural.
Instead of hand-coding decisions like “always do 3 search passes,” “try synonyms if recall is low,” or “stop if the last search returned nothing new,” these heuristics are learned from the specific corpus and real query patterns.
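For contrast, here's what hand-coding those heuristics looks like. Every constant below (three passes, the novelty cutoff) is a guess someone has to pick and maintain; an RL-trained policy learns the equivalent decisions from outcomes instead.

```typescript
// Hand-coded stopping heuristics: brittle constants standing in for
// what an RL-trained policy learns from whether answers were correct.
function shouldStop(
  passCount: number,
  lastResults: string[],
  alreadySeen: Set<string>
): boolean {
  if (passCount >= 3) return true; // "always do 3 search passes"
  const novel = lastResults.filter((r) => !alreadySeen.has(r));
  if (novel.length === 0) return true; // "stop if nothing new came back"
  return false;
}
```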
Deep research on internal data: what’s different
Now naturally, you may wonder: can we use the same underlying deep-research API for internal data?
Turns out, no—but the architecture is largely the same: agent loop, query decomposition, RL-trained policy, synthesis with citations.
Next, we’ll go into how the retrieval substrate and operating constraints are different in ways that matter a lot for implementation.
Internal retrieval vs. web retrieval
Web deep research calls a search engine and reads web pages. Internal docs require an indexed corpus that looks very different from web pages.
This changes what the agent loop looks like. On the web, the agent navigates: it follows links, reads related pages, discovers new sources through hyperlinks. Over a private corpus, the agent searches: it queries the index with different formulations, retrieves from different metadata-filtered subsets, and combines results across document boundaries.
Furthermore, web data is often self-contained and “complete.” Thanks to SEO incentives, web content is already well-groomed for information consumption. Internal data is often much, much messier.
One consequence: the search infrastructure for internal deep research needs to be optimized for an access pattern of many small, targeted queries rather than one big ranked list.
Domain vocabulary changes the RL problem
Web deep research models learn their language from the internet: the RL policy is trained on web-scale data, so it knows how to search for things the way the internet talks about them.
Internal corpora use different language. “GC” means general contractor in construction, not garbage collection. “COL” is a cost-of-living adjustment in HR documents, not a database column. “RFI” is a request for information in construction, not a radio frequency interference issue.
While LLMs benefit from transfer learning, you still hit gaps when running inference on an unfamiliar dataset—analogous to the performance degradation LLMs show in languages other than English.
The RL policy is the same way: for best performance, it needs to be trained on your specific corpus. The reward signal stays the same—answer quality—but the environment the policy navigates is fundamentally different.
A policy trained on web search will generate web-style queries that miss domain-specific terminology. A policy trained on your construction claims corpus will learn that searching for “change order” and “CO” should return the same documents.
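As a toy illustration of that gap, here is a hand-built synonym table (hypothetical, for a construction corpus) standing in for behavior the trained policy picks up implicitly:

```typescript
// Hypothetical domain-vocabulary map: the kind of equivalence an RL
// policy trained on a construction claims corpus learns on its own.
const constructionSynonyms: Record<string, string[]> = {
  "change order": ["CO"],
  "request for information": ["RFI"],
  "general contractor": ["GC"],
};

// Expand a query into domain-specific variants to search in parallel.
function expandQuery(query: string): string[] {
  const variants = [query];
  const lower = query.toLowerCase();
  for (const [term, abbrevs] of Object.entries(constructionSynonyms)) {
    if (lower.includes(term)) {
      for (const a of abbrevs) variants.push(lower.replace(term, a));
    }
  }
  return variants;
}
```

Maintaining tables like this by hand is exactly the brittleness RL training avoids.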
Completeness requirements are stricter
There’s a lot of redundant information on the web, which makes triangulation easier for deep research. If one source is missing, others usually state the same fact (perhaps slightly differently). The system can achieve high answer quality even with imperfect recall.
Internal corpora, unfortunately, do not often have this quality. A critical clause might appear in one paragraph of one contract—and that paragraph may be the only mention across 10,000 documents. If the retrieval system misses it, it’s omitted entirely—there’s no second source to help.
This shifts the RL reward function a bit. While web deep research optimizes for answer quality with an implicit assumption of source redundancy, internal deep research needs to optimize for coverage AND answer quality, which adds (needed) complexity.
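A sketch of what that combined objective could look like. The 0.7/0.3 weighting is an illustrative assumption, not a published recipe; the point is that reward now depends on retrieving the relevant documents, not just on the answer.

```typescript
// Combined reward: answer quality plus coverage of known-relevant
// documents (from labeled training questions). Weights are illustrative.
function reward(
  answerCorrect: boolean,
  retrievedIds: Set<string>,
  relevantIds: Set<string>
): number {
  const hits = [...relevantIds].filter((id) => retrievedIds.has(id)).length;
  const coverage = relevantIds.size === 0 ? 1 : hits / relevantIds.size;
  return (answerCorrect ? 1 : 0) * 0.7 + coverage * 0.3;
}
```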
Internal corpora are messier
Web pages are mostly clean HTML with well-defined structure (again, thank you SEO). Internal documents are different: PDFs with scanned pages, tables, multi-column layouts, headers and footers that repeat on every page, handwritten annotations, etc.
This matters because the search agent’s effectiveness is downstream of how well documents are parsed and indexed. On the web, a search query returns clean, self-contained pages—the agent can read them directly and move on.
With internal data, a single “document” might be a 200-page contract where the relevant clause is buried in a table on page 47. The agent needs to search across document boundaries, piece together information scattered across appendices and amendments, and reason over content that was never written to be easily found.
Architecture of an internal deep research API
Putting it together, an internal deep research system has four layers.
1. Retrieval substrate
The indexed corpus that the agent searches over.
This is the foundation. It needs to be fast (sub-200ms per query, since the agent issues many small, targeted queries rather than one big ranked list), accurate (parsing that preserves document structure rather than flattening it), and comprehensive (metadata filtering so the agent can scope searches to specific document types, date ranges, or sources).
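In practice this suggests a query interface along these lines. The shape below is hypothetical, not a specific product's API; it just shows the combination of query text, metadata scoping, and small result counts.

```typescript
// Hypothetical shape of a substrate query: small, targeted, and
// scoped by metadata rather than one broad ranked search.
interface SubstrateQuery {
  text: string;
  filters?: {
    docType?: string;
    dateFrom?: string; // ISO date
    dateTo?: string;
    source?: string;
  };
  topK: number;
}

const q: SubstrateQuery = {
  text: "liquidated damages cap",
  filters: { docType: "contract", dateFrom: "2022-01-01" },
  topK: 5, // many small queries, few results each
};
```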
2. The agent loop
The search-read-reason loop itself is effectively the same as web deep research. At each step, the agent issues a query, reads the returned passages, updates its working context, and decides what to do next. The differences are in what it searches over (the retrieval substrate) and how well it makes those decisions for your data (the RL-trained policy).
3. RL-trained policy
The function that makes those decisions. Trained via RL on your specific corpus, with answer quality and coverage as the reward signal. Over time, it learns things like:
- Which query formulations work for your documents’ vocabulary
- Which search strategies work for which question types
- When additional hops add value vs. when they’re redundant
- How to handle your corpus’s specific structure (e.g. cross-referencing between contracts and amendments or between claims and supporting documentation).
And this can be dynamic (improving over time). Every production query serves as a potential training signal.
4. Synthesis and citation
Finally, the accumulated evidence gets synthesized into a structured, cited response. Each claim traces back to an exact passage in an exact document. And before the final report gets returned, the agent also resolves conflicts where possible and flags them where it can’t.
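One plausible shape for that output, with hypothetical type names; the key property is that every claim carries at least one exact document-and-passage citation.

```typescript
// Hypothetical output types: each claim traces to exact passages,
// and unresolved conflicts are surfaced rather than hidden.
interface Citation {
  documentId: string;
  passage: string;
  page?: number;
}
interface Claim {
  text: string;
  citations: Citation[];
}
interface Report {
  synthesis: string;
  claims: Claim[];
  conflicts: string[];
}

// Invariant worth enforcing before returning a report to the caller.
function allClaimsCited(report: Report): boolean {
  return report.claims.every((c) => c.citations.length > 0);
}
```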
For the caller, this can look like a single API call (with Charcoal):
```typescript
import Charcoal from "charcoal";

const client = new Charcoal({ apiKey: "..." });

// search
const result = await client.search({
  objective: "Compare pricing strategies across companies in Q1 2022",
  context:
    "Which companies raised prices vs. absorbed inflation in Q1 2022? Include specific margin impacts and executive commentary from earnings calls.",
});

// report
console.log(result.synthesis);

// citations / findings
for (const r of result.results) {
  console.log(r);
}
```

When you need a deep research API (and when you don’t)
Not everything requires deep research. This is important to stress—because it’s tempting to over-engineer.
You probably don’t need deep research if your queries are simple lookups (“What’s our refund policy?”), your corpus fits in a context window (~100k tokens; 1M token windows still don’t work well for this), or your users just want relevant documents returned rather than synthesized answers.
You would benefit from a deep research API if any of these is true:
- Your corpus is fairly large, and your retrieval requirements can’t be met by shoving everything into the context window
- Your queries require reasoning across multiple documents: comparisons, trend analysis, exhaustive coverage
- Missing a relevant document is a real liability: legal discovery, compliance, insurance claims
- Your agents are doing multi-step analysis where each step depends on findings from the previous one (multi-hop retrieval)
- You’ve tuned your RAG pipeline and complex queries still fail—it likely means the architecture, not the configuration, is the bottleneck
Deep research is proven on the web. Your data is next.
Deep research has already changed how people do their jobs—synthesizing answers across dozens of sources in minutes instead of hours. The same approach applies to internal data. The retrieval substrate changes, the domain vocabulary changes, the completeness requirements tighten. But the core loop is the same.
Charcoal brings this to your internal data. We handle ingestion, train an RL-based retrieval policy on your specific corpus, and expose it through a single API. Getting started is straightforward: upload your documents and the search API is ready to query, with no chunking, rerankers, or pipeline assembly required.
Check out our docs, or talk to our engineering team if you’re building agents that need deep research over internal data.