
Agentic RAG | Sovereign Document Reasoning | Lexiane


A classic RAG answers. An agentic RAG reasons before answering — and knows how to recognize when it must search further before reaching a conclusion.

This apparently simple distinction fundamentally changes what a document system can accomplish. It shifts the boundary between questions an AI can answer reliably and those that structurally elude it. And it raises architectural questions — about control, traceability, certification — that most agentic implementations do not treat seriously.

Lexiane integrates an agentic layer designed for production: functional, controllable, and architecturally separated from the certified kernel.


The structural limitation of classic RAG

A linear RAG pipeline works in a single pass: the user’s question is vectorized, the most similar passages are retrieved, and the LLM generates a response from this context. For the majority of direct documentary queries, this model is effective and sufficient.

But it rests on an implicit assumption rarely articulated: that the first retrieval is sufficient to produce a reliable response.

This assumption holds for simple, well-formulated questions. It breaks down in three common situations.

The question is broader than what the initial retrieval can cover. “Summarize the decisions made on project X between January and March” calls for dozens of passages scattered across meeting notes, exported emails, and status reports. A retrieval by semantic similarity returns passages closest to the question’s formulation — not necessarily the most relevant ones over the entire period.

The question is ambiguous or imprecise. The user knows what they are looking for, but does not have the exact technical vocabulary that would allow vector search to target the right passages. The first retrieval returns partially relevant results, but not those that would actually answer the underlying question.

The response requires cross-referencing multiple sources. The relevant information is present in the corpus, but scattered across dozens of documents that are not semantically similar to each other. No single-pass retrieval can aggregate them.

In these three cases, classic RAG produces a response — but a response based on insufficient context. Without a mechanism for evaluating retrieval quality, the system does not know it is answering poorly. It answers with the same apparent confidence, whether it retrieved ten perfectly relevant passages or three vaguely related ones.

Agentic RAG solves this problem by introducing a reasoning loop between retrieval and generation.


What an agentic RAG really is — and what it is not

Before detailing the architecture, a clarification is needed. The term “agentic” is used very broadly in the AI sector, often to describe systems that are in reality predefined-step workflows — a hardcoded sequence of operations, executed sequentially, without real decision-making at each step.

A true agentic system is distinguished by a fundamental property: it makes contextual decisions at each iteration, based on an evaluation of the current state — and these decisions can diverge depending on the retrieved content, not only depending on the workflow structure.

It is not a chatbot. A chatbot maintains a conversational history and generates contextualized responses — but it does not search, evaluate, or decide to reformulate its query.

It is not an advanced search engine. A search engine returns results according to a ranking algorithm. It does not generate a response, does not evaluate whether results are sufficient, and does not make decisions about what comes next.

It is not a fixed-step workflow. A predefined workflow always executes the same operations in the same order. An agent can traverse different paths depending on what it finds — reformulate twice if the first retrieval is insufficient, call an external tool if the documentary context is incomplete, abstain if no path produces a reliable context.

Lexiane’s agentic RAG is an orchestrator of RAG pipelines within a reasoning loop. At each iteration, it executes a complete pipeline, evaluates the result, and makes a decision on what follows — according to configurable rules and deterministic guards.


The reasoning loop: anatomy of an iteration

Step 1 — Transformation and retrieval

Each iteration begins with a retrieval phase. The current query — which may be the reformulated initial question, a decomposed sub-question, or a question enriched by context from previous iterations — passes through the complete retrieval pipeline.

Retrieval is not a simple vector search. Lexiane implements the state of the art in production retrieval:

Query transformation. Before any search, the QueryTransformer can apply several strategies according to configuration:

  • Query expansion — enriching the question with synonyms, related terms, and reformulations to cover passages that do not use the same words as the question.
  • HyDE (Hypothetical Document Embeddings) — generating a hypothetical document that would answer the question, vectorizing this document, and using its embedding for search. This strategy significantly improves semantic search precision on abstract or technical questions.
  • Sub-question decomposition — breaking the initial question into more targeted questions, each addressing a specific dimension of the expected answer.
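
As an illustration of the HyDE strategy above, here is a minimal sketch. The `generate` and `embed` callables are hypothetical stand-ins for an LLM and an embedding backend — they are not Lexiane's actual API — and the prompt wording is only indicative:

```python
def hyde_embedding(question, generate, embed):
    """HyDE sketch: embed a hypothetical answer instead of the question.

    `generate` and `embed` are placeholders for your LLM and embedding
    backends (hypothetical signatures, not Lexiane's actual interfaces).
    """
    # Ask the model to draft a passage that *would* answer the question...
    hypothetical_doc = generate(
        f"Write a short passage that would answer: {question}"
    )
    # ...then use that passage's embedding as the search vector.
    return embed(hypothetical_doc)
```

Because the hypothetical passage uses the vocabulary of an answer rather than of a question, its embedding tends to land closer to relevant corpus passages than the raw question's embedding does.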

Multi-query retrieval with RRF. The MultiQueryRetrievalStage generates N variants of the query, executes an independent retrieval for each, and merges results by Reciprocal Rank Fusion. The RRF formula — score(d) = Σ 1/(k + rank_i(d)) — produces a consolidated ranking that favors documents appearing in good position across multiple independent lists, without being dominated by a single relevance signal.
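
The RRF formula fits in a few lines. The sketch below assumes each retrieval returns a best-first list of document identifiers; `k = 60` is a commonly used default for the smoothing constant:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    rankings: list of lists of document ids, each ordered best-first.
    Returns ids sorted by descending score(d) = sum_i 1 / (k + rank_i(d)),
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three independent retrievals over the same corpus:
fused = rrf_fuse([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
])
```

Note how `d2`, ranked first in two of the three lists, outranks `d1` even though `d1` tops the first list — the consolidated ranking rewards consistent placement across independent retrievals.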

Hybrid search. Retrieval systematically combines dense search (semantic vector similarity) and sparse search (BM25, lexical matching). Documents relevant by meaning and documents relevant by exact terms are both retrieved — then merged and reranked by a cross-encoder.

Step 2 — Evaluation of the retrieved context

Once retrieval is complete, the agent evaluates the quality of the obtained context according to several criteria:

Overall relevance. The relevance gate (RelevanceGateStage) computes an aggregated confidence score over the retrieved passages. This score reflects how well the context aligns with the question asked.

Thematic coverage. The agent evaluates whether the retrieved passages cover the dimensions of the question — or whether certain dimensions are absent from the current context. A question that calls for a comparison between two entities, only one of which is represented in the retrieved passages, has an incomplete context.

Internal consistency. Contradictory passages on the same fact are a signal that retrieval has brought back conflicting information — which requires either supplementary retrieval to arbitrate, or explicit signaling of the contradiction in the response.

Step 3 — Decision

Based on this evaluation, the agent makes one of three decisions:

Answer. The context is sufficiently relevant, complete, and consistent. The generation pipeline is triggered with the consolidated context from successive iterations. The produced response is grounded in traced, cited, and verifiable sources.

Reformulate and re-run. The context is insufficient or partial. The agent reformulates the query using information drawn from already-retrieved passages to orient the new search. This reformulation can take several forms: direct reformulation of the question, decomposition into a sub-question targeting the missing dimension, or reformulation by expansion toward vocabulary identified in partially relevant passages.

Call an external tool. The documentary context is intrinsically incomplete for this query — not because retrieval is imperfect, but because the information is simply not in the corpus. The agent can call a configured external tool to enrich the context: querying a real-time data API, executing a computation, accessing a relational database, or calling a specialized service.
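
The three-way decision can be sketched as a simple rule over the context evaluation. All names and thresholds below are illustrative, not Lexiane's actual schema:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Decision(Enum):
    ANSWER = auto()
    REFORMULATE = auto()
    CALL_TOOL = auto()

@dataclass
class ContextEvaluation:
    relevance: float      # aggregated relevance-gate score, 0..1
    coverage: float       # fraction of question dimensions covered, 0..1
    info_in_corpus: bool  # does the corpus plausibly hold the answer?

def decide(ev, min_relevance=0.7, min_coverage=0.8):
    """Illustrative decision rule for one iteration (hypothetical thresholds)."""
    if ev.relevance >= min_relevance and ev.coverage >= min_coverage:
        return Decision.ANSWER          # context is good enough: generate
    if not ev.info_in_corpus:
        return Decision.CALL_TOOL       # corpus cannot answer: go external
    return Decision.REFORMULATE         # corpus can answer: search better
```

In practice the evaluation itself may come from an LLM judgment, but keeping the final rule as explicit code makes the decision boundary inspectable.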

Step 4 — Loop control

Deterministic guards frame each iteration and can interrupt the loop independently of the LLM’s behavior:

  • Maximum number of iterations — the loop stops after N cycles, regardless of results obtained.
  • Maximum duration — a global wall-clock limit on the agentic session.
  • Minimum relevance score — if context does not reach the required threshold after several reformulations, the system abstains rather than generating a poorly grounded response.
  • Security conditions — input and output guardrails operate at each iteration. A prompt injection detected at iteration N interrupts the loop at that point.

These guards are configurable, explicit, and inspectable rules. They do not depend on the LLM’s internal confidence threshold — whose calibration is opaque and variable across models.
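
The interplay of these guards can be sketched as plain orchestrator code. The pipeline, evaluation, and reformulation callables are placeholders; the point is that the stopping conditions live outside the model:

```python
import time

def run_agentic_loop(question, run_pipeline, evaluate, reformulate,
                     max_iterations=4, max_seconds=30.0, min_relevance=0.6):
    """Guard sketch: the loop stops on iteration count, wall-clock deadline,
    or a relevance floor — regardless of what the LLM "wants" to do next.
    All callables are hypothetical stand-ins for the real pipeline stages."""
    deadline = time.monotonic() + max_seconds
    query = question
    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            break                            # global time guard
        context = run_pipeline(query)
        if evaluate(context) >= min_relevance:
            return context                   # good enough: hand off to generation
        query = reformulate(context)         # otherwise try a new formulation
    return None                              # abstain rather than answer poorly
```

Returning `None` here models abstention: no iteration reached the relevance floor, so the system declines to generate instead of producing a poorly grounded response.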


The architectural decision that changes everything

The agentic layer outside the certified kernel

The most important architectural decision in Lexiane’s agentic module is not in what it does — it is in where it lives.

The agentic reasoning loop is not in the certified kernel. It orchestrates the kernel from outside, via its public interfaces, exactly as a human user would orchestrate pipelines manually — but at the speed of a program.

This separation is not an implementation detail. It is the principle that makes the system simultaneously capable and auditable.

Why the agentic loop cannot be in the certified kernel. Lexiane’s kernel executes deterministic pipelines. Given the same inputs twice, it produces the same outputs twice. This is a fundamental property of a certifiable kernel — without it, tests prove nothing and an audit can verify nothing.

The agentic loop is non-deterministic by nature. The LLM that decides to reformulate or answer is not an automaton — its decisions depend on the current context, its temperature, the session history. Two sessions with the same initial question may take different paths and still arrive at equivalent responses.

Putting non-deterministic behavior in a certified kernel would make it uncertifiable. Lexiane separates them: the kernel remains deterministic, certifiable, auditable. The agentic layer remains non-deterministic, but bounded and controlled.

What this separation guarantees concretely.

The pipelines executed by the agent are exactly the same as those of classic mode — same stages, same ports, same assembly validation logic, same audit trail. The agent has access to no kernel functionality that is not exposed via its public interfaces.

Each pipeline triggered by the agent — each loop iteration — produces its own records in the SHA-256 chain. The complete sequence of decisions is reconstructible: why the agent reformulated at iteration 2, which passages it retrieved at iteration 3, why it finally decided to answer at iteration 4.
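
The principle of a hash-chained audit record can be sketched as follows. This is an illustrative structure, not Lexiane's actual record schema: each entry stores the SHA-256 of the previous entry, so modifying any earlier record breaks every subsequent hash.

```python
import hashlib
import json

def append_record(chain, record):
    """Append an iteration record to a hash-chained audit log (sketch)."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)          # canonical form
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"prev": prev_hash, "record": record, "hash": entry_hash})
    return chain

def verify_chain(chain):
    """Recompute every hash; any tampering invalidates the chain."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

# One record per loop iteration (illustrative fields):
chain = []
append_record(chain, {"iteration": 1, "decision": "reformulate", "score": 0.41})
append_record(chain, {"iteration": 2, "decision": "answer", "score": 0.83})
```

With one such record per iteration, the full decision sequence — reformulations, retrieved passages, final answer — remains reconstructible and tamper-evident after the fact.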

Non-deterministic behavior is contained in the agentic layer and bounded by deterministic guards. It cannot contaminate the kernel or alter its security properties.


What agentic RAG makes possible

Analysis of complex and voluminous dossiers

A tender response dossier, a regulatory market authorization dossier, a legal dispute file — these documentary sets are voluminous, heterogeneous, and require cross-referencing information scattered across dozens or hundreds of documents.

Agentic RAG can automatically decompose an analysis request into sub-questions, process them iteratively, and synthesize the results into a structured response. A question like “Identify the contractual risks in this supplier dossier” becomes a series of targeted searches: penalty clauses, termination conditions, service level commitments, litigation history — each treated as a distinct iteration, whose results are consolidated before the final synthesis.

Cross-referencing contradictory sources

Two reports on the same incident that diverge on the facts. Two versions of a regulatory procedure that contradict each other on a critical point. A standard and its implementing decree that are not perfectly consistent.

A classic pipeline chooses one context or the other based on vector proximity. The agent can identify the contradiction, retrieve both contexts in parallel, and formulate a response that explicitly signals the divergence — with precise references to the source documents of each version. This is a fundamental qualitative property for contexts where a response that masks a contradiction is worse than no response.

Large-scale extraction and aggregation

Extracting all contractual deadline dates from a corpus of 500 contracts. Identifying all equipment mentioned in 10,000 maintenance sheets with their last intervention date. Listing all decisions made in a steering committee on a given topic over 24 months.

These tasks require many passes of targeted retrieval and aggregation that single-pass generation cannot reliably produce on a complete corpus. The agent can iteratively process sub-sets of the corpus, consolidate partial results, and produce a coherent aggregated result.
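
The consolidation pattern is essentially a map-reduce over corpus subsets. In the sketch below, `extract` stands in for one targeted retrieval-plus-extraction pipeline run; the batch size and result shape are illustrative:

```python
def extract_and_aggregate(corpus, extract, batch_size=50):
    """Iterative extraction sketch: process the corpus in batches and
    consolidate partial results, instead of one single-pass generation."""
    results = {}
    for start in range(0, len(corpus), batch_size):
        batch = corpus[start:start + batch_size]
        # Each batch yields a partial mapping, e.g. {"deadline": {...}}.
        for key, values in extract(batch).items():
            results.setdefault(key, set()).update(values)  # merge partials
    return results
```

Each iteration stays small enough for reliable retrieval and generation, while the aggregation step produces the corpus-wide result that no single pass could.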

Knowledge graph traversal

In GraphRAG configuration, agentic RAG has an additional tool: multi-hop traversal of the knowledge graph extracted from documents. Complex relational questions — “What are the links between this project, its suppliers, and documented quality incidents?” — can be resolved by a combination of vector retrieval and RDF graph traversal, each iteration enriching context from a different angle.
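
Multi-hop traversal itself is conceptually simple. The sketch below runs a breadth-first search over subject–predicate–object triples (a toy in-memory stand-in for an RDF store), returning every entity reachable within a bounded number of hops:

```python
from collections import deque

def multi_hop(triples, start, max_hops=2):
    """BFS sketch over (subject, predicate, object) triples.
    Edges are followed in both directions, as an undirected neighborhood."""
    neighbors = {}
    for s, _predicate, o in triples:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue                     # hop budget exhausted on this branch
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen - {start}
```

In a real GraphRAG iteration, the entities found this way would seed further vector retrieval — each hop enriching the context from a different relational angle.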

Conversational sessions with reasoning memory

The Lexiane server maintains persistent conversational sessions. In an agentic context, this memory goes beyond a simple exchange history: the agent can leverage the consolidated context from previous questions to orient its retrieval on subsequent questions. A dossier analysis session can extend across multiple exchanges, each building on the reasoning of previous exchanges — without the user needing to recontextualize with each question.


When to use agentic RAG — and when not to

Agentic RAG is not universally superior to classic RAG. It is more powerful for certain tasks, more costly for all of them, and introduces additional operational complexity. The right tool depends on the nature of the queries.

| Criterion | Classic RAG | Agentic RAG |
| --- | --- | --- |
| Direct, well-formulated questions | Optimal | Overkill |
| Ambiguous or imprecise questions | Variable results | Significant improvement |
| Multiple sources to cross-reference | Partial results | Significant improvement |
| Corpus < 10,000 well-structured documents | Sufficient | Optional |
| Large, heterogeneous corpus | May miss passages | Recommended |
| Large-scale extraction and aggregation | Difficult in a single pass | Designed for this |
| Strict latency constraint (< 2 s) | Appropriate | Inappropriate (multiple iterations) |
| Certified environment, deterministic behavior | Certifiable | Not certifiable (agentic layer) |
| Constrained token budget | Economical | Multiplied consumption |

The practical rule: if your users predominantly ask direct questions on well-defined topics, classic RAG with multi-query retrieval covers the essential needs. If your use cases regularly involve complex analyses, multi-source cross-referencing, or large-scale extractions — agentic RAG is the appropriate mode.

Both modes coexist in Lexiane and use exactly the same underlying pipelines. Switching from one to the other is a configuration decision by query type, not a system migration.


Human control in the agentic loop

The question of human control over agentic systems is central — both for AI governance teams and for regulatory frameworks like the AI Act. A system that reasons autonomously must be observable, interruptible, and auditable.

Observability of each iteration

Each iteration of the agentic loop is recorded in the SHA-256 audit chain: question asked, reformulation strategy chosen, passages retrieved, decision made (answer / reformulate / tool), relevance score evaluated. The complete reasoning sequence is consultable after the fact — not only the final response.

This audit granularity allows a supervisor to understand why the system took a given path — and to identify cases where reasoning was suboptimal, to adjust loop parameters.

Deterministic guards as a control mechanism

The guards framing the agentic loop are not LLM parameters. They are configurable rules applied by the orchestrator code, independently of the language model’s decisions. Even if the LLM “decides” to keep reformulating, guards can interrupt the loop.

These guards represent the policy your organization has defined for system usage: maximum number of iterations, maximum duration, minimum relevance threshold required to trigger generation. They are the materialization of human control within the loop.

Resource consumption tracking

Token consumption statistics (UsageStats) are accumulated over the entire agentic session and accessible after execution. In cloud configuration, this data enables monitoring and budgeting API consumption of a multi-iteration reasoning session — and detecting abnormally long or costly sessions.

Feedback loop

The FeedbackStore port enables users to evaluate responses produced by the agentic system. This feedback feeds a register exploitable for continuous improvement: identification of query types where agentic reasoning is insufficient, domains where retrieval quality is low, cases where automatic reformulation worsens results rather than improving them.


Performance and cost considerations

Agentic RAG consumes more resources than a classic pipeline — by definition, since it executes multiple pipelines where classic mode executes one. This reality must be integrated into deployment design.

Token consumption. Each loop iteration generates embeddings for reformulation, retrieves passages, and solicits the LLM for decision-making and possibly generation. On a cloud model, this translates into multiplied API costs compared to a classic pipeline. Iteration limit guards are the primary mechanism for controlling these costs.

Latency. The response time of an agentic session is the sum of response times for each iteration. A three-iteration session takes three times longer than a classic pipeline, plus the inter-iteration evaluation overhead. Agentic RAG is not suitable for use cases that impose a response latency of less than a few seconds.

Cost control strategies in production.

Complexity-based routing. Lexiane’s QueryRouter port enables classifying each query and routing it to the appropriate mode — classic for direct questions, agentic for complex questions. This routing significantly reduces average consumption, by reserving agentic mode for queries that genuinely need it.
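
A routing decision can be as simple as a classifier over the incoming query. The sketch below uses a naive keyword heuristic purely for illustration — a production QueryRouter would typically use a trained classifier or an LLM judgment, and the marker list is entirely hypothetical:

```python
def route_query(query,
                complex_markers=("compare", "all ", "between", "summarize")):
    """Naive complexity-router sketch: flag multi-source or aggregation
    intent as "agentic", everything else as "classic"."""
    q = query.lower()
    return "agentic" if any(marker in q for marker in complex_markers) else "classic"
```

Even this crude split captures the economics: direct lookups stay on the cheap single-pass path, while only genuinely complex queries pay the multi-iteration cost.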

Lightweight decision model. The decision to reformulate or answer can be entrusted to a less powerful (and less costly) language model than the generation model. Only the final generation solicits the top-quality model — intermediate iterations use an economical decision model.

Semantic cache. The SemanticCache port enables caching responses to queries semantically close to previous queries. A question already processed — or a very similar question — does not trigger a new agentic session: the response is returned directly from the cache.
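
The mechanism behind a semantic cache can be sketched in a few lines: store (embedding, answer) pairs and return a stored answer when a new query's embedding is close enough. The class name, threshold, and linear scan below are illustrative, not Lexiane's SemanticCache implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Semantic-cache sketch: hit when a new query embedding is within
    `threshold` cosine similarity of a previously answered one."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        best_answer, best_sim = None, 0.0
        for stored_embedding, answer in self.entries:
            sim = cosine(embedding, stored_embedding)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

A production cache would use an approximate nearest-neighbor index rather than a linear scan, but the cost argument is the same: a near-duplicate question costs one embedding lookup instead of a full agentic session.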


Frequently asked questions

How does Lexiane determine that a reformulation is better than the previous one? Evaluation of retrieved context quality rests on the relevance gate score (RelevanceGateStage) and coverage metrics. The decision to reformulate is made when this score is below the configured threshold. The reformulation strategy — expansion, decomposition, HyDE — is determined by the agentic layer configuration and analysis of the partial context retrieved.

Can the agent modify data or trigger actions in external systems? Only actions explicitly configured as available tools. The agentic module has no access to functionalities not defined in its configuration. Available tools, their parameters, and their permissions are defined at assembly — not dynamically by the LLM. The agent cannot self-assign capabilities.

How do we guarantee that the agent does not head in an undesired direction on sensitive questions? Input and output guardrails operate at each iteration. A sensitive query is blocked by the InputGuardrail upon detection — not only on the initial question, but on each reformulation produced by the agent. A response that crosses content policies is intercepted by the OutputGuardrail before transmission. Deterministic iteration limit guards bound the duration of any reasoning.

Is agentic RAG compatible with private RAG (air-gapped)? Yes. In air-gapped configuration, the agentic loop executes entirely locally — with the local LLM (Mistral.rs) as the decision engine. The main constraint is the reasoning capability of the local model: a 7B-13B model is competent for most documentary reformulation decisions, but may show limitations on very complex reasoning. The hybrid configuration — agentic decision delegated to a cloud LLM on anonymized fragments — offers a compromise between reasoning power and source data sovereignty.

Can agentic RAG be limited to certain users or certain types of queries? Yes. Complexity-based routing (QueryRouter) enables activating agentic mode selectively — according to user profile, query type, or interrogated document collection. A standard user can be routed to the classic pipeline, while a senior analyst has access to agentic mode for their complex queries.

How do we debug an agentic session whose result is unsatisfactory? The audit trail records each iteration with its parameters: reformulated question, transformation strategy used, passages retrieved, relevance score evaluated, decision made. Complete reconstruction of the reasoning is possible from this chain — which enables precisely identifying the iteration where reasoning diverged and adjusting parameters accordingly.


Let’s talk about your complex use cases.

Agentic RAG brings the most value on specific documentary needs: large and heterogeneous corpora, questions spanning many sources, large-scale extraction and aggregation tasks. These needs vary significantly across organizations.

We offer an exchange on your concrete use cases — the questions your teams ask today that your document system answers poorly, the complex analyses that still require manual intervention, the corpora that resist classic search. And an honest assessment of what agentic RAG can bring — including if classic mode with multi-query retrieval covers most of your need at lower cost.

What you can expect:

  • A response within 48 business hours
  • A technical contact who knows agentic use cases in production and their real limitations
  • A configuration recommendation calibrated to your need — classic, agentic, or hybrid according to query types

→ Contact us

No commercial commitment. A conversation about your use cases.

Request access to the Auditable Core

Sign up to be notified when our Core audit programme opens. In accordance with our privacy policy, your professional email address will be used exclusively for this technical communication, with no subsequent marketing use. Access distributed via secure private registry.
