Can the Mistral.rs models be replaced by specialised models?

Yes. The LLMEngine port is an abstraction interface: any model compatible with Mistral.rs can be used, including fine-tuned or domain-specific models, without any modification to the pipeline. The swap is done at configuration level.

Private RAG & Local AI Assistant | On-Premise LLM

Organizations handling sensitive data face an apparent contradiction: the most capable artificial intelligence systems assume a connection to cloud services, while their regulatory, operational, or strategic constraints require that their data remain on-premise. Most solutions propose resolving this contradiction through a contract — a confidentiality clause, a data non-use commitment, a compliance label.

Lexiane resolves it through architecture.

In private RAG mode, the entire document processing — parsing, chunking, vectorization, LLM inference, storage, retrieval, generation — executes in a single binary, on your infrastructure, without a single byte of your documents crossing your network perimeter. This is not a policy. It is a physical property of the system.

What “private” really means

The AI market has progressively diluted the meaning of the word “private.” It is useful to re-establish precise distinctions, as they have concrete legal, operational, and regulatory consequences.

Cloud solution with confidentiality commitments. Your data is processed on a third-party provider’s infrastructure — in their geographic zone, on their servers, by their models. The provider contractually commits not to use your data to train their models, to retain it in a defined region, to delete it on request. The guarantee rests on trust in contract compliance, on the ability of audit to detect a violation, and on the stability of terms of use over time.

On-premise solution with external inference calls. Infrastructure is in your datacenter. But the language model is hosted on an external API — OpenAI, Anthropic, or a cloud service from your solution provider. Your documents are chunked and vectorized locally, but context fragments are transmitted to the external LLM at each generation. Data does not reside with a third party, but it transits there with each query.

True air-gapped — no outbound flows. Infrastructure is within your perimeter. The language model runs within your perimeter. Embeddings are computed within your perimeter. Vector storage is within your perimeter. There is no outbound network call — not because a firewall blocks it, but because the system structurally makes none. Your data cannot leave your perimeter, even in the event of a firewall misconfiguration.

Lexiane’s private RAG is the third category. The guarantee is not contractual. It is architectural — and verifiable.

The complete local stack

A truly private RAG requires that every pipeline component has a local implementation. It is not sufficient to store data locally if inference calls an external service. It is not sufficient to have a local LLM if embeddings are computed via a cloud API. Lexiane is the only RAG engine that compiles the entire stack into a single binary.

Local LLM inference — Mistral.rs

Mistral.rs is a high-performance LLM inference engine written in Rust, compiled directly into the Lexiane binary. There is no parallel Ollama process, no separate vLLM server, no internal HTTP call — inference is in the binary, on the same footing as the rest of the pipeline.

Mistral.rs supports models from the Mistral family, LLaMA 3, Phi, and their quantized derivatives (GGUF, GGML). Quantization enables running 7B to 13B parameter models on servers without a dedicated GPU — with generation performance satisfactory for the majority of documentary use cases. With an NVIDIA or AMD GPU, the same models reach generation throughputs comparable to standard cloud APIs.

The choice of model is a configuration decision, not a code decision. Changing the local model does not modify the pipeline — it modifies the TOML file and the model files loaded at startup.

Embedding computation — Candle

Candle is Hugging Face’s machine learning framework, written in Rust, compiled into the same binary. It generates vector representations of documents and queries entirely locally. Embedding models — E5-multilingual, BAAI/bge, or any compatible model — are loaded from the local filesystem.

Local embedding generation has an operational advantage often overlooked: it is deterministic and stable. Cloud embedding models can be updated unilaterally by their provider, which invalidates previously computed embeddings and requires complete corpus reindexing. With Candle, the model is fixed in your infrastructure — it evolves when you decide, not when your provider publishes an update.

Native document parsing

Lexiane’s document parser is pure Rust. It calls no Python library, no external service, no secondary process. PDF, Excel (.xlsx, .xls, .ods), PowerPoint (.pptx), HTML, Markdown, plain text: all formats are processed in the same binary, by the same process, without network.

Local vector storage

Three local vector storage options according to volume constraints and existing infrastructure:

SQLite — for embedded deployments or moderate-sized corpora. Zero additional infrastructure, zero network latency, zero administration. The vector store is a file on your filesystem.

pgvector — PostgreSQL extension for organizations that already have a PostgreSQL cluster within their perimeter. The vector index coexists with your relational data in the same cluster — one infrastructure to administer, back up, and audit.

Qdrant — dedicated vector database for large corpora requiring indexing and retrieval performance optimized at scale. Deployed in your infrastructure, it remains within your perimeter.

Local hybrid search — Tantivy

The Tantivy sparse index (BM25) is embedded in the binary. Hybrid search — dense vector by semantic similarity, and sparse lexical by term matching — executes entirely locally. No external search infrastructure (Elasticsearch, OpenSearch) is required.

Fusion of the two modalities by Reciprocal Rank Fusion ensures that neither semantic matches nor exact lexical matches are missed — without network calls.

What you actually deploy

A static binary Linux. A TOML configuration file. Pre-downloaded model files. That is all.

No Python interpreter. No package manager. No virtual machine. No secondary process. No service discovery. No container registry to contact. The system is operational in a completely isolated network, without any internet access, from the first startup.

Layers of data protection

Local data residency is necessary but not sufficient. Lexiane adds several layers of protection that operate on data even within the local perimeter — against internal leaks, unauthorized access, and system behaviors that could expose sensitive information.

PII filtering before any vectorization

The personal data filter operates in first position in the ingestion pipeline — before semantic chunking, before embedding computation, before indexing. Personal data detected in your documents is processed according to policies you define by category:

Category	Example	Available policy
Email addresses	`john.doe@company.com`	Masking `[EMAIL]` · Deletion · Hashing
Phone numbers	`+1 555 234 5678`	Masking `[PHONE]` · Deletion · Hashing
IBAN	`DE89 3704 0044 0532 0130 00`	Masking `[IBAN]` · Deletion · Hashing
Social security numbers	`1 85 04 75 123 456 78`	Masking `[SSN]` · Deletion · Hashing
IP addresses	`192.168.1.42`	Masking `[IP]` · Deletion · Hashing

Typed masking preserves the type of information — which maintains the document’s semantic coherence for search — while rendering the value inaccessible in the vector store, in generated responses, and in logs.

The policy applied is recorded in the audit trail for each document processed.

Document access control before generation

In a deployment shared among multiple teams or multiple sensitivity levels, the question of who can access what arises at the retrieval level — not only at the interface level.

The AccessControl port filters retrieval results according to the requesting user’s rights before context is transmitted to the LLM. A document a user does not have access to is not transmitted as generation context — even if it is present in the vector store and semantically relevant to the query.

This position in the pipeline is critical: access control applied only at the user interface leaves confidential documents passing through the language model. An LLM that has received a document in its context can reveal its content indirectly, even if the response does not seem to directly reference it. Lexiane cuts this vector upstream.

Two access control models are supported:

RBAC — rights are defined by the user’s role in the organization
ABAC — rights are defined by document attributes: classification level, owning department, publication date, project scope

SHA-256 audit trail — under your control

The cryptographic audit chain records every pipeline action in your infrastructure — not in an external logging service, not with a third-party provider. The register belongs entirely to you.

Each entry is signed by the SHA-256 hash of the previous one. Any retrospective modification breaks the chain and is mathematically detectable. In the event of an incident — unauthorized access, out-of-scope query, injection attempt — complete forensic reconstruction is possible from the chain: who accessed what, at what time, with what result.

Input and output guardrails

Mechanisms protecting against prompt injection (InputGuardrail) and validating responses (OutputGuardrail) operate entirely locally. A malicious request is blocked before soliciting the local LLM. A response that would incorporate sensitive data or exceed the defined scope is intercepted before reaching the user. None of this processing requires a network call.

Who private RAG is for

Defense and intelligence

Defense and intelligence organizations operate in environments where data confidentiality is not relative — it is absolute. A classified document that transits through a cloud service, even momentarily, even encrypted, constitutes a potential violation of operational security rules. The question is not whether the provider is trustworthy. It is that the transit itself is unacceptable.

Lexiane deploys in a completely isolated network — SCIF, classified network, sovereign datacenter — without any connectivity requirement. Analysts query their sensitive document corpora with the capabilities of a production LLM, without a single piece of data crossing the security perimeter. The SHA-256 audit trail records every access with cryptographic traceability satisfying the most stringent traceability requirements.

Healthcare and medical devices

Health data is subject to the strictest protection regulations — GDPR, HDS (Health Data Hosting) certification in France, European health data directives. These regulations impose not only data localization but certification of hosting providers and processing operations.

A healthcare establishment or medical device manufacturer wishing to deploy a documentary assistant on patient records, clinical trial data, or pharmacovigilance documents cannot rely on a non-HDS-certified cloud API. Lexiane’s private RAG processes this data entirely locally — in your infrastructure, under your processing responsibility, without an interposed third-party provider.

The certification dimension is also relevant: IEC 62304 Ed. 2, whose publication is expected for August 2026, will introduce requirements on software embedding AI. Lexiane is the only RAG engine designed to meet this certification standard — with a #![forbid(unsafe_code)] kernel and Ferrocene compatibility.

Finance and central banks

Financial institutions are subject to data localization obligations, decision traceability requirements, and operational resilience mandates — GDPR, DORA, national prudential regulations. Entrusting the processing of sensitive internal documents to an external cloud LLM is not merely a matter of preference: it is often a regulatory compliance matter whose non-respect engages management liability.

Lexiane’s private RAG enables deploying a documentary assistant on regulatory corpora, internal procedures, risk reports, credit dossiers — entirely locally, with cryptographic traceability of every access, and PII filtering that protects customer personal data before any vectorization.

Public sector and administrations

Public administrations face growing requirements for digital sovereignty — NIS2, GDPR, orientations toward SecNumCloud-qualified solutions. Processing citizen data, sensitive documents, or information subject to professional secrecy on foreign cloud infrastructure raises legal and strategic questions that administrations can no longer ignore.

An air-gapped deployment of Lexiane responds to these requirements by nature: there are no data flows to a third-party provider, no dependency on cloud infrastructure, no risk of data transfer outside national territory. Digital sovereignty is not a declared policy — it is a physical property of the deployment.

Industry and embedded systems

Industrial environments share with classified environments a structural constraint: the frequent absence of permanent network connectivity. An isolated production site, an offshore platform, equipment embedded in a vehicle or aircraft — these systems cannot depend on a cloud API to function.

Lexiane runs as a static binary without network dependencies. It can respond to queries about technical manuals, maintenance procedures, product knowledge bases — in a vehicle, on a production line, on isolated industrial equipment. Its absence of a garbage collector guarantees deterministic temporal behavior, compatible with real-time system requirements.

What private RAG changes for your teams

For your CISO

The attack surface related to data processing is reduced to your physical perimeter. There are no outbound data flows to monitor, no external API to audit, no third-party provider whose security policy must be verified. The risk map for the AI system is delimited by your existing infrastructure.

For your DPO

GDPR compliance does not rest on a contract with a processing subcontractor. It is guaranteed by architecture: personal data cannot leave your perimeter. The processing register reduces to your own systems — no transfer declaration, no Article 28 agreement with a cloud AI provider, no risk of non-EU transfer linked to inference.

For your auditors

Proof of data confidentiality is architectural, not contractual. An auditor can verify, by inspection of the system configuration, that no external network adapter is activated. The SHA-256 audit chain proves that every document was processed according to defined policies. PII filtering is recorded for every ingested document.

For your CTO

A single binary to deploy, maintain, and audit. No separate inference stack, no external embeddings service, no synchronization pipeline between distributed components. The reduction in operational complexity is directly proportional to the reduction in attack surface.

What you give up choosing private RAG — and how to respond

Every architectural decision has trade-offs. Transparency about these trade-offs is necessary for an informed choice.

The reasoning capability of the best cloud models. GPT-4o, Claude Opus, Gemini Ultra: leading models from major providers offer reasoning capabilities that local 7B-13B models do not achieve for all tasks. For direct documentary questions, summaries, structured extractions — local models are fully competent. For complex reasoning tasks or synthesis of very long document chains, the difference can be perceptible. For these complex analyses, Agentic RAG offers a local alternative: by multiplying targeted retrieval passes, it partially compensates for the reasoning gap without resorting to a cloud model.

Response: Lexiane’s hybrid configuration allows retaining embeddings and storage locally — source data never leaves — while delegating generation to a cloud LLM on anonymized context fragments. Your raw documents remain in your perimeter. The cloud LLM receives excerpts.

Generation speed without a dedicated GPU. A quantized 7B LLM on CPU generates between 5 and 15 tokens per second depending on hardware — perceptible on long responses, acceptable for standard documentary queries. With an NVIDIA or AMD GPU, the same model reaches 40 to 80 tokens per second.

Response: For deployments where generation latency is critical, a GPU is recommended. For asynchronous use cases — batch extraction, corpus analysis, deferred generation — CPU is sufficient.

Model updates. Cloud models are updated automatically by providers — which regularly brings performance improvements. Local models evolve when you decide to update them — which is an operational constraint, but also a guarantee of behavioral stability.

Response: The open-source model ecosystem (Mistral, LLaMA, Phi) is progressing rapidly. Updating a local model translates into a file replacement and service restart — without pipeline modification, without corpus reindexing.

Deploying your private RAG

The air-gapped reference configuration

Lexiane comes with a complete, compilable air-gapped reference configuration — a real project, not a documentation example. This configuration includes the reference TOML file, documented environment variables, explicitly listed dependencies, and model pre-download instructions.

Migrating from cloud to private RAG

Lexiane’s à la carte architecture makes this migration structurally simple. If you started with a cloud configuration — OpenAI for embeddings and generation — migrating to private RAG translates into replacing cloud adapters with their local equivalents in the configuration file. The pipeline does not change. The business logic does not change.

The only substantial operation: recompute embeddings for your corpus with the local model, since OpenAI embeddings and Candle embeddings are not comparable. This reindexing is a plannable operation, without service interruption on the cloud version during the transition.

Hardware requirements

Configuration	CPU	RAM	GPU	Use case
Embedded / edge	4 ARM64 cores	8 GB	No	Corpus < 10,000 documents, occasional queries
Server without GPU	8 x86_64 cores	32 GB	No	Medium corpus, asynchronous generation acceptable
Server with GPU	8 x86_64 cores	32 GB	NVIDIA 16 GB VRAM	Large corpus, real-time generation
Existing infrastructure	Your PostgreSQL cluster	—	Depending on load	pgvector integrated into your stack

Frequently asked questions

Can we guarantee that no logs or telemetry leave the perimeter? Lexiane embeds no telemetry mechanism. There is no home call, no usage metric collection, no error reporting to an external service. Application logs pass through the tracing framework — configurable, filterable, and directed to your internal collection systems. No data emission toward the outside is possible in air-gapped configuration.

Can Mistral.rs models be replaced with proprietary or specialized models? Yes. The LLMEngine port is an abstraction interface. Any model compatible with the formats supported by Mistral.rs can be used. If your organization has trained or fine-tuned a specialized model on your domain — law, medicine, engineering — it can replace the default model without pipeline modification.

How are security updates for models managed in an air-gapped environment? Models are static files loaded at startup. An update translates into a file replacement on your infrastructure — plannable, reversible, without external connectivity. For updates to the Lexiane binary itself, the process is identical: transferring the binary via the secure channels of your software update policy.

Does private RAG support response streaming? Yes. The integrated HTTP server exposes an SSE (Server-Sent Events) interface that transmits responses token by token — including in local inference mode. The user experience is comparable to a cloud API in terms of perceived fluidity.

How can Lexiane be integrated in an air-gapped environment that does not allow unsigned binaries? Lexiane can be compiled from its source code within your own build chain, in your perimeter, with your qualified toolchain — including Ferrocene if your certification policy requires it. The produced binary is signed by your own code signing infrastructure, according to your internal policies.

Can Lexiane be used as a pure data processing pipeline, without a conversational interface? Yes. Lexiane can be deployed without a generation interface — only for ingestion, PII filtering, vector indexing, and knowledge graph construction. The processing pipeline is independent of the generation layer. This is the appropriate mode for building a structured documentary base, before deciding how to query it.

Let’s talk about your perimeter.

Every private RAG deployment has its specific constraints: data classification, applicable compliance framework, existing infrastructure, document volume, performance requirements. We do not offer a standard configuration for constraints that are not standard.

We offer an exchange on your concrete environment — your data, your infrastructure, your regulatory obligations — and the private RAG configuration that corresponds to them.

What you can expect:

A response within 48 business hours
A technical contact who knows the constraints of air-gapped environments, regulated sectors, and software certification
An honest assessment of the fit between your need and Lexiane’s private RAG — including if the hybrid configuration is more relevant for your case

→ Contact us

Private RAG & Local AI Assistant | On-Premise LLM | Lexiane