On-Premise AI Infrastructure | Single Binary Deployment | Lexiane
On-premise AI in a single static Rust binary. No Python, no Docker, no package manager. Air-gapped, cloud, or hybrid deployment. Kubernetes-ready.
Deploying an artificial intelligence system in production is not only a question of the model. It is a question of infrastructure: how the binary installs, what dependencies it introduces, what attack surface it creates, what resources it consumes, how it behaves under load, and how it integrates into your existing environment.
Most AI frameworks answer these questions after the fact — performance, security, and operability are constraints to manage once the tool is chosen. Lexiane starts from the opposite principle: infrastructure properties are architectural decisions, made upfront, and non-negotiable in production.
The infrastructure problem that Python structurally imposes
The dominant RAG frameworks — LangChain, LlamaIndex, Haystack — are written in Python. Their functional richness is real. But the infrastructure choices they impose are structural constraints that no version bump, optimization, or best practice can erase.
The interpreter and its environment. Python requires an interpreter, a package manager, a virtual environment, and a set of system dependencies. A minimal Docker image for a Python RAG stack is between 500 MB and 1 GB. The cold start of a Python service with its dependencies is measured in seconds. In an air-gapped environment, installing pip is an operational security problem — not a development inconvenience.
The GIL and concurrency. Python’s Global Interpreter Lock serializes parallel operations at the interpreter level. A Python RAG service processing multiple simultaneous requests introduces contention that neither threads nor processes resolve cleanly — threads block each other, processes multiply memory consumption.
Pipeline latency. Published measurements place the overhead of a Python RAG pipeline between 10 and 22 ms per call, excluding inference time. On high-volume workloads, this friction accumulates and becomes an architectural bottleneck.
Absence of temporal determinism. Python delegates memory management to a garbage collector. In production, unpredictable GC pauses — from tens to hundreds of milliseconds — appear under load. In an embedded system, a real-time system, or any system subject to a latency SLA, this behavior is unacceptable.
Rust eliminates each of these constraints by design. Lexiane inherits the resulting guarantees as fundamental properties — not as optimizations bolted on afterward.
One binary: what this means concretely
Lexiane compiles into a self-contained static binary. A single executable file contains the entire system: the pipeline engine, the HTTP server, the SSE streaming interface, the document parser, the local embeddings engine (Candle), the local LLM inference engine (Mistral.rs), the in-memory vector database, and the configuration interface.
This choice is not an implementation constraint. It is an architectural decision with direct operational consequences.
Trivial deployment. Copy a binary and a configuration file: that is all a Lexiane deployment requires. No package manager, no virtual machine, no secondary process to orchestrate. The system is operational within seconds.
Minimal attack surface. Every external dependency is a potential attack vector. A static binary with no dynamic dependencies does not expose a surface linked to shared system libraries, versioned packages, or ancillary processes. The attack surface is delimited by the binary itself — auditable, fixed, reproducible.
Deployment reproducibility. The same binary, compiled from the same source code with the same compiler, produces the same behavior regardless of the target environment. There is no drift due to different dependency versions across development, test, and production environments.
Minimal footprint. Without a Python runtime, without a virtual machine, without an execution environment, the memory footprint at startup is a fraction of an equivalent Python stack. Docker images built around the Lexiane binary are on the order of a few tens of megabytes — not a few hundred.
Three deployment modes, one architecture
Lexiane’s à la carte architecture adapts to your existing infrastructure — not the reverse. Three configurations cover all operational constraints encountered in production environments.
Air-gapped — total sovereignty, zero network dependency
In its air-gapped configuration, Lexiane makes no outbound network calls. The entire pipeline executes locally: document parsing by the native Rust parser, embedding computation by Candle, LLM inference by Mistral.rs, vector storage in local infrastructure.
This mode is designed for environments where network connectivity is architecturally absent — classified networks, SCIFs, sovereign datacenters, isolated industrial sites, embedded systems. It is not a degraded mode: it is the reference mode for which Lexiane was first architected — and the foundation of sovereign private RAG.
Infrastructure requirements:
- A Linux server (x86_64 or ARM64) with resources appropriate to the document volume
- An optional GPU for local inference acceleration (a significant gain in throughput, but not required)
- Models pre-downloaded at deployment time — no downloading at runtime
What you deploy: a binary + a TOML configuration file + model files. Nothing else.
Cloud — integration with the best models on the market
In cloud configuration, Lexiane activates its external adapters via environment variables: OpenAI or Anthropic for generation, OpenAI for embeddings, Qdrant or pgvector for vector storage, Cohere for reranking.
The processing pipeline remains identical to the air-gapped configuration. Only the adapters change. If you replace OpenAI with Anthropic, or migrate from pgvector to Qdrant, the pipeline does not change by a single line — the configuration changes, not the code.
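As a sketch of what that looks like in practice, swapping a provider amounts to editing a few lines of configuration. The key names below are illustrative assumptions, not Lexiane's documented schema:

```toml
# Illustrative sketch only: key names are assumptions, not Lexiane's actual schema.
[generation]
provider = "openai"        # change to "anthropic" without touching pipeline code
model    = "gpt-4o"

[embeddings]
provider = "openai"
model    = "text-embedding-3"

[vector_store]
provider = "qdrant"        # or "pgvector", on an existing PostgreSQL cluster
url      = "http://localhost:6334"

[reranking]
provider = "cohere"
```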
This property is fundamental for operational resilience: dependency on a specific provider is localized in an adapter, not diffused across the entire system. A provider that changes its pricing, deprecates a model, or experiences a service interruption is replaceable without refactoring.
Supported integrations:
| Role | Providers |
|---|---|
| Generation (LLM) | OpenAI (GPT-4o, GPT-4) · Anthropic (Claude) |
| Embeddings | OpenAI (text-embedding-3) |
| Vector store | Qdrant · pgvector (PostgreSQL) |
| Reranking | Cohere |
| Sparse search | Tantivy (BM25, local) |
Hybrid — data sovereignty, cloud power
The hybrid configuration is the most common in environments that combine data confidentiality constraints and generation quality requirements.
The principle: sensitive operations — parsing, embedding computation, vector storage — execute locally. Text generation is delegated to a cloud model, but only on extracted context fragments — never on source documents. Your files do not leave your perimeter. The cloud LLM receives excerpts, not your document base.
This partitioning is made possible by Lexiane’s stage-based architecture: each stage is an independent component that can run locally or be delegated to a remote provider. The boundary between local and cloud is a configuration line — not a code boundary.
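A hedged illustration of that boundary, with field names that are assumptions rather than the actual configuration schema: every stage except generation points at a local component.

```toml
# Illustrative sketch only: field names are assumptions.
[parsing]
provider = "native"        # local Rust parser; source documents never leave the host

[embeddings]
provider = "candle"        # embeddings computed locally

[vector_store]
provider = "sqlite"        # vectors stored locally

[generation]
provider = "anthropic"     # only extracted context fragments cross this boundary
```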
The internal infrastructure: what Lexiane natively embeds
HTTP server and SSE streaming
Lexiane embeds a native HTTP server built on Axum, the Rust web framework maintained by the Tokio project. This server exposes a standard REST API for document ingestion and querying, as well as an SSE streaming (Server-Sent Events) interface for real-time token-by-token response delivery.
No external web server is necessary. Lexiane can be placed directly behind a reverse proxy (Nginx, Caddy, Traefik) without additional configuration. It also operates in library mode — integrated into an existing Rust program, without HTTP server.
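For readers unfamiliar with SSE, the wire format is simple and fully specified. The sketch below shows standard SSE framing in Rust; it is generic SSE per the WHATWG specification, not Lexiane's internal code.

```rust
/// Frame a token as a Server-Sent Events `data:` event.
/// Per the SSE specification, a multi-line payload becomes one `data:` line
/// per line of text, and a blank line terminates the event.
fn sse_frame(token: &str) -> String {
    token
        .lines()
        .map(|l| format!("data: {}\n", l))
        .collect::<String>()
        + "\n"
}

fn main() {
    // A single token is one `data:` line followed by the blank terminator.
    assert_eq!(sse_frame("Hello"), "data: Hello\n\n");
    // A payload containing a newline becomes two `data:` lines in one event.
    assert_eq!(sse_frame("a\nb"), "data: a\ndata: b\n\n");
}
```

Any HTTP client that understands this framing (a browser `EventSource`, `curl -N`, or a custom consumer) can read the token stream.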
Storage layer
Vector storage. Three options according to deployment and infrastructure constraints:
- SQLite in-memory — for embedded deployments or prototypes. Zero infrastructure, zero network latency, zero administration.
- pgvector — PostgreSQL extension for environments that already have PostgreSQL infrastructure. The vector index coexists with your relational data in the same cluster.
- Qdrant — dedicated vector database for large corpora requiring optimized indexing and retrieval performance.
MetadataStore. A persistent SQLite register records all ingested documents, their collections, their metadata, and operation history. Schema migrations are versioned and applied automatically at startup — without manual intervention, without service interruption.
GraphStore. For deployments activating GraphRAG mode, Oxigraph ensures persistence of the RDF triplestore. The knowledge graph extracted from documents is stored durably, queryable via SPARQL, and incrementally enriched with each new ingestion.
Cache and performance
The LRU cache decorator (CachedEmbeddingModel, CachedLLMEngine) can be activated on any embeddings or generation adapter. It maintains an in-memory cache of recent queries — particularly effective for frequent query embeddings and responses to recurring questions.
This decorator is not coupled to a specific adapter: it wraps any compatible component, locally as well as in the cloud. Activating or deactivating the cache does not touch the rest of the pipeline.
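The decorator principle can be sketched in a few lines of Rust. Only the name CachedEmbeddingModel comes from the text above; the trait, signatures, and eviction policy (omitted here for brevity) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Stand-in for an embeddings adapter port (illustrative, not Lexiane's API).
trait EmbeddingModel {
    fn embed(&mut self, text: &str) -> Vec<f32>;
}

/// A toy model that counts its invocations, so the cache's effect is observable.
struct CountingModel { calls: usize }

impl EmbeddingModel for CountingModel {
    fn embed(&mut self, text: &str) -> Vec<f32> {
        self.calls += 1;
        vec![text.len() as f32] // placeholder "embedding"
    }
}

/// Decorator: wraps any EmbeddingModel and memoizes results.
/// Real LRU eviction is omitted; this shows only the wrapping principle.
struct CachedEmbeddingModel<M: EmbeddingModel> {
    inner: M,
    cache: HashMap<String, Vec<f32>>,
}

impl<M: EmbeddingModel> EmbeddingModel for CachedEmbeddingModel<M> {
    fn embed(&mut self, text: &str) -> Vec<f32> {
        if let Some(hit) = self.cache.get(text) {
            return hit.clone();
        }
        let v = self.inner.embed(text);
        self.cache.insert(text.to_string(), v.clone());
        v
    }
}

fn main() {
    let mut cached = CachedEmbeddingModel {
        inner: CountingModel { calls: 0 },
        cache: HashMap::new(),
    };
    cached.embed("hello");
    cached.embed("hello"); // served from cache
    assert_eq!(cached.inner.calls, 1);
}
```

Because the decorator only depends on the trait, it wraps a local model or a cloud adapter identically, which is the decoupling the paragraph above describes.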
Hybrid search and reranking
The Tantivy sparse index (BM25) is natively embedded in the binary. Hybrid search — dense vector and sparse lexical — executes locally without additional infrastructure. Results from both search modalities are merged by Reciprocal Rank Fusion before reranking.
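Reciprocal Rank Fusion itself is a published, simple formula: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k = 60 the value commonly used in the literature. A minimal sketch, independent of Lexiane's internals:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over two ranked result lists.
/// score(d) = sum over lists of 1 / (k + rank_d), with rank starting at 1.
fn rrf(dense: &[&str], sparse: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [dense, sparse] {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let dense = ["a", "b", "c"];  // dense vector ranking
    let sparse = ["b", "d"];      // BM25 lexical ranking
    let fused = rrf(&dense, &sparse, 60.0);
    // "b" appears in both lists, so 1/62 + 1/61 outweighs "a"'s single 1/61.
    assert_eq!(fused[0].0, "b");
}
```

The fused list is what then goes to the reranking stage.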
The reranking cross-encoder (Cohere in cloud configuration, or a local model in air-gapped configuration) reclassifies results by actual semantic relevance — beyond raw vector similarity.
Observability and operations
Instrumentation without coupling
The lifecycle hook system (PipelineHooks) enables instrumenting each pipeline stage without modifying its code. Four observation points are exposed: on_stage_start, on_stage_complete, on_stage_error, on_pipeline_complete. Each callback receives the stage name, its status, its duration, and structured metadata.
These hooks can feed any external monitoring system: Prometheus, Datadog, OpenTelemetry, Grafana, or an internal supervision system. Observability is a pipeline property, not a dependency on specific monitoring infrastructure.
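A hedged sketch of what such a hook trait could look like: the four method names come from the text above, but the signatures, the payloads, and the TimingHooks example are assumptions.

```rust
use std::cell::RefCell;
use std::time::Duration;

/// Illustrative lifecycle-hook trait (method names from the text;
/// signatures are assumptions, not Lexiane's actual API).
trait PipelineHooks {
    fn on_stage_start(&self, _stage: &str) {}
    fn on_stage_complete(&self, _stage: &str, _elapsed: Duration) {}
    fn on_stage_error(&self, _stage: &str, _error: &str) {}
    fn on_pipeline_complete(&self, _total: Duration) {}
}

/// Example hook: accumulates per-stage timings, which could then be
/// exported to Prometheus, OpenTelemetry, or any other backend.
struct TimingHooks {
    timings: RefCell<Vec<(String, Duration)>>,
}

impl PipelineHooks for TimingHooks {
    fn on_stage_complete(&self, stage: &str, elapsed: Duration) {
        self.timings.borrow_mut().push((stage.to_string(), elapsed));
    }
}

fn main() {
    let hooks = TimingHooks { timings: RefCell::new(Vec::new()) };
    hooks.on_stage_start("retrieval");
    hooks.on_stage_complete("retrieval", Duration::from_millis(12));
    assert_eq!(hooks.timings.borrow().len(), 1);
}
```

Default empty method bodies mean an implementation only overrides the observation points it cares about; the pipeline code never changes.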
Execution metrics
PipelineMetrics and StageMetrics expose aggregated timing data after each execution: duration of each stage, total pipeline duration, identification of stages in error. This data enables detecting performance regressions, identifying bottlenecks, and tracking how system behavior evolves over time.
Token consumption tracking (UsageStats) is accumulated in the pipeline context. In cloud configuration, this data enables monitoring and budgeting API consumption in real time — per query, per document collection, or over defined time windows.
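As an illustration of the accumulation principle (the field names below are assumptions, not Lexiane's actual UsageStats API):

```rust
/// Illustrative token-usage accumulator in the spirit of UsageStats.
#[derive(Default, Debug)]
struct UsageStats {
    prompt_tokens: u64,
    completion_tokens: u64,
}

impl UsageStats {
    /// Record the token counts of one LLM call into the running totals.
    fn record(&mut self, prompt: u64, completion: u64) {
        self.prompt_tokens += prompt;
        self.completion_tokens += completion;
    }

    fn total(&self) -> u64 {
        self.prompt_tokens + self.completion_tokens
    }
}

fn main() {
    let mut usage = UsageStats::default();
    usage.record(1200, 350); // one query
    usage.record(900, 280);  // another query in the same context
    assert_eq!(usage.total(), 2730);
}
```

Totals accumulated this way can be bucketed per query, per collection, or per time window to drive API budgeting in cloud configuration.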
Healthcheck
The HealthCheck port exposes a system state verification endpoint — interoperable with container orchestrators (Kubernetes, Nomad) and load balancers. The endpoint returns the state of each active component: vector store, LLM, embeddings model. A degraded component is detectable before it affects users.
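The aggregation logic behind such an endpoint is simple: the system is only healthy when every active component is. A hedged sketch, with names that are assumptions:

```rust
/// Illustrative component health aggregation (names are assumptions,
/// not Lexiane's actual HealthCheck port).
#[derive(PartialEq, Debug, Clone, Copy)]
enum ComponentHealth {
    Healthy,
    Degraded,
}

/// The overall status is Healthy only if every component reports Healthy.
fn overall(components: &[(&str, ComponentHealth)]) -> ComponentHealth {
    if components.iter().all(|(_, h)| *h == ComponentHealth::Healthy) {
        ComponentHealth::Healthy
    } else {
        ComponentHealth::Degraded
    }
}

fn main() {
    let report = [
        ("vector_store", ComponentHealth::Healthy),
        ("llm", ComponentHealth::Healthy),
        ("embeddings", ComponentHealth::Degraded),
    ];
    // One degraded component is enough to fail a readiness probe.
    assert_eq!(overall(&report), ComponentHealth::Degraded);
}
```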
Performance characteristics
Lexiane’s performance properties are direct consequences of the choice of Rust — not one-off optimizations.
No garbage collector. Rust manages memory through ownership and borrowing, resolved at compile time. There is no GC pause, no unpredictable latency spike under load, no non-deterministic temporal behavior. What is measured in load testing is what occurs in production.
Native concurrency. Rust’s concurrency model, based on async/await with the Tokio runtime, enables processing a large number of simultaneous requests without interpreter-level contention. Each request is processed in an independent asynchronous task — without GIL, without thread overhead.
Predictable memory footprint. Without runtime or execution environment, Lexiane’s memory consumption is deterministic and stable. It does not grow unpredictably under prolonged load.
Published measurements on Python-to-Rust migrations in production illustrate the order of magnitude of the gains typically observed: Discord reduced CPU consumption by 80% by migrating a service from Python to Rust. Cloudflare obtained 25% additional throughput with a 50% reduction in CPU load. Dropbox cut server load by a factor of four on its critical paths. These figures are specific to their systems — but they illustrate a magnitude consistent with the structural properties of the language.
Integration into your existing infrastructure
Standard REST API
Lexiane exposes a documented REST API for document ingestion and querying. It is interoperable with any HTTP client — web applications, internal tools, integration systems, automation scripts. No proprietary SDK is required.
Library mode
Lexiane can be integrated directly as a library into an existing Rust program. In this mode, there is no HTTP server: the pipeline is instantiated and called programmatically, within the same process. This mode is suited to deep integrations in existing systems where the REST API would introduce unnecessary indirection.
Conversational sessions
The Lexiane server maintains persistent conversational sessions with history. Each session preserves the context of previous exchanges — subsequent responses can take into account previous questions without the client having to retransmit the complete history. In Agentic RAG mode, these sessions also incorporate the memory of successive reasoning iterations.
Four reference configurations ready to deploy
To accelerate integration, Lexiane comes with four operational reference configurations. Each is a complete compilable project — not a documentation example.
| Configuration | Use case |
|---|---|
| Air-gapped | Isolated network, no external connectivity, 100% local inference |
| Cloud | Maximum performance, OpenAI/Anthropic/Cohere integration, pgvector or Qdrant storage |
| Hybrid | Local embeddings, cloud generation, local storage — source data sovereignty |
| Generalist | Flexible starting point, adaptable to any context |
Each configuration includes the reference TOML file, documented environment variables, and explicitly listed dependencies.
Frequently asked questions from infrastructure and DevOps teams
Can Lexiane be deployed in a Docker container? Yes. The Lexiane static binary produces a Docker image of a few tens of megabytes — compared to the 500 MB to 1 GB typical of an equivalent Python stack. The image contains only the binary and configuration files. It is compatible with all standard container orchestrators.
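A typical multi-stage Dockerfile for a static Rust binary follows the pattern below. Binary name, paths, and target triple are assumptions to adapt to your build:

```dockerfile
# Illustrative sketch only: binary name, paths, and target are assumptions.
# Build stage: compile a fully static binary against musl.
FROM rust:1 AS builder
WORKDIR /src
COPY . .
RUN rustup target add x86_64-unknown-linux-musl \
 && cargo build --release --target x86_64-unknown-linux-musl

# Runtime stage: nothing but the binary and its configuration.
FROM scratch
COPY --from=builder /src/target/x86_64-unknown-linux-musl/release/lexiane /lexiane
COPY config.toml /config.toml
ENTRYPOINT ["/lexiane"]
```

Building from `scratch` is what keeps the final image down to the binary plus configuration, with no shell, package manager, or system libraries inside.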
Does Lexiane support Kubernetes? Yes. The healthcheck endpoint is interoperable with Kubernetes readiness and liveness probes. Deployment as a Deployment or StatefulSet (depending on storage mode) is standard. Configuration via environment variables is natively compatible with Kubernetes ConfigMaps and Secrets.
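A hedged probe sketch follows; the path and port are assumptions, to be replaced by whatever endpoint your Lexiane configuration actually exposes:

```yaml
# Illustrative Kubernetes probe configuration (path and port are assumptions).
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
```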
What is the minimum hardware configuration for a production deployment? In cloud mode (embeddings and inference offloaded), a server with 4 cores and 8 GB RAM supports significant document volumes. In air-gapped mode with local inference, requirements depend on the chosen LLM model — quantized 7B models run on CPU with 16 GB of RAM; a GPU is recommended for satisfactory generation throughput.
How can updates be managed without service interruption? Database schema migrations are versioned and applied at startup — without manual intervention. For binary updates, blue-green or canary deployment is applicable in the standard way: the new binary initializes, applies any migrations, and enters service. Compatibility between consecutive versions is maintained.
Can Lexiane be deployed on ARM? Yes. The Lexiane binary compiles natively for x86_64 and ARM64. Deployments on ARM infrastructure — Graviton servers (AWS), Ampere, embedded hardware — are supported without modification.
Can the TOML configuration be managed by an IaC tool? Yes. Lexiane’s configuration is a static TOML file, versionable in Git, and manipulable by any infrastructure as code tool — Terraform, Ansible, Helm, or custom deployment scripts. There is no distributed configuration in an external database.
Let’s talk about your infrastructure.
Every deployment has its own constraints: network security policy, container orchestrator, secret management policy, high availability requirements, integration with existing systems. We do not offer a generic configuration for constraints that are not generic.
We offer a technical exchange on your environment, your operational constraints, and the Lexiane configuration that corresponds to them.
What you can expect:
- A response within 48 business hours
- A technical contact who knows the deployment constraints of regulated environments and production architectures
- An honest assessment of the fit — including if specific adapter development is necessary
→ Contact us
Request access to the Auditable Core
Sign up to be notified when our Core audit programme opens. In accordance with our privacy policy, your professional email address will be used exclusively for this technical communication, with no subsequent marketing use. Access distributed via secure private registry.
Contact us