AI Data Engineering | Sovereign Document Processing | Lexiane
Sovereign document processing pipeline: native Rust parsing, semantic chunking, PII filtering, GraphRAG extraction, SHA-256 audit chain. Zero cloud dependency.
Lexiane is an end-to-end document processing pipeline, designed for organizations that cannot leave their data in the hands of a third party. Ingestion, parsing, semantic chunking, personal data detection, enrichment, vector indexing, cryptographic audit: every step executes in a single binary, on your infrastructure, without any network call.
The problem your data poses to most AI solutions
RAG platforms and AI tools on the market present you with a structurally unfavorable choice: send your documents to a third-party vendor’s cloud, or forgo artificial intelligence.
This choice is presented as a technical trade-off. It is in reality a transfer of risk — legal, regulatory, strategic. Your internal procedures, your contracts, your patient data, your financial reports, your technical specifications: the moment they leave your perimeter, you lose control of what happens to them.
Lexiane starts from the opposite principle: the processing of your data happens where it is, with the guarantees you have defined — not those your vendor grants you.
A complete document processing pipeline, with no external dependencies
Native parsing of your document formats
The first link in quality data processing is the ability to read your documents as they are, in their production formats. Lexiane’s parser is written in pure Rust — no Python dependency, no third-party service, no network call.
Natively supported formats:
| Format | Typical use cases |
|---|---|
| | Reports, contracts, specifications, regulatory dossiers |
| Excel (.xlsx, .xls, .ods) | Data tables, budgets, inventories, reference repositories |
| PowerPoint (.pptx) | Presentations, training materials, strategic slides |
| HTML | Intranet pages, wiki exports, web documentation |
| Markdown | Technical documentation, knowledge bases, structured notes |
| Plain text | Notes, exported emails, logs, semi-structured data |
A single binary reads, parses, and indexes your documents. No Python interpreter to maintain, no secondary server to operate, no additional attack surface.
Semantic chunking with configurable granularity
The quality of document processing depends not only on what you read — it also depends on how you segment it. Poor chunking produces fragments that split ideas mid-sentence, separate a question from its answer, or break the coherence of a table.
Lexiane’s chunking engine operates with configurable precision:
- Size and overlap adapted to the nature of your corpus
- Respect for linguistic boundaries down to the Unicode grapheme — your documents in French, Arabic, Chinese, or Japanese are chunked correctly
- Parent-child hierarchy: each fragment retains a reference to its parent context, retrievable at generation to return the complete passage
- Recursive semantic chunking: the system respects the document structure — paragraphs, sections, lists — rather than mechanically counting characters
The result: fragments that make sense independently, precisely indexable, contextualizable at retrieval.
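As a rough illustration of size-and-overlap chunking, here is a std-only sketch — not Lexiane’s engine: it splits on `char` boundaries as an approximation of the grapheme-level boundaries described above, and it ignores document structure.

```rust
/// Sliding-window chunker: windows of at most `max_chars` characters,
/// each overlapping the previous one by `overlap` characters.
/// Splitting on `char` boundaries is a std-only approximation of
/// grapheme-aware splitting.
fn chunk(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < max_chars, "overlap must be smaller than the window");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // last window reached the end of the text
        }
        start = end - overlap; // next window re-reads `overlap` chars
    }
    chunks
}

fn main() {
    for c in chunk("Sovereign pipelines keep data on-site.", 16, 4) {
        println!("{c:?}");
    }
}
```

Because the window arithmetic works on `char` indices rather than bytes, accented or non-Latin text is never cut inside a code point; a production engine would go one level further, to grapheme clusters.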
Automatic enrichment before indexing
Each document fragment goes through an enrichment step before vectorization. The goal: increase retrieval quality by adding to each segment the metadata that makes it more precisely findable.
Applied enrichments:
- Token and word count of the segment
- Automatic extraction of representative keywords
- Segment summary for hybrid search
- Augmented content (parent document context injected into the chunk)
- Traceability identifiers (source document, position, content hash)
These enrichments are an integral part of the ingestion pipeline — they apply to each document from the first indexing, without any manual step.
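A minimal sketch of what per-segment enrichment can look like — word count, naive frequency-based keywords, and std’s `DefaultHasher` as a dependency-free stand-in for the SHA-256 content hash. The extraction logic here is an assumption for illustration, not Lexiane’s actual algorithm.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Enriched {
    word_count: usize,
    keywords: Vec<String>,
    content_hash: u64, // stand-in for the SHA-256 content hash
}

fn enrich(segment: &str) -> Enriched {
    let words: Vec<&str> = segment.split_whitespace().collect();
    // Naive keyword extraction: most frequent words longer than 3 chars.
    let mut freq: HashMap<String, usize> = HashMap::new();
    for w in &words {
        let w = w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        if w.len() > 3 {
            *freq.entry(w).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, usize)> = freq.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    let keywords = ranked.into_iter().take(3).map(|(w, _)| w).collect();
    let mut h = DefaultHasher::new();
    segment.hash(&mut h);
    Enriched { word_count: words.len(), keywords, content_hash: h.finish() }
}

fn main() {
    let e = enrich("The pipeline indexes the pipeline documents");
    println!("{} words, keywords: {:?}", e.word_count, e.keywords);
}
```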
Knowledge graph extraction (GraphRAG)
For corpora rich in relationships — regulatory documents, project archives, business knowledge bases, audit reports — vector search alone is not enough. It retrieves similar passages. It does not understand the links between the entities mentioned within them.
Lexiane’s GraphRAG engine automatically extracts knowledge triplets from your documents — subject, predicate, object — and stores them in a persistent RDF triplestore. The resulting base understands relationships between people, organizations, projects, dates, and regulations.
What this makes possible:
- “Which suppliers are mentioned in the 2023 audits AND in active contracts?”
- “Which projects are linked to this manager and to which regulation?”
- “Identify the dependency chains between components mentioned in these 500 technical datasheets.”
Multi-hop traversal of the graph produces information that vector search alone cannot structurally reach.
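A toy sketch of triplet storage and multi-hop traversal, with a std `HashMap` standing in for the persistent RDF triplestore (the entity names are invented for the example):

```rust
use std::collections::{HashMap, HashSet};

/// Adjacency map of subject -> [(predicate, object)] triplets.
struct TripleStore {
    edges: HashMap<String, Vec<(String, String)>>,
}

impl TripleStore {
    fn new() -> Self {
        Self { edges: HashMap::new() }
    }

    fn insert(&mut self, subject: &str, predicate: &str, object: &str) {
        self.edges
            .entry(subject.to_string())
            .or_default()
            .push((predicate.to_string(), object.to_string()));
    }

    /// Multi-hop traversal: every entity reachable from `start`
    /// within `hops` edges.
    fn reachable(&self, start: &str, hops: usize) -> HashSet<String> {
        let mut seen = HashSet::new();
        let mut frontier = vec![start.to_string()];
        for _ in 0..hops {
            let mut next = Vec::new();
            for node in frontier {
                for (_, object) in self.edges.get(&node).into_iter().flatten() {
                    if seen.insert(object.clone()) {
                        next.push(object.clone());
                    }
                }
            }
            frontier = next;
        }
        seen
    }
}

fn main() {
    let mut store = TripleStore::new();
    store.insert("AcmeCorp", "mentioned_in", "Audit2023");
    store.insert("Audit2023", "covers", "ContractX");
    // Two hops link the supplier to the contract; one hop does not.
    println!("{:?}", store.reachable("AcmeCorp", 2));
}
```

This is exactly the kind of link that similarity search misses: no fragment mentions both the supplier and the contract, yet the graph connects them in two hops.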
Personal data protection by architecture
PII filtering integrated into the pipeline
Lexiane’s PII (Personally Identifiable Information) filter operates before any vectorization, any indexing, and any language model call. No sensitive data reaches your vector store or your LLM without having been processed according to your rules.
Personal data detected:
| Data type | Examples |
|---|---|
| Email addresses | john.doe@company.com |
| Phone numbers | National and international formats |
| IBAN and bank details | FR76 1234 5678 9012 3456 7890 189 |
| Social security numbers | National and European formats |
| IP addresses | IPv4 and IPv6 |
| Configurable identifiers | According to your business reference framework |
Configurable processing policies:
- Typed masking — replacement with a semantic placeholder (`[EMAIL]`, `[IBAN]`, `[PHONE]`): the type of information remains readable, the value disappears
- Deletion — complete removal of the value from the fragment
- Hashing — replacement with the cryptographic fingerprint of the value: allows referential consistency without exposing the data
This architecture guarantees GDPR compliance by construction rather than by process: the data cannot reach the storage system before being processed. This is not a best-practice rule. It is a mechanical constraint of the pipeline.
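The three policies can be sketched in a few lines of Rust. The `'@'`-based email detection below is deliberately naive and for illustration only — a real filter covers phones, IBANs, IPs, and configurable identifiers — and std’s `DefaultHasher` stands in for a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// The three processing policies from the list above.
enum PiiPolicy {
    Mask,
    Delete,
    Hash,
}

/// Apply a policy to every token that looks like an email address.
fn apply_policy(text: &str, policy: &PiiPolicy) -> String {
    text.split_whitespace()
        .map(|tok| {
            if tok.contains('@') && tok.contains('.') {
                match policy {
                    PiiPolicy::Mask => "[EMAIL]".to_string(),
                    PiiPolicy::Delete => String::new(),
                    PiiPolicy::Hash => {
                        // Same value -> same fingerprint: referential
                        // consistency without exposing the data.
                        let mut h = DefaultHasher::new();
                        tok.hash(&mut h);
                        format!("[EMAIL:{:016x}]", h.finish())
                    }
                }
            } else {
                tok.to_string()
            }
        })
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // prints "Contact [EMAIL] today"
    println!("{}", apply_policy("Contact john.doe@company.com today", &PiiPolicy::Mask));
}
```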
Traceability and cryptographic audit at every step
A tamper-evident SHA-256 integrity chain
Every action in the processing pipeline is recorded in a cryptographic audit chain. Each entry is signed by the SHA-256 hash of the previous one — any subsequent modification of an event is mathematically detectable.
Events traced in the chain:
- Document submitted for ingestion (identifier, content hash, timestamp)
- Fragments created and their chunking parameters
- Embeddings generated (model, dimension, date)
- Entities extracted for the knowledge graph
- Personal data detected and policy applied
- User queries and documents consulted
- Responses produced and their sources
This is not a logging feature. It is a structural integrity guarantee: you can prove at any time that processing occurred exactly as recorded, and that no record was modified after the fact.
For an auditor, a compliance officer, or a regulator, this chain constitutes independent technical proof of your declared processes.
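The chaining principle fits in a few lines. In this sketch, std’s `DefaultHasher` stands in for SHA-256 to keep the example dependency-free; the mechanism — each entry’s hash covering the previous entry’s hash — is the same.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One audit entry. `prev` chains it to the entry before it, so any
/// later edit to a past event breaks every subsequent hash.
struct AuditEntry {
    event: String,
    prev: u64,
    hash: u64,
}

fn append(chain: &mut Vec<AuditEntry>, event: &str) {
    let prev = chain.last().map_or(0, |e| e.hash);
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    event.hash(&mut h);
    chain.push(AuditEntry { event: event.to_string(), prev, hash: h.finish() });
}

/// Recompute the whole chain; any tampered entry makes this fail.
fn verify(chain: &[AuditEntry]) -> bool {
    let mut prev = 0u64;
    for entry in chain {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        entry.event.hash(&mut h);
        if entry.prev != prev || entry.hash != h.finish() {
            return false;
        }
        prev = entry.hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, "document ingested");
    append(&mut chain, "3 fragments created");
    println!("valid: {}", verify(&chain)); // prints "valid: true"
    chain[0].event = "tampered".to_string();
    println!("valid: {}", verify(&chain)); // prints "valid: false"
}
```

Note what this property is: tamper-evidence, not tamper-proofing. An attacker can still alter a record, but cannot do so without every later hash failing verification.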
Output quality evaluation
Measured metrics, not assumptions
The quality of a document processing pipeline is not decreed at installation — it is measured in production, on your real data. Lexiane integrates RAGAS evaluation metrics at pipeline output:
- Faithfulness: is the produced response grounded in the retrieved sources?
- Relevance: do the retrieved sources actually answer the question asked?
- Context precision: are the retrieved fragments specifically relevant?
- Context recall: did the pipeline retrieve all available information?
Input guardrails detect prompt injection attempts and out-of-scope queries before they reach the pipeline. Output guardrails verify the produced response before transmission to the user.
The relevance gate evaluates the overall confidence score of the retrieved context. If the sources are not sufficiently reliable to produce a grounded response, the system abstains — rather than generating a poorly grounded response. This is the opposite of hallucination: a system that knows when it does not know. For cases requiring multiple retrieval iterations, Agentic RAG automates this process.
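The abstention logic of such a gate can be sketched in a few lines. Mean-score aggregation and the threshold are assumptions for illustration, not Lexiane’s actual scoring.

```rust
/// Returns the aggregate confidence if the retrieved context clears
/// the threshold, or `None` when the system should abstain.
fn relevance_gate(scores: &[f64], threshold: f64) -> Option<f64> {
    if scores.is_empty() {
        return None; // nothing retrieved: abstain
    }
    let mean = scores.iter().sum::<f64>() / scores.len() as f64;
    if mean >= threshold { Some(mean) } else { None }
}

fn main() {
    // Strong sources clear the gate and ground a response.
    println!("{:?}", relevance_gate(&[0.9, 0.8], 0.7));
    // Weak sources: abstain rather than hallucinate (prints "None").
    println!("{:?}", relevance_gate(&[0.2, 0.3], 0.7));
}
```

The design point is the `Option` return type: abstention is a first-class outcome the caller must handle, not an error path that can be ignored.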
Lexiane as an autonomous data processing pipeline
These capabilities are not reserved for conversational RAG use cases. Lexiane can be deployed as a pure data processing pipeline, independently of any generation interface:
- Mass document extraction and normalization on your existing archives
- PII detection and anonymization on a corpus before regulatory migration or archiving
- Knowledge graph construction from your reference documents
- Cryptographic audit of all your document flows
- Vector indexing of your base for semantic search without LLM
The same architectural rigor, the same audit trail, the same data protection — applied to your existing processing flows, without a conversational interface if you do not need one.
Three deployment modes, one pipeline
Air-gapped — absolute sovereignty
Parsing, chunking, enrichment, PII filtering, vector indexing, and graph construction: the entire pipeline executes locally in a single binary. Zero network calls. Zero outbound data. Deployable in a classified network, a sovereign datacenter, or an industrial site without permanent connectivity.
Cloud — maximum power
Cloud embedding models and LLMs (OpenAI, Anthropic) activated via environment variable. The pipeline remains identical — only the adapters change. If tomorrow you replace OpenAI with a self-hosted model, your processing pipeline does not change by a single line.
Hybrid — sensitive data on-site, generation in the cloud
Embeddings are computed locally on your documents. Generation is delegated to a cloud model only on anonymized context fragments. Your source documents never leave. The cloud LLM receives excerpts — not your files.
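As an illustration, selecting a mode in the single configuration file could look like this — every key name here is a hypothetical sketch, not Lexiane’s actual schema:

```toml
# Hypothetical configuration sketch -- key names are assumptions.
[pipeline]
mode = "hybrid"        # "air-gapped" | "cloud" | "hybrid"

[embeddings]
provider = "local"     # sensitive data stays on-site

[generation]
provider = "cloud"     # receives anonymized fragments only

[pii]
policy = "mask"        # "mask" | "delete" | "hash"
```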
Verifiable technical guarantees
| Guarantee | Enforcement mechanism |
|---|---|
| No risky memory operations in the core | #![forbid(unsafe_code)] enforced by the compiler — not by code review |
| No ignorable error paths | #[must_use] on all results — an ignored path is a compilation error |
| No unwrap() / panic!() in production | Guaranteed by continuous automated testing |
| Audit chain integrity | Chained SHA-256 — any modification is mathematically detectable |
| Validation of dependencies between stages | At assembly, before execution — configuration errors do not reach the runtime |
| Zero vendor dependencies in the certified core | Verified by automated test at compilation |
1,254 automated tests run continuously. 27 independent modules, each with its own compilation boundaries. 25 abstraction interfaces define all contact points between the core and the outside.
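To make the first two table rows concrete, here is what those compiler-enforced attributes look like in Rust — the function itself is invented for illustration:

```rust
#![forbid(unsafe_code)]   // any `unsafe` block anywhere is a compile error
#![deny(unused_must_use)] // silently dropping a flagged result is one too

#[must_use = "a parsing error must be handled, not ignored"]
fn parse_doc(raw: &str) -> Result<usize, String> {
    if raw.is_empty() {
        Err("empty document".to_string())
    } else {
        Ok(raw.len())
    }
}

fn main() {
    // A bare `parse_doc("x");` statement would not compile under
    // `deny(unused_must_use)`: the result has to be consumed.
    match parse_doc("quarterly report") {
        Ok(bytes) => println!("parsed {bytes} bytes"),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```

These are static guarantees: they hold for every build, independently of reviewer attention or test coverage.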
What your teams gain concretely
For your CISO
Every piece of data processed is traced. Every PII policy is applied mechanically, not by convention. The SHA-256 audit trail constitutes independent technical proof of your processing procedures — consultable, exportable, tamper-evident.
For your DPO
GDPR compliance is not a checkbox after deployment. It is inscribed in the architecture: personal data cannot reach your vector store or your LLM without having been processed according to your rules. The processing register is auditable from the cryptographic chain.
For your CTO
A single binary, no runtime, no package manager, no secondary server. The entire pipeline — parsing, chunking, PII, embeddings, indexing — deploys like any Linux binary. No 800 MB Docker image. No Python dependencies to maintain. One TOML configuration file. That is all.
For your compliance teams in regulated sectors
Lexiane is designed for certification. IEC 62304 Ed. 2 (publication expected August 2026) will introduce explicit requirements on AI/ML systems in medical devices; ISO 26262 sets comparable expectations in automotive. Lexiane is compilable with Ferrocene, the Rust compiler qualified ASIL D / SIL 4. Your qualification dossier traces from the deployed binary back to the compiler used to produce it.
Let’s talk about your document corpus.
Every data processing task has its own constraints: format, volume, sensitivity, sector regulation, auditability requirements. We do not offer generic demonstrations.
We offer an exchange on your concrete case: your documents, your constraints, your compliance questions. And an honest assessment of what Lexiane can do — including if the answer is “not yet” or “not like this.”
→ Contact us
Request access to the Auditable Core
Sign up to be notified when our Core audit programme opens. In accordance with our privacy policy, your professional email address will be used exclusively for this technical communication, with no subsequent marketing use. Access distributed via secure private registry.