AI Data Engineering | Sovereign Document Processing | Lexiane
Sovereign document processing pipeline: native Rust parsing, semantic chunking, PII filtering, GraphRAG extraction, SHA-256 audit chain. Zero cloud dependency.
Lexiane is an end-to-end document processing pipeline, designed for organizations that cannot leave their data in the hands of a third party. Ingestion, parsing, semantic chunking, personal data detection, enrichment, vector indexing, cryptographic audit: every step executes in a single binary, on your infrastructure, without any network call.
The problem your data poses to most AI solutions
RAG platforms and AI tools on the market present you with a structurally unfavorable choice: send your documents to a third-party vendor’s cloud, or forgo artificial intelligence.
This choice is presented as a technical trade-off. It is in reality a transfer of risk — legal, regulatory, strategic. Your internal procedures, your contracts, your patient data, your financial reports, your technical specifications: the moment they leave your perimeter, you lose control of what happens to them.
Lexiane starts from the opposite principle: the processing of your data happens where it is, with the guarantees you have defined — not those your vendor grants you.
A complete document processing pipeline, with no external dependencies
Native parsing of your document formats
The first link in quality data processing is the ability to read your documents as they are, in their production formats. Lexiane’s parser is written in pure Rust — no Python dependency, no third-party service, no network call.
Natively supported formats:
| Format | Typical use cases |
|---|---|
| | Reports, contracts, specifications, regulatory dossiers |
| Excel (.xlsx, .xls, .ods) | Data tables, budgets, inventories, reference repositories |
| PowerPoint (.pptx) | Presentations, training materials, strategic slides |
| HTML | Intranet pages, wiki exports, web documentation |
| Markdown | Technical documentation, knowledge bases, structured notes |
| Plain text | Notes, exported emails, logs, semi-structured data |
A single binary reads, parses, and indexes your documents. No Python interpreter to maintain, no secondary server to operate, no additional attack surface.
Semantic chunking with configurable granularity
The quality of document processing depends not only on what you read — it also depends on how you segment it. Poor chunking produces fragments that split ideas mid-sentence, separate a question from its answer, or break the coherence of a table.
Lexiane’s chunking engine operates with configurable precision:
- Size and overlap adapted to the nature of your corpus
- Respect for linguistic boundaries down to the Unicode grapheme — your documents in French, Arabic, Chinese, or Japanese are chunked correctly
- Parent-child hierarchy: each fragment retains a reference to its parent context, retrievable at generation to return the complete passage
- Recursive semantic chunking: the system respects the document structure — paragraphs, sections, lists — rather than mechanically counting characters
The result: fragments that make sense independently, precisely indexable, contextualizable at retrieval.
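As a rough illustration of size-and-overlap chunking, here is a std-only sketch — not Lexiane’s engine: it splits on `char` boundaries as an approximation of the grapheme-level boundaries described above, and it ignores document structure.

```rust
/// Sliding-window chunker: windows of at most `max_chars` characters,
/// each overlapping the previous one by `overlap` characters.
/// Splitting on `char` boundaries is a std-only approximation of
/// grapheme-aware splitting.
fn chunk(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < max_chars, "overlap must be smaller than the window");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // last window reached the end of the text
        }
        start = end - overlap; // next window re-reads `overlap` chars
    }
    chunks
}

fn main() {
    for c in chunk("Sovereign pipelines keep data on-site.", 16, 4) {
        println!("{c:?}");
    }
}
```

Because the window arithmetic works on `char` indices rather than bytes, accented or non-Latin text is never cut inside a code point; a production engine would go one level further, to grapheme clusters.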
Automatic enrichment before indexing
Each document fragment goes through an enrichment step before vectorization. The goal: increase retrieval quality by adding to each segment the metadata that makes it more precisely findable.
Applied enrichments:
- Token and word count of the segment
- Automatic extraction of representative keywords
- Segment summary for hybrid search
- Augmented content (parent document context injected into the chunk)
- Traceability identifiers (source document, position, content hash)
These enrichments are an integral part of the ingestion pipeline — they apply to each document from the first indexing, without any manual step.
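A minimal sketch of what per-segment enrichment can look like — word count, naive frequency-based keywords, and std’s `DefaultHasher` as a dependency-free stand-in for the SHA-256 content hash. The extraction logic here is an assumption for illustration, not Lexiane’s actual algorithm.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Enriched {
    word_count: usize,
    keywords: Vec<String>,
    content_hash: u64, // stand-in for the SHA-256 content hash
}

fn enrich(segment: &str) -> Enriched {
    let words: Vec<&str> = segment.split_whitespace().collect();
    // Naive keyword extraction: most frequent words longer than 3 chars.
    let mut freq: HashMap<String, usize> = HashMap::new();
    for w in &words {
        let w = w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        if w.len() > 3 {
            *freq.entry(w).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, usize)> = freq.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    let keywords = ranked.into_iter().take(3).map(|(w, _)| w).collect();
    let mut h = DefaultHasher::new();
    segment.hash(&mut h);
    Enriched { word_count: words.len(), keywords, content_hash: h.finish() }
}

fn main() {
    let e = enrich("The pipeline indexes the pipeline documents");
    println!("{} words, keywords: {:?}", e.word_count, e.keywords);
}
```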
Knowledge graph extraction (GraphRAG)
For corpora rich in relationships — regulatory documents, project archives, business knowledge bases, audit reports — vector search alone is not enough. It retrieves similar passages. It does not understand the links between the entities mentioned within them.
Lexiane’s GraphRAG engine automatically extracts knowledge triplets from your documents — subject, predicate, object — and stores them in a persistent RDF triplestore. The resulting base understands relationships between people, organizations, projects, dates, and regulations.
What this makes possible:
- “Which suppliers are mentioned in the 2023 audits AND in active contracts?”
- “Which projects are linked to this manager and to which regulation?”
- “Identify the dependency chains between components mentioned in these 500 technical datasheets.”
Multi-hop traversal of the graph produces information that vector search alone cannot structurally reach.
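A toy sketch of triplet storage and multi-hop traversal, with a std `HashMap` standing in for the persistent RDF triplestore (the entity names are invented for the example):

```rust
use std::collections::{HashMap, HashSet};

/// Adjacency map of subject -> [(predicate, object)] triplets.
struct TripleStore {
    edges: HashMap<String, Vec<(String, String)>>,
}

impl TripleStore {
    fn new() -> Self {
        Self { edges: HashMap::new() }
    }

    fn insert(&mut self, subject: &str, predicate: &str, object: &str) {
        self.edges
            .entry(subject.to_string())
            .or_default()
            .push((predicate.to_string(), object.to_string()));
    }

    /// Multi-hop traversal: every entity reachable from `start`
    /// within `hops` edges.
    fn reachable(&self, start: &str, hops: usize) -> HashSet<String> {
        let mut seen = HashSet::new();
        let mut frontier = vec![start.to_string()];
        for _ in 0..hops {
            let mut next = Vec::new();
            for node in frontier {
                for (_, object) in self.edges.get(&node).into_iter().flatten() {
                    if seen.insert(object.clone()) {
                        next.push(object.clone());
                    }
                }
            }
            frontier = next;
        }
        seen
    }
}

fn main() {
    let mut store = TripleStore::new();
    store.insert("AcmeCorp", "mentioned_in", "Audit2023");
    store.insert("Audit2023", "covers", "ContractX");
    // Two hops link the supplier to the contract; one hop does not.
    println!("{:?}", store.reachable("AcmeCorp", 2));
}
```

This is exactly the kind of link that similarity search misses: no fragment mentions both the supplier and the contract, yet the graph connects them in two hops.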
Personal data protection by architecture
PII filtering integrated into the pipeline
Lexiane’s PII (Personally Identifiable Information) filter operates before any vectorization, any indexing, and any language model call. No sensitive data reaches your vector store or your LLM without having been processed according to your rules.
Personal data detected:
| Data type | Examples |
|---|---|
| Email addresses | john.doe@company.com |
| Phone numbers | National and international formats |
| IBAN and bank details | FR76 1234 5678 9012 3456 7890 189 |
| Social security numbers | National and European formats |
| IP addresses | IPv4 and IPv6 |
| Configurable identifiers | According to your business reference framework |
Configurable processing policies:
- Typed masking — replacement with a semantic placeholder (`[EMAIL]`, `[IBAN]`, `[PHONE]`): the type of information remains readable, the value disappears
- Deletion — complete removal of the value from the fragment
- Hashing — replacement with the cryptographic fingerprint of the value: allows referential consistency without exposing the data
This architecture guarantees GDPR compliance by construction rather than by process: the data cannot reach the storage system before being processed. This is not a best-practice rule. It is a mechanical constraint of the pipeline.
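The three policies can be sketched in a few lines of Rust. The `'@'`-based email detection below is deliberately naive and for illustration only — a real filter covers phones, IBANs, IPs, and configurable identifiers — and std’s `DefaultHasher` stands in for a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// The three processing policies from the list above.
enum PiiPolicy {
    Mask,
    Delete,
    Hash,
}

/// Apply a policy to every token that looks like an email address.
fn apply_policy(text: &str, policy: &PiiPolicy) -> String {
    text.split_whitespace()
        .map(|tok| {
            if tok.contains('@') && tok.contains('.') {
                match policy {
                    PiiPolicy::Mask => "[EMAIL]".to_string(),
                    PiiPolicy::Delete => String::new(),
                    PiiPolicy::Hash => {
                        // Same value -> same fingerprint: referential
                        // consistency without exposing the data.
                        let mut h = DefaultHasher::new();
                        tok.hash(&mut h);
                        format!("[EMAIL:{:016x}]", h.finish())
                    }
                }
            } else {
                tok.to_string()
            }
        })
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // prints "Contact [EMAIL] today"
    println!("{}", apply_policy("Contact john.doe@company.com today", &PiiPolicy::Mask));
}
```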
Traceability and cryptographic audit at every step
A tamper-evident SHA-256 integrity chain
Every action in the processing pipeline is recorded in a cryptographic audit chain. Each entry is signed by the SHA-256 hash of the previous one — any subsequent modification of an event is mathematically detectable.
Events traced in the chain:
- Document submitted for ingestion (identifier, content hash, timestamp)
- Fragments created and their chunking parameters
- Embeddings generated (model, dimension, date)
- Entities extracted for the knowledge graph
- Personal data detected and policy applied
- User queries and documents consulted
- Responses produced and their sources
This is not a logging feature. It is a structural integrity guarantee: you can prove at any time that processing occurred exactly as recorded, and that no record was modified after the fact.
For an auditor, a compliance officer, or a regulator, this chain constitutes independent technical proof of your declared processes.
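The chaining principle fits in a few lines. In this sketch, std’s `DefaultHasher` stands in for SHA-256 to keep the example dependency-free; the mechanism — each entry’s hash covering the previous entry’s hash — is the same.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One audit entry. `prev` chains it to the entry before it, so any
/// later edit to a past event breaks every subsequent hash.
struct AuditEntry {
    event: String,
    prev: u64,
    hash: u64,
}

fn append(chain: &mut Vec<AuditEntry>, event: &str) {
    let prev = chain.last().map_or(0, |e| e.hash);
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    event.hash(&mut h);
    chain.push(AuditEntry { event: event.to_string(), prev, hash: h.finish() });
}

/// Recompute the whole chain; any tampered entry makes this fail.
fn verify(chain: &[AuditEntry]) -> bool {
    let mut prev = 0u64;
    for entry in chain {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        entry.event.hash(&mut h);
        if entry.prev != prev || entry.hash != h.finish() {
            return false;
        }
        prev = entry.hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, "document ingested");
    append(&mut chain, "3 fragments created");
    println!("valid: {}", verify(&chain)); // prints "valid: true"
    chain[0].event = "tampered".to_string();
    println!("valid: {}", verify(&chain)); // prints "valid: false"
}
```

Note what this property is: tamper-evidence, not tamper-proofing. An attacker can still alter a record, but cannot do so without every later hash failing verification.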
Output quality evaluation
Measured metrics, not assumptions
The quality of a document processing pipeline is not decreed at installation — it is measured in production, on your real data. Lexiane integrates RAGAS evaluation metrics at pipeline output:
- Faithfulness: is the produced response grounded in the retrieved sources?
- Relevance: do the retrieved sources actually answer the question asked?
- Context precision: are the retrieved fragments specifically relevant?
- Context recall: did the pipeline retrieve all available information?
Input guardrails detect prompt injection attempts and out-of-scope queries before they reach the pipeline. Output guardrails verify the produced response before transmission to the user.
The relevance gate evaluates the overall confidence score of the retrieved context. If the sources are not sufficiently reliable to produce a grounded response, the system abstains — rather than generating a poorly grounded response. This is the opposite of hallucination: a system that knows when it does not know. For cases requiring multiple retrieval iterations, Agentic RAG automates this process.
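The abstention logic of such a gate can be sketched in a few lines. Mean-score aggregation and the threshold are assumptions for illustration, not Lexiane’s actual scoring.

```rust
/// Returns the aggregate confidence if the retrieved context clears
/// the threshold, or `None` when the system should abstain.
fn relevance_gate(scores: &[f64], threshold: f64) -> Option<f64> {
    if scores.is_empty() {
        return None; // nothing retrieved: abstain
    }
    let mean = scores.iter().sum::<f64>() / scores.len() as f64;
    if mean >= threshold { Some(mean) } else { None }
}

fn main() {
    // Strong sources clear the gate and ground a response.
    println!("{:?}", relevance_gate(&[0.9, 0.8], 0.7));
    // Weak sources: abstain rather than hallucinate (prints "None").
    println!("{:?}", relevance_gate(&[0.2, 0.3], 0.7));
}
```

The design point is the `Option` return type: abstention is a first-class outcome the caller must handle, not an error path that can be ignored.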
Lexiane as an autonomous data processing pipeline
These capabilities are not reserved for conversational RAG use cases. Lexiane can be deployed as a pure data processing pipeline, independently of any generation interface:
- Mass document extraction and normalization on your existing archives
- PII detection and anonymization on a corpus before regulatory migration or archiving
- Knowledge graph construction from your reference documents
- Cryptographic audit of all your document flows
- Vector indexing of your base for semantic search without LLM
The same architectural rigor, the same audit trail, the same data protection — applied to your existing processing flows, without a conversational interface if you do not need one.
Three deployment modes, one pipeline
Air-gapped — absolute sovereignty
Parsing, chunking, enrichment, PII filtering, vector indexing, and graph construction: the entire pipeline executes locally in a single binary. Zero network calls. Zero outbound data. Deployable in a classified network, a sovereign datacenter, or an industrial site without permanent connectivity.
Cloud — maximum power
Cloud embedding models and LLMs (OpenAI, Anthropic) activated via environment variable. The pipeline remains identical — only the adapters change. If tomorrow you replace OpenAI with a self-hosted model, your processing pipeline does not change by a single line.
Hybrid — sensitive data on-site, generation in the cloud
Embeddings are computed locally on your documents. Generation is delegated to a cloud model only on anonymized context fragments. Your source documents never leave. The cloud LLM receives excerpts — not your files.
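As an illustration, selecting a mode in the single configuration file could look like this — every key name here is a hypothetical sketch, not Lexiane’s actual schema:

```toml
# Hypothetical configuration sketch -- key names are assumptions.
[pipeline]
mode = "hybrid"        # "air-gapped" | "cloud" | "hybrid"

[embeddings]
provider = "local"     # sensitive data stays on-site

[generation]
provider = "cloud"     # receives anonymized fragments only

[pii]
policy = "mask"        # "mask" | "delete" | "hash"
```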
Verifiable technical guarantees
| Guarantee | Enforcement mechanism |
|---|---|
| No risky memory operations in the core | #![forbid(unsafe_code)] enforced by the compiler — not by code review |
| No ignorable error paths | #[must_use] on all results — an ignored path is a compilation error |
| No unwrap() / panic!() in production | Guaranteed by continuous automated testing |
| Audit chain integrity | Chained SHA-256 — any modification is mathematically detectable |
| Validation of dependencies between stages | At assembly, before execution — configuration errors do not reach the runtime |
| Zero vendor dependencies in the certified core | Verified by automated test at compilation |
1,254 automated tests run continuously. 27 independent modules, each with its own compilation boundaries. 25 abstraction interfaces define all contact points between the core and the outside.
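To make the first two table rows concrete, here is what those compiler-enforced attributes look like in Rust — the function itself is invented for illustration:

```rust
#![forbid(unsafe_code)]   // any `unsafe` block anywhere is a compile error
#![deny(unused_must_use)] // silently dropping a flagged result is one too

#[must_use = "a parsing error must be handled, not ignored"]
fn parse_doc(raw: &str) -> Result<usize, String> {
    if raw.is_empty() {
        Err("empty document".to_string())
    } else {
        Ok(raw.len())
    }
}

fn main() {
    // A bare `parse_doc("x");` statement would not compile under
    // `deny(unused_must_use)`: the result has to be consumed.
    match parse_doc("quarterly report") {
        Ok(bytes) => println!("parsed {bytes} bytes"),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```

These are static guarantees: they hold for every build, independently of reviewer attention or test coverage.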
What your teams gain concretely
For your CISO
Every piece of data processed is traced. Every PII policy is applied mechanically, not by convention. The SHA-256 audit trail constitutes independent technical proof of your processing procedures — consultable, exportable, tamper-evident.
For your DPO
GDPR compliance is not a checkbox after deployment. It is inscribed in the architecture: personal data cannot reach your vector store or your LLM without having been processed according to your rules. The processing register is auditable from the cryptographic chain.
For your CTO
A single binary, no runtime, no package manager, no secondary server. The entire pipeline — parsing, chunking, PII, embeddings, indexing — deploys like any Linux binary. No 800 MB Docker image. No Python dependencies to maintain. One TOML configuration file. That is all.
For your compliance teams in regulated sectors
Lexiane is designed for certification. IEC 62304 Ed. 2 (publication expected August 2026) will introduce explicit requirements on AI/ML systems in medical devices; ISO 26262 sets comparable expectations in automotive. Lexiane is compilable with Ferrocene, the Rust compiler qualified ASIL D / SIL 4. Your qualification dossier traces from the deployed binary back to the compiler used to produce it.
Let’s talk about your document corpus.
Every data processing task has its own constraints: format, volume, sensitivity, sector regulation, auditability requirements. We do not offer generic demonstrations.
We offer an exchange on your concrete case: your documents, your constraints, your compliance questions. And an honest assessment of what Lexiane can do — including if the answer is “not yet” or “not like this.”
→ Contact us
Request access to the Auditable Core
Sign up to be notified when our Core audit programme opens. In accordance with our privacy policy, your professional email address will be used exclusively for this technical communication, with no subsequent marketing use. Access distributed via secure private registry.