Data

STEP 1

Ingest & decode

Open the file safely and confirm it is readable UTF-8. Corrupt encodings, binary blobs, and unparseable JSON are caught before they poison anything downstream.

Garbage in means garbage embeddings — and silent retrieval failures later.
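
As a rough illustration of this step, the sketch below decodes raw bytes as strict UTF-8 and optionally checks that they parse as JSON. The function name `decode_payload`, the `expect_json` flag, and the NUL-byte heuristic for spotting binary content are assumptions made for the example, not part of the pipeline described here.

```python
import json


def decode_payload(raw: bytes, expect_json: bool = False) -> dict:
    """Decode bytes as strict UTF-8 and report whether they are safe to pass on."""
    # NUL bytes are a cheap heuristic for binary content (an assumption,
    # not a complete binary detector).
    if b"\x00" in raw:
        return {"ok": False, "reason": "binary content"}

    try:
        # utf-8-sig also strips a leading byte-order mark if one is present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        return {"ok": False, "reason": f"invalid UTF-8 at byte {exc.start}"}

    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            return {"ok": False, "reason": f"unparseable JSON: {exc.msg}"}

    return {"ok": True, "text": text}
```

A caller would read the file in binary mode and stop the pipeline whenever `ok` is false, so nothing undecodable reaches the later steps.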

STEP 2

Profile the structure

Detect rows/columns or document blocks, infer column types, and count distinct values and nulls. This is the map the rest of the pipeline navigates by.

You can't clean what you haven't measured.
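
For tabular input, profiling might look something like the sketch below, which uses only the standard library. It assumes CSV text with a header row; `profile_csv` and `infer_type` are hypothetical helper names, and the type inference is deliberately simple (int, float, or string).

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Guess a column type from its non-empty string values."""
    if not values:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "str"


def profile_csv(text: str) -> dict:
    """Count rows and, per column, the inferred type, distinct values and nulls."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for col in reader.fieldnames or []:
        cells = [(row.get(col) or "").strip() for row in rows]
        present = [c for c in cells if c]  # empty cells count as nulls
        profile[col] = {
            "type": infer_type(present),
            "distinct": len(set(present)),
            "nulls": len(cells) - len(present),
        }
    return {"rows": len(rows), "columns": profile}
```

Document-style inputs would need a different profiler (block detection rather than columns); this sketch covers only the rows/columns case.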

STEP 3

Clean & normalize

Drop empty and duplicate rows, fix inconsistent dates and casing, resolve missing values, and strip dead columns and boilerplate.

Duplicates skew retrieval; boilerplate wastes tokens; inconsistency confuses the model.
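
A minimal sketch of the cleaning pass, assuming rows arrive as dictionaries of strings. The accepted date formats, the day-first reading of `03/02/2024`, and the `date` field name are all assumptions chosen for illustration.

```python
from datetime import datetime

# Assumed input formats; ambiguous dates are read day-first here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_date(value: str) -> str:
    """Return an ISO 8601 date, or the original value if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognised values untouched for human review


def clean_rows(rows: list[dict], date_fields: tuple = ("date",)) -> list[dict]:
    """Trim whitespace, normalise dates, and drop empty and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        norm = {k: (v or "").strip() for k, v in row.items()}
        if not any(norm.values()):
            continue  # drop rows with no content at all
        for field in date_fields:
            if norm.get(field):
                norm[field] = normalize_date(norm[field])
        # Case-insensitive key so "ACME" and "acme" count as duplicates.
        key = tuple(sorted((k, v.casefold()) for k, v in norm.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned
```

The first occurrence of a duplicate is kept with its original casing; a real pipeline would also need a policy for which copy is authoritative.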

STEP 4

Apply org guidelines

Enforce the rulebook: admissibility, freshness/versioning, provenance and licensing, and required taxonomy/metadata tags.

This is where a stale Policy v2.7 gets quarantined before it can contradict v3.1 in answers.
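
The freshness rule can be sketched as follows: keep the highest version of each document and quarantine the rest. The record shape (`title` and `version` keys) and the `vMAJOR.MINOR` label format are assumptions for the example.

```python
def parse_version(label: str) -> tuple[int, ...]:
    """Turn 'v3.1' into (3, 1) so versions compare numerically, not as text."""
    return tuple(int(part) for part in label.lstrip("vV").split("."))


def split_by_freshness(docs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (admitted, quarantined): the newest version per title, then the rest."""
    latest = {}
    for doc in docs:
        current = latest.get(doc["title"])
        if current is None or parse_version(doc["version"]) > parse_version(current["version"]):
            latest[doc["title"]] = doc
    admitted = list(latest.values())
    quarantined = [doc for doc in docs if doc is not latest[doc["title"]]]
    return admitted, quarantined
```

Comparing parsed tuples rather than raw strings matters: as text, `v1.9` sorts after `v1.10`, which would admit the stale copy.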

STEP 5

Clear sensitive data

Scan for email addresses, phone numbers, SSNs, card numbers, and IP addresses. Redact, mask, or escalate for sign-off before anything is embedded.

Anything embedded becomes retrievable — PII leaks are a one-way door.
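
A scan of this kind could be sketched with regular expressions, as below. The patterns are simplified assumptions (US-style phone numbers and SSNs, IPv4 only, no checksum validation on card numbers) and will both miss cases and produce false positives, so treat this as an illustration rather than a production detector.

```python
import re

# Simplified, assumed patterns; order matters because each pass rewrites the text.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13 to 16 digits
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict]:
    """Replace each match with a [LABEL] placeholder and count hits per type."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(f"[{label.upper()}]", text)
        if hits:
            counts[label] = hits
    return text, counts
```

The returned counts give a reviewer something concrete to sign off on: a file with any nonzero count can be escalated instead of being embedded automatically.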

STEP 6

Chunk-readiness & gate

Confirm the content segments cleanly within the embedding target band, then roll everything into a readiness score and an explicit gate. A human owner signs off.

An honest gate keeps the knowledge base — and every answer built on it — trustworthy.
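
One way to approximate the chunk-readiness check is to split the text into candidate chunks and measure how many fall inside the target band. In the sketch below, splitting on blank lines, counting whitespace-separated words as a stand-in for tokens, and the default band of 50 to 400 are all assumptions; the actual band and tokenizer are not specified here.

```python
def chunk_readiness(text: str, band: tuple[int, int] = (50, 400)) -> dict:
    """Report how many paragraph-level chunks fall inside the size band."""
    low, high = band
    # Paragraphs separated by blank lines stand in for real chunks.
    chunks = [c for c in text.split("\n\n") if c.strip()]
    # Whitespace word count is a rough proxy for token count.
    sizes = [len(c.split()) for c in chunks]
    in_band = sum(low <= size <= high for size in sizes)
    share = in_band / len(sizes) if sizes else 0.0
    return {"chunks": len(sizes), "in_band": in_band, "share": share}
```

The `share` value is the kind of number that could feed into the readiness score, alongside the results of the earlier steps.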

The org rulebook

Every guideline a file is scored against, and why it matters downstream

Admissibility
Only approved source types and classifications may enter the knowledge base.
Inadmissible sources become answerable facts the AI was never meant to surface.
Format & Schema
Files must decode cleanly and meet structure/encoding standards.
Broken encoding and malformed rows produce garbled chunks and silent retrieval gaps.
De-duplication
One authoritative copy per fact; near-duplicates are flagged.
Duplicates inflate retrieval frequency and bias the AI toward repeated content.
Freshness & Versioning
Stale or superseded versions must be quarantined before ingestion.
Stale versions cause the conflicting answers the RAG lab reports (e.g. Policy v2.7 vs v3.1).
Privacy & PII
PII must be redacted or explicitly approved before embedding.
Anything embedded becomes retrievable, so unredacted PII is a privacy-gate failure in governance.
Provenance & Licensing
Source and usage rights verified by a named owner.
Unlicensed content embedded into answers is a compliance and legal exposure.
Taxonomy & Metadata
Required tags present: domain, owner, sensitivity, effective date.
Missing metadata blocks filtering, access control, and trustworthy citations.
Chunk-readiness
Content segments cleanly within the embedding target band.
Oversized or boilerplate-heavy chunks degrade retrieval precision.
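
As one example of scoring a file against the rulebook, the Taxonomy & Metadata rule above could be checked roughly like this. The tag keys mirror the four required tags listed in the rulebook; the snake_case spelling `effective_date` and the function name `missing_tags` are assumptions.

```python
# The four required tags from the rulebook, spelled as assumed metadata keys.
REQUIRED_TAGS = ("domain", "owner", "sensitivity", "effective_date")


def missing_tags(metadata: dict) -> list[str]:
    """Return the required tags that are absent or blank in a file's metadata."""
    return [
        tag
        for tag in REQUIRED_TAGS
        if not str(metadata.get(tag) or "").strip()
    ]
```

An empty result means the rule passes; anything else lists exactly which tags the file's owner still has to supply.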

The approval gate

No file is handed to the RAG Evaluator without clearing all four checks

Quality
Dedup, completeness & format thresholds met
Privacy
PII redacted or explicitly approved
Provenance
Source & license verified by an owner
Chunking
Chunk sizes within the embedding target band

The Data Lab prevents the failures the RAG Evaluator detects.
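
The all-four rule of the approval gate can be expressed as a small data structure, sketched below. The class name `GateResult` and its boolean fields are hypothetical; how each check is computed is left to the earlier steps.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    """Outcome of the four approval checks for one file."""
    quality: bool
    privacy: bool
    provenance: bool
    chunking: bool

    def failed(self) -> list[str]:
        """Names of the checks that did not pass, in declaration order."""
        return [name for name, ok in vars(self).items() if not ok]

    @property
    def approved(self) -> bool:
        # A file is approved only when every check has passed.
        return not self.failed()
```

For example, `GateResult(True, True, False, True).failed()` names `provenance` as the blocker, which gives the human owner a specific item to resolve before sign-off.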