Ingest & decode
Open the file safely and confirm it is readable UTF-8. Corrupt encodings, binary blobs, and unparseable JSON are caught before they poison anything downstream.
Garbage in means garbage embeddings — and silent retrieval failures later.
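As a minimal sketch of this first check, assuming a NUL byte is a good enough signal for binary content and that JSON files are identified by a `.json` extension (both are illustrative assumptions, as are the function names):

```python
import json
from pathlib import Path


def check_bytes(raw: bytes, name: str) -> dict:
    """Return {'ok': True, 'text': ...} or {'ok': False, 'reason': ...}."""
    if b"\x00" in raw:
        # Assumption: a NUL byte is treated as a sign of a binary blob.
        return {"ok": False, "reason": "binary content"}
    try:
        # Strict decoding raises on invalid bytes; "utf-8-sig" also drops a BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return {"ok": False, "reason": "invalid UTF-8"}
    if name.lower().endswith(".json"):
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            return {"ok": False, "reason": f"unparseable JSON: {exc.msg}"}
    return {"ok": True, "text": text}


def ingest(path: str) -> dict:
    """Read the file as bytes and run the checks above on its content."""
    try:
        raw = Path(path).read_bytes()
    except OSError as exc:
        return {"ok": False, "reason": f"unreadable file: {exc}"}
    return check_bytes(raw, path)
```

Returning a reason instead of raising keeps rejected files reportable rather than crashing the batch.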
Profile the structure
Detect rows/columns or document blocks, infer column types, count distinct values and nulls. This is the map the rest of the pipeline navigates by.
You can't clean what you haven't measured.
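A profile for tabular input might look like the sketch below. It assumes rows arrive as dictionaries of strings (for example from `csv.DictReader`); the null markers and the three-way type guess are illustrative choices, not a fixed rule.

```python
from collections import Counter

# Assumption: these strings count as "null" once stripped and lower-cased.
NULLS = {"", "null", "none", "n/a", "na"}


def infer_type(value: str) -> str:
    """Guess int, float, or str for a single non-null value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "str"


def profile(rows: list[dict[str, str]]) -> dict[str, dict]:
    """Per column: dominant type, null count, distinct non-null values."""
    columns = {}
    # Column names are taken from the first row only.
    for name in (rows[0].keys() if rows else []):
        values = [(row.get(name) or "").strip() for row in rows]
        present = [v for v in values if v.lower() not in NULLS]
        types = Counter(infer_type(v) for v in present)
        columns[name] = {
            "type": types.most_common(1)[0][0] if types else "unknown",
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return columns
```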
Clean & normalize
Drop empty and duplicate rows, fix inconsistent dates and casing, resolve missing values, and strip dead columns and boilerplate.
Duplicates skew retrieval; boilerplate wastes tokens; inconsistency confuses the model.
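The cleaning pass can be sketched as follows. The accepted date formats and the `date` and `email` column names are assumptions chosen for illustration; unrecognized dates are left untouched so a reviewer can resolve them.

```python
from datetime import datetime

# Assumed input formats; each is tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def normalize_date(value: str) -> str:
    """Convert a recognized date to ISO 8601, else return it unchanged."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value


def clean(rows, date_cols=("date",), lower_cols=("email",)):
    """Strip whitespace, drop empty and duplicate rows, normalize dates and casing."""
    seen, out = set(), []
    for row in rows:
        row = {k: (v or "").strip() for k, v in row.items()}
        if not any(row.values()):
            continue  # fully empty row
        for col in date_cols:
            if row.get(col):
                row[col] = normalize_date(row[col])
        for col in lower_cols:
            if col in row:
                row[col] = row[col].lower()
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        out.append(row)
    return out
```

Deduplicating after normalization matters: `05/03/2024` and `2024-03-05` only collapse into one row once both are in the same format.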
Apply org guidelines
Enforce the rulebook: admissibility, freshness/versioning, provenance and licensing, and required taxonomy/metadata tags.
This is where a stale Policy v2.7 gets quarantined before it can contradict v3.1 in answers.
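One way to express such a rulebook is a function that returns a list of violations for a file's metadata. Everything named here (the metadata fields, the required tags, the license allow-list, the `latest` version lookup) is hypothetical and stands in for whatever the organization actually defines.

```python
REQUIRED_TAGS = {"owner", "domain"}           # hypothetical taxonomy
ALLOWED_LICENSES = {"internal", "cc-by-4.0"}  # hypothetical allow-list


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v2.10' into (2, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def check_guidelines(meta: dict, latest: dict[str, str]) -> list[str]:
    """Return every guideline the file violates; an empty list means it passes."""
    violations = []
    version = meta.get("version", "v0")
    current = latest.get(meta.get("doc_family", ""))
    if current and parse_version(version) < parse_version(current):
        violations.append(f"stale: {version} superseded by {current}")
    if not meta.get("source"):
        violations.append("missing provenance")
    if meta.get("license") not in ALLOWED_LICENSES:
        violations.append("license not on allow-list")
    missing = REQUIRED_TAGS - set(meta.get("tags", {}))
    if missing:
        violations.append(f"missing tags: {sorted(missing)}")
    return violations
```

With `latest = {"policy": "v3.1"}`, a file tagged `v2.7` in the `policy` family comes back with a `stale` violation and can be quarantined.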
Clear sensitive data
Scan for email addresses, phone numbers, SSNs, payment card numbers, and IP addresses. Redact, mask, or escalate for sign-off before anything is embedded.
Anything embedded becomes retrievable — PII leaks are a one-way door.
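A regex-based scan is one simple way to sketch this step. The patterns below are deliberately rough assumptions (US-style SSNs and phone numbers, IPv4 only) and will both miss and over-match real data; card candidates are filtered with the Luhn checksum to cut false positives.

```python
import re

# Illustrative patterns only; order matters because matches are replaced in turn.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def luhn_ok(candidate: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with [LABEL] placeholders and count what was found."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            if label == "card" and not luhn_ok(match.group(0)):
                return match.group(0)  # fails the checksum, so leave it alone
            counts[label] = counts.get(label, 0) + 1
            return f"[{label.upper()}]"
        text = pattern.sub(replace, text)
    return text, counts
```

The returned counts give a reviewer something to act on: a non-empty result can trigger masking automatically or route the file for sign-off.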
Chunk-readiness & gate
Confirm the content splits cleanly into chunks that fit the embedding size band, then roll every check into a readiness score and an explicit gate. A human owner signs off.
An honest gate keeps the knowledge base — and every answer built on it — trustworthy.
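The gate could be sketched like this, assuming paragraph-based chunking, whitespace-separated words as a rough stand-in for tokens, and a score equal to the share of chunks inside the band. The band limits, threshold, and `signed_off_by` parameter are all hypothetical.

```python
def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines; a real pipeline may chunk differently."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def readiness(text, min_tokens=50, max_tokens=400, threshold=0.9,
              signed_off_by=None):
    """Score chunk fit and open the gate only with a named human sign-off."""
    chunks = chunk_paragraphs(text)
    sizes = [len(chunk.split()) for chunk in chunks]  # word count as token proxy
    in_band = sum(min_tokens <= size <= max_tokens for size in sizes)
    score = in_band / len(sizes) if sizes else 0.0
    return {
        "chunks": len(chunks),
        "score": round(score, 2),
        "gate_open": score >= threshold and signed_off_by is not None,
    }
```

Requiring `signed_off_by` in addition to the score means a high number alone never opens the gate.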
The org rulebook
Every guideline a file is scored against, and why it matters downstream
The approval gate
No file is handed to the RAG Evaluator without clearing all four