Role
Lead · LLM safety layer
Stack
Python · LLM Evals · Schema validation
Year
2025 · Deployed
Hero figure — guardrails pipeline
The problem

Generated tests looked right, but weren't always right.

[The gap: LLMs produced syntactically valid test cases that sometimes missed coverage requirements, called hallucinated APIs, or violated safety rules. Engineers couldn't trust the output without re-reading every case, which defeated the purpose.]
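One concrete failure mode is a test that parses cleanly but calls a helper the framework doesn't have. A minimal sketch of catching that statically, assuming a hypothetical allowed API surface (`ALLOWED_CALLS` here is invented for illustration; the real pipeline would derive it from the framework's actual exports):

```python
import ast

# Hypothetical API surface of the test framework (assumption: in practice
# this set would be generated from the framework's real exports).
ALLOWED_CALLS = {"assert_equal", "assert_raises", "setup_fixture"}

def hallucinated_calls(test_source: str) -> set[str]:
    """Return names of top-level function calls not in the known API."""
    tree = ast.parse(test_source)  # raises SyntaxError if not even valid Python
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return called - ALLOWED_CALLS

generated = "setup_fixture()\nassert_approximately(1, 1.0)\n"
print(hallucinated_calls(generated))  # flags the invented helper
```

The point of the sketch: the test is syntactically valid, so a naive "does it parse?" gate passes it, but the call-name check rejects it before an engineer ever reads it.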

Approach

Validate before the engineer ever sees it.

[The layers: schema validation against the test framework, coverage checks against the spec, safety filters for sensitive patterns, and a fallback path when generation fails the bar. Each layer rejects or repairs before output reaches the user.]

Validation pipeline diagram
What changed

Engineers started shipping the AI's tests.

[Outcomes: trust went up, review time went down, adoption became real. The number that matters most is whichever you can share — fewer rejected tests, faster iteration, more cases shipped without rewrite.]

Quality metrics — before / after
What I learned

The model isn't the product — the system around it is.

[Reflection: a great LLM with no guardrails is a demo. A modest LLM with strong guardrails is a tool. Most of the work was in the layer that decided what counted as "good enough to ship."]
