When AI first entered Quality Engineering conversations, the question was almost inevitable:
Can this help us?
So teams experimented.
They ran models against logs. They tried generating tests. They asked AI to explain failures, suggest assertions, even “improve coverage”.
At first, everything looked promising.
And then something more interesting happened.
The question stopped being whether AI could help, and became where it quietly made things worse.
This post is written from that point — not exploration, but boundary-setting after experience.
Where AI Actually Helped
The first place AI delivered durable value was not test creation.
It was failure explanation.
Given logs, traces, or error output, AI could:
- summarize noisy failures,
- group related issues,
- surface patterns across runs,
- reduce the cognitive load of triage.
Importantly, none of this changed system behavior.
AI didn’t decide what was correct. It didn’t assert expectations. It didn’t modify tests.
It simply made existing signals easier to understand.
That distinction turned out to matter more than anything else.
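The triage help described above can be made concrete. Here is a minimal sketch of failure grouping: collapse the volatile details in error lines (ids, addresses, counts) into a signature so related failures cluster together. The function names and sample messages are illustrative, not from any particular tool.

```python
import re
from collections import Counter

def signature(error_line: str) -> str:
    """Collapse volatile details (hex addresses, numbers) so that
    related failures share one signature."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", error_line)
    sig = re.sub(r"\d+", "<n>", sig)
    return sig.strip()

def group_failures(error_lines):
    """Group raw failure lines by normalized signature, ranked by frequency."""
    return Counter(signature(line) for line in error_lines).most_common()

# Three raw failures collapse into two groups.
failures = [
    "TimeoutError: request 4821 exceeded 3000 ms",
    "TimeoutError: request 9034 exceeded 3000 ms",
    "AssertionError: expected status 200, got 503",
]
for sig, count in group_failures(failures):
    print(count, sig)
```

Note what this does not do: it never decides which failure is correct behavior. It only reduces the pile a human has to read.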
Where It Started to Hurt
Problems didn’t appear as crashes or obvious defects.
They appeared as confidence.
AI-generated tests looked reasonable. AI-generated assertions sounded correct. AI-generated explanations felt complete.
And when AI was wrong, it wasn’t obviously wrong.
It was plausibly wrong.
A broken test fails loudly. A wrong test passes quietly.
Over time, teams noticed a pattern:
- coverage numbers went up,
- confidence went up,
- understanding went down.
That’s when the boundaries began to form.
Why Verifiability Became the Line
AI worked best when its output could be deterministically challenged.
If a summary didn’t match logs, it could be rejected. If an explanation contradicted known behavior, it could be questioned. If a pattern didn’t hold across runs, it could be dismissed.
The moment output couldn’t be verified, risk didn’t disappear.
It just moved downstream.
Quality Engineering isn’t about generating artifacts. It’s about preserving confidence under change.
Unverifiable output weakens that confidence, no matter how fast it’s produced.
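A deterministic challenge can be as blunt as this sketch: every failure an AI summary cites must literally appear in the logs, or the summary is rejected. The function and the sample log lines are hypothetical; the point is that acceptance is mechanical, not persuasive.

```python
def verify_summary(summary_claims, log_text):
    """Deterministic challenge: every failure the summary cites must
    actually appear in the logs. Return (accepted, unsupported claims)."""
    missing = [claim for claim in summary_claims if claim not in log_text]
    return (len(missing) == 0, missing)

log_text = (
    "ERROR PaymentService: card declined\n"
    "ERROR AuthService: token expired"
)
# A summary citing a failure absent from the logs is rejected outright.
ok, missing = verify_summary(
    ["card declined", "token expired", "database unreachable"], log_text
)
print(ok, missing)  # False ['database unreachable']
```

The check is crude by design: a plausible-sounding claim that cannot be matched to evidence moves no further downstream.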
Why Explanation Scaled and Creation Didn’t
Failure explanation scales because it supports judgment.
Test creation doesn’t, because it replaces judgment.
Tests encode intent:
- what correctness means,
- which paths matter,
- what must never break.
When that intent came from AI, no one truly owned it.
And without ownership, accountability dissolved.
That’s why this rule emerged naturally, not philosophically:
AI may explain what happened. It must not decide what should exist.
Hallucinations Weren’t the Real Problem
At some point, hallucinations entered the discussion.
But teams that looked closely realized something uncomfortable:
Hallucinations weren’t passing because models were bad. They were passing because nothing was positioned to reject them.
Weak assertions. Implicit expectations. Missing validation checkpoints.
AI didn’t bypass safeguards.
There simply weren’t any.
Once validation became explicit, hallucinations became boring — and boring failures are the safest kind.
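One shape such an explicit checkpoint can take, sketched with an assumed response schema: an AI-suggested assertion may only target fields that provably exist. The schema and field names here are invented for illustration.

```python
# Assumed: the fields the real API response actually contains.
ACTUAL_RESPONSE_FIELDS = {"id", "status", "created_at"}

def validate_suggested_assertion(field: str) -> bool:
    """Explicit checkpoint: an AI-suggested assertion may only target
    fields that provably exist. Hallucinated fields fail here, loudly."""
    return field in ACTUAL_RESPONSE_FIELDS

suggestions = ["status", "last_login"]  # "last_login" is hallucinated
for field in suggestions:
    verdict = "accept" if validate_suggested_assertion(field) else "reject"
    print(field, verdict)
```

With a checkpoint like this in place, a hallucinated field produces a boring, immediate rejection instead of a quietly passing test.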
Why Maintenance Felt Safe
Test maintenance told a different story.
Selectors drift. Fields get renamed. Refactors happen.
AI could help spot those changes without redefining intent.
The original test still existed. The original rationale still mattered. A human still decided what changed.
That asymmetry mattered more than any benchmark.
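The maintenance pattern above can be sketched with the standard library alone: fuzzy-match a broken selector against the selectors that currently exist, and return a proposal rather than applying a change. The selector names are made up; `difflib.get_close_matches` is real Python stdlib.

```python
import difflib

def propose_selector_fix(broken_selector, current_selectors):
    """Spot likely drift (a renamed selector) without redefining intent:
    return a proposal for a human to accept, never apply it automatically."""
    matches = difflib.get_close_matches(broken_selector, current_selectors, n=1)
    return matches[0] if matches else None

current = ["submit-button", "email-input", "password-input"]
proposal = propose_selector_fix("submit-btn", current)
print(proposal)  # a human still decides whether to accept "submit-button"
```

The original test, its assertions, and its rationale are untouched; the tool only narrows the search for what drifted.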
Guardrails Changed the Conversation
Eventually, guardrails stopped being about control and started being about containment.
The goal wasn’t to make AI perfect.
It was to make AI mistakes:
- local,
- visible,
- reversible.
Human approval stopped feeling like friction and started feeling like design.
That’s when things stabilized.
The Model That Survived
What survived experimentation was simple and conservative:
- AI assists; humans decide
- Verification precedes acceptance
- Ownership is explicit
- Unverifiable output is rejected
If no one can explain why something exists, it shouldn’t. If no one owns a decision, it’s already a liability.
Closing Thought
Quality Engineering has always existed to answer one question:
How confident are we — and why?
AI helps when it strengthens that answer. It hurts when it replaces it with plausibility.
Drawing boundaries isn’t anti-AI.
It’s what happens after you’ve actually used it.