Your AI stack is only as good as your evaluation layer

Ryan Walker · 7 min read · Updated May 31, 2026

Most AI failures are evaluation failures, not generation failures. The model produced something. Nobody checked if it met the standard. That gap — between output and standard — is where quality dies.

The evaluation layer is the most important part of your AI stack. Not the model. Not the prompt. The check that runs after the model and before the output ships.

What an evaluation layer is

An evaluation layer is a step that runs after AI generates output and before that output reaches a user, a customer, or a published page. It checks the output against defined criteria and returns a verdict.

It can take several forms: a human review, a structured checklist, a second AI prompt acting as a critic, or an automated test suite. These are not mutually exclusive. Most mature stacks use more than one.

The minimum viable version is a critic prompt. One additional AI call that reads the first output and scores it against your criteria. It costs a fraction of a cent and catches failures before they ship.

How to write a critic prompt

A critic prompt takes the output of the first AI call and evaluates it against specific criteria. The structure is consistent regardless of what you are evaluating.

  1. Here is the output. Paste the full text of the generated content.
  2. Here are the criteria it must meet. List each criterion explicitly — no vague standards.
  3. Score it on each criterion. Ask for a score (e.g., 1–5) and a one-sentence rationale per criterion.
  4. Flag any failures. Define what constitutes a failure for each criterion.
  5. Return pass or fail with reasons. The final output is a structured verdict, not a paragraph of feedback.
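
The five-part structure above can be assembled in code so every critic call has the same shape. This is a minimal sketch, not a fixed template: the `Criterion` fields and the `build_critic_prompt` name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str       # e.g. "Brand voice"
    standard: str   # what the output must do, stated explicitly
    scale: str      # "1-5" for a scored criterion, "pass/fail" otherwise
    failure: str    # what counts as a failure for this criterion


def build_critic_prompt(output_text: str, criteria: list[Criterion]) -> str:
    """Assemble the critic prompt: output, criteria, scoring, failures, verdict."""
    lines = [
        "You are a quality critic. Evaluate the output below.",
        "",
        "OUTPUT:",
        output_text,
        "",
        "CRITERIA:",
    ]
    # One numbered line per criterion, including its failure definition.
    for i, c in enumerate(criteria, start=1):
        lines.append(f"{i}. {c.name}: {c.standard} Scale: {c.scale}. Failure: {c.failure}")
    lines += [
        "",
        "For each criterion, give a score and one sentence of rationale.",
        "Then return a final verdict: PASS or FAIL.",
        "If FAIL, list the criteria that failed and why.",
    ]
    return "\n".join(lines)
```

Keeping the failure definition on the same line as the criterion makes it hard to add a criterion without also deciding what failing it means.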

Example: blog post critic prompt

You are a content quality critic. Evaluate the following blog post draft against these criteria:

  1. Brand voice: direct, practitioner, no hype adjectives, no exclamation marks. Score 1–5.
  2. GEO structure: answer-first paragraphs, FAQ block present, freshness stamp present. Score 1–5.
  3. Factual accuracy: no unsupported claims, no fabricated metrics. Score 1–5.
  4. Word count: between 600 and 1,100 words. Pass or fail.
  5. Banned phrases: none of the following appear: revolutionary, game-changing, unlock, supercharge, seamless. Pass or fail.

For each criterion, provide a score and one sentence of rationale. Then return a final verdict: PASS or FAIL. If FAIL, list the specific criteria that failed and why.

The output of this prompt is a structured verdict. If you also ask the critic to return it in a fixed format such as JSON, you can parse it, log it, and route failures back to the generator automatically.
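
A minimal sketch of that parsing step in Python. It assumes you have added an instruction asking the critic to reply in JSON. The schema shown in the docstring is hypothetical, not a standard, so adjust the keys to whatever your prompt requests.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    failures: list[str]  # one "criterion: reason" string per failed criterion


def parse_verdict(raw: str) -> Verdict:
    """Parse a critic reply of the assumed form
    {"verdict": "PASS"|"FAIL", "criteria": [{"name", "passed", "rationale"}, ...]}.
    Anything unparseable is treated as a failure so it never ships unchecked."""
    try:
        data = json.loads(raw)
        failures = [
            f"{c['name']}: {c['rationale']}"
            for c in data["criteria"]
            if not c["passed"]
        ]
        # Require both an explicit PASS and no failed criteria.
        passed = data["verdict"] == "PASS" and not failures
    except (json.JSONDecodeError, KeyError, TypeError):
        return Verdict(passed=False, failures=["critic reply was not valid verdict JSON"])
    return Verdict(passed=passed, failures=failures)
```

Treating a malformed reply as a failure is a design choice: a critic that answers in free text has not actually cleared the output.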

Defining evaluation criteria before you build

You cannot evaluate output against criteria you have not defined. This sounds obvious. Most teams skip it anyway.

Write the criteria before you build the generator. The criteria are the spec. The generator is the implementation. If you build the generator first, you will reverse-engineer your criteria from whatever the model produces — which means your standard is the model average.

For content, a working criteria set looks like this:

  • Brand voice match: tone, vocabulary, and sentence structure align with the defined voice guide
  • Factual accuracy: every claim is either verifiable or explicitly hedged
  • GEO structure: answer-first paragraphs, FAQ block with at least two questions, freshness stamp present
  • Word count range: 600–1,100 words for a Field Note
  • No banned phrases: a defined list of words and constructions that do not appear

Five criteria is enough to start. You can add more as you find new failure modes.
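
Two of these criteria, word count and banned phrases, do not need a model at all. A sketch of checking them deterministically before the critic call, using the range and the banned list from the example prompt above. The function names are illustrative, and counting words by splitting on whitespace is an assumption about how you measure length.

```python
import re

BANNED = ["revolutionary", "game-changing", "unlock", "supercharge", "seamless"]


def check_word_count(text: str, low: int = 600, high: int = 1100) -> bool:
    """Pass if the whitespace-separated word count is within [low, high]."""
    return low <= len(text.split()) <= high


def find_banned_phrases(text: str, banned: list[str] = BANNED) -> list[str]:
    """Return banned phrases that appear as whole words, case-insensitively."""
    found = []
    for phrase in banned:
        # \b limits matches to whole words, so "unlocked" does not match "unlock".
        if re.search(rf"\b{re.escape(phrase)}\b", text, flags=re.IGNORECASE):
            found.append(phrase)
    return found
```

Whole-word matching is strict on purpose. If inflected forms such as "unlocks" should also fail, add them to the list explicitly.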

What happens without evaluation

Without an evaluation layer, you ship at the model average. The model average is not your standard.

Over time, this compounds. Your blog posts drift toward generic. Your support responses start sounding like every other AI-assisted support team. Your lead qualification emails lose the specificity that made them convert. The model is consistent — consistently average.

The evaluation layer is what keeps the system on your standard, not the model's. It is the mechanism that enforces the gap between your output and everyone else's.

The Avakata critic gate

Every agent output at Avakata passes through a critic gate before shipping. The critic checks against four dimensions: brand voice, conversion principles, GEO structure, and factual accuracy.

It rejects roughly 23% of first-pass outputs. Those rejections do not go to a human — they go back to the generator with specific failure reasons attached. The generator runs again with the failure context in the prompt. Second-pass outputs pass at a significantly higher rate.

The 23% rejection rate is not a problem. It is the system working. The alternative is shipping that 23% to readers.
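
The reject-and-regenerate loop can be sketched as follows. This is an illustration of the pattern, not Avakata's implementation: `generate` and `critique` are placeholders for your own model calls, and the `max_attempts` cap and the wording of the feedback message are assumptions.

```python
from typing import Callable, Optional

# Both callables are placeholders for your own model calls.
Generator = Callable[[str], str]                  # prompt -> draft
Critic = Callable[[str], tuple[bool, list[str]]]  # draft -> (passed, failure reasons)


def run_with_critic_gate(
    task: str, generate: Generator, critique: Critic, max_attempts: int = 2
) -> tuple[Optional[str], int]:
    """Return (approved draft, attempts used), or (None, max_attempts) if every attempt fails."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt)
        passed, reasons = critique(draft)
        if passed:
            return draft, attempt
        # Feed the specific failure reasons back into the next generation.
        feedback = "\n".join(f"- {r}" for r in reasons)
        prompt = (
            f"{task}\n\nYour previous draft was rejected for these reasons:\n"
            f"{feedback}\nFix them."
        )
    return None, max_attempts
```

Capping the attempts keeps a stubborn failure from looping forever. What happens to a draft that never passes is left to the caller.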

We send our critic prompt templates to Field Notes subscribers. Get them at avakata.agency/contact.html.

How to add an evaluation layer this week

This does not require a new tool or a new vendor. It requires one afternoon.

  1. Pick your highest-volume AI output — the thing your stack generates most often.
  2. Write five evaluation criteria for it. Be specific. "Good quality" is not a criterion.
  3. Build a critic prompt using those criteria, following the structure above.
  4. Run it on your last 10 outputs.
  5. Count how many pass.

That number is your current quality baseline. If it is 6 out of 10, you are shipping a 40% failure rate. Now you know. Now you can fix it.
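
The baseline arithmetic, as a small sketch. The `baseline` helper is a hypothetical name, and the input is just the list of pass/fail verdicts from your last ten critic runs.

```python
def baseline(verdicts: list[bool]) -> tuple[float, float]:
    """Return (pass rate, failure rate) for a list of pass/fail verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate, 1 - pass_rate


# Six passes out of the last ten outputs, as in the example above.
pass_rate, failure_rate = baseline([True] * 6 + [False] * 4)
print(f"pass {pass_rate:.0%}, fail {failure_rate:.0%}")  # pass 60%, fail 40%
```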

If you want to see how we built the critic gate and what the prompt templates look like, book a discovery call. We will walk through the setup in 30 minutes.

Frequently asked questions

What is an AI evaluation layer?
An evaluation layer is a step that runs after AI generates output and before that output ships. It checks the output against defined criteria: brand voice, accuracy, structure, format. It can be a human review, a checklist, or a critic prompt (a second AI call that reviews the first output). Without it, you ship at the model's average quality, not your standard.

What is a critic prompt?
A critic prompt is a second AI call that takes the output of the first call and evaluates it against specific criteria. It scores the output on each criterion, flags failures, and returns a pass or fail with reasons. It is the minimum viable evaluation layer for any AI workflow.

How many AI outputs does the Avakata critic gate reject?
Roughly 23% of first-pass outputs are rejected by the Avakata critic gate. Those outputs are returned to the generator with specific failure reasons and regenerated. The critic checks for brand voice match, conversion principles, GEO structure, and factual accuracy.