How to measure whether your AI is actually working

Ryan Walker · 7 min read · Updated June 12, 2026

Most solopreneurs cannot answer the question: is my AI actually working? They know they are using it. They do not know if it is producing better outcomes.

Measurement is not optional. It is the feedback loop that makes the system improve. Without it, you are running an experiment with no data — spending time and money on a process you cannot evaluate.

The three measurement categories

There are three categories of AI measurement, ordered by difficulty and importance.

Time saved. Hours per week returned by the AI workflow. Easiest to measure, least important. Time saving is a vanity metric if the outputs are low quality or the business outcomes do not move.

Quality improvement. Critic pass rate, client feedback, revision requests. Medium difficulty, medium importance. Quality is a leading indicator — it predicts whether business outcomes will follow.

Business outcome. Conversion rate, revenue, citation rate, client retention. Hardest to measure, most important. This is the only category that tells you whether the AI is actually working.

Most people measure only the first. The third is the one that matters.

How to measure time saved

Time the task before AI. Time it after. The delta is your time saving.

Do this for each AI workflow separately. A content workflow and a client-response workflow have different baselines and different improvement curves. Aggregate numbers hide the signal.

Track weekly. If the time saving is not growing over the first 90 days, the prompt needs refinement. A well-tuned workflow gets faster as you iterate on the prompt. A stagnant time saving means the workflow has plateaued before it should.

A reasonable target: 40–60% time reduction per task within 90 days of deployment. If you are not approaching that, the prompt or the process is the bottleneck.
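The before-and-after arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed tool: the `TaskTiming` class, the `meets_target` helper, and the minute values are all hypothetical, and the check uses the lower bound of the 40 to 60% target above.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Minutes per task for one workflow, before and after AI."""
    workflow: str
    minutes_before: float
    minutes_after: float

    @property
    def minutes_saved(self) -> float:
        # The delta between the two timings is the time saving.
        return self.minutes_before - self.minutes_after

    @property
    def percent_reduction(self) -> float:
        return 100.0 * self.minutes_saved / self.minutes_before


def meets_target(timing: TaskTiming, low: float = 40.0) -> bool:
    """True once the reduction reaches the lower bound of the target range."""
    return timing.percent_reduction >= low


# Hypothetical numbers, one entry per workflow so the baselines stay separate.
timings = [
    TaskTiming("content", minutes_before=120, minutes_after=60),
    TaskTiming("client-response", minutes_before=20, minutes_after=15),
]

for t in timings:
    print(f"{t.workflow}: {t.minutes_saved:.0f} min saved, "
          f"{t.percent_reduction:.0f}% reduction, on target: {meets_target(t)}")
```

Keeping one record per workflow is the point: averaging the two examples together would hide that one is on target and the other is not.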

How to measure quality improvement

Track your critic pass rate: what percentage of AI outputs pass your evaluation criteria without revision?

Define your evaluation criteria before you start measuring. For a content workflow, that might be: accurate, on-brand, no factual errors, under 800 words. For a client-response workflow: correct tone, addresses the question, no hallucinated commitments. Write the criteria down. Apply them consistently.
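Once the criteria are written down, the pass rate is a simple count. The sketch below assumes each output is scored as pass or fail on every criterion and counts as a pass only if it clears all of them. The function name, the criterion keys, and the four sample results are hypothetical.

```python
def critic_pass_rate(results: list[dict[str, bool]]) -> float:
    """Percentage of outputs that pass every criterion without revision."""
    if not results:
        raise ValueError("no outputs reviewed")
    passed = sum(1 for checks in results if all(checks.values()))
    return 100.0 * passed / len(results)


# Hypothetical week of a content workflow: one dict per AI output,
# one key per written criterion.
week = [
    {"accurate": True, "on_brand": True, "no_factual_errors": True, "under_800_words": True},
    {"accurate": True, "on_brand": False, "no_factual_errors": True, "under_800_words": True},
    {"accurate": True, "on_brand": True, "no_factual_errors": True, "under_800_words": True},
    {"accurate": True, "on_brand": True, "no_factual_errors": True, "under_800_words": True},
]

print(f"Critic pass rate: {critic_pass_rate(week):.0f}%")
```

Recording the per-criterion result, not just the overall verdict, also shows which criterion fails most often.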

Track client feedback. Are revision requests decreasing week over week? A declining revision rate is a direct signal that output quality is improving.

Track your own editing time. If you are spending 45 minutes editing a 600-word AI draft, the critic pass rate is low even if you never formally measured it. Editing time is a proxy metric that requires no additional tooling.

These are leading indicators. They move before business outcomes do. If your critic pass rate is rising and revision requests are falling, business outcomes will follow — usually within 60–90 days.

How to measure business outcomes

Define the metric before deployment. If you cannot name the business outcome you expect the AI to move, you are not ready to deploy.

Content: citation rate in AI search engines, organic traffic, time on page. Citation rate is the GEO (generative engine optimization) metric: how often your content is surfaced by ChatGPT, Perplexity, or Google AI Overviews. Track it monthly.

Support: first-response time, resolution rate, customer satisfaction score. First-response time is the easiest to measure and often the fastest to move.

Sales: lead qualification accuracy, proposal win rate. Qualification accuracy requires a definition of a qualified lead before you can measure it.

Marketing: conversion rate, cost per acquisition. These are the hardest to attribute to AI specifically, but they are the most important to track.

Pick one metric per workflow. Measure it before deployment. Measure it 30, 60, and 90 days after. The comparison is the answer.
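The 30, 60, and 90 day comparison can be sketched as a small function. The name `compare_to_baseline` and the sample figures are hypothetical. The sketch reports the absolute change at each checkpoint, and the percent change only when the pre-deployment value is non-zero, since a metric that starts at zero has no meaningful percentage.

```python
from typing import Optional


def compare_to_baseline(
    baseline: float, checkpoints: dict[int, float]
) -> dict[int, tuple[float, Optional[float]]]:
    """Map each checkpoint day to (absolute change, percent change or None)."""
    report = {}
    for day in sorted(checkpoints):
        change = checkpoints[day] - baseline
        # Percent change is undefined when the baseline is zero.
        percent = 100.0 * change / baseline if baseline else None
        report[day] = (change, percent)
    return report


# Hypothetical support workflow: first-response time in hours,
# measured before deployment and at each checkpoint. Lower is better here.
report = compare_to_baseline(8.0, {30: 6.0, 60: 4.0, 90: 2.0})

for day, (change, percent) in report.items():
    print(f"Day {day}: {change:+.1f} hours ({percent:+.0f}%)")
```

Whether a negative change is good depends on the metric: it is an improvement for first-response time and a regression for conversion rate.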

The weekly 10-minute measurement review

Every week, spend 10 minutes reviewing three numbers: time saved this week, critic pass rate this week, and one business outcome metric.

Write them down. A spreadsheet is sufficient. The act of writing forces you to look at the number rather than assume it.
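If you prefer a plain file to a spreadsheet, the same log can be kept as CSV. This is a minimal sketch using Python's standard `csv` module. The column names, the `append_week` helper, and the sample values are hypothetical, and an in-memory buffer stands in for a real file so the example runs anywhere.

```python
import csv
import io
from datetime import date

FIELDS = ["week_of", "hours_saved", "critic_pass_rate", "business_metric"]


def append_week(log, week_of: date, hours_saved: float,
                critic_pass_rate: float, business_metric: float) -> None:
    """Append one week's three numbers as a CSV row to an open text file."""
    writer = csv.writer(log)
    writer.writerow([week_of.isoformat(), hours_saved, critic_pass_rate, business_metric])


# In practice, open a real file instead: open("ai_metrics.csv", "a", newline="")
log = io.StringIO()
csv.writer(log).writerow(FIELDS)  # header row, written once

# Hypothetical weekly entries.
append_week(log, date(2026, 6, 1), 4.0, 72.0, 2)
append_week(log, date(2026, 6, 8), 4.5, 75.0, 3)

print(log.getvalue())
```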

Compare to last week. You are not looking for dramatic swings — you are looking for direction. Three consecutive weeks of decline in any metric is a signal that requires investigation.

If any number is declining, investigate before it compounds. A declining critic pass rate that goes unaddressed for four weeks becomes a client retention problem. Catch it at week one.
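The three-week rule can be checked mechanically against the weekly log. The sketch below assumes a metric where higher is better, such as critic pass rate or hours saved. For a metric where lower is better, such as first-response time, flip the comparison. The function name and the sample series are hypothetical.

```python
def has_sustained_decline(values: list[float], weeks: int = 3) -> bool:
    """True if each of the last `weeks` week-over-week changes was a decrease."""
    if len(values) < weeks + 1:
        return False  # not enough history to judge
    recent = values[-(weeks + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical weekly critic pass rates, oldest first.
steady_drop = [78.0, 76.0, 75.0, 71.0]
one_bad_week = [78.0, 76.0, 77.0, 71.0]

print(has_sustained_decline(steady_drop))   # three straight decreases
print(has_sustained_decline(one_bad_week))  # the rise in week three breaks the streak
```

A check like this only flags direction. It does not say why the number moved, which is the job of the diagnosis step.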

What to do when the numbers are wrong

Diagnose before switching tools. Most AI measurement problems are prompt problems, not tool problems.

If time saved is declining: the prompt needs refinement. The task may have grown in complexity, or the prompt was never precise enough to handle edge cases. Rewrite the prompt. Retest.

If critic pass rate is declining: the evaluation criteria need updating. Your standards may have risen, or the use case has drifted from the original prompt design. Update the criteria, then update the prompt to match.

If business outcomes are not improving: the wrong task is being automated. This is the hardest diagnosis. It means the AI is working correctly on a task that does not move the metric you care about. Identify the task that actually drives the outcome and automate that instead.

Switching tools is rarely the answer. The problem is almost always in the prompt, the criteria, or the task selection.

The 90-day measurement baseline

After 90 days of measurement, you have a baseline. You know what your AI stack produces, at what quality level, with what business impact.

That baseline is the foundation for every improvement decision after it. Without it, every change is a guess. With it, every change is a test with a control.

A 90-day baseline for a content workflow might look like: 4.2 hours saved per week, 74% critic pass rate, citation rate up from 0 to 3 mentions per month in AI search. Those numbers tell you exactly where to invest next — and what a meaningful improvement looks like.

Start measuring on day one. The baseline is not something you build later. It is something you build from the beginning, so that 90 days from now you have data instead of impressions.

We send our AI measurement dashboard template — the three metrics, the weekly review format, and the 90-day baseline tracker — to Field Notes subscribers. Get it at avakata.agency/contact.html.

If you want to see how Avakata measures AI performance across a live site — and what the numbers look like after 90 days — book a discovery call. We will walk through the measurement framework and show you what a baseline looks like in practice.

Frequently asked questions

How do I know if my AI is working?
Measure three categories: time saved (hours per week returned by the AI workflow), quality improvement (critic pass rate and client feedback), and business outcomes (conversion rate, citation rate, revenue). Define the metric for each workflow before deployment. If you cannot show better business outcomes after 90 days, the implementation is wrong, not the technology.

What is a critic pass rate?
Critic pass rate is the percentage of AI outputs that pass your evaluation criteria without requiring revision. It is a leading indicator of output quality. A rising critic pass rate means your prompts are improving. A declining critic pass rate means your evaluation criteria have drifted or your prompts need refinement.

How often should I review my AI metrics?
Weekly, for 10 minutes. Review three numbers: time saved this week, critic pass rate this week, and one business outcome metric. Write them down and compare to last week. If any number is declining, investigate before it compounds. After 90 days, you have a baseline that informs every improvement decision.
