So you’ve built an LLM feature. It responds. It sounds smart. But — is it actually correct? Is it safe? Does it hold up when users type like normal humans with typos and weird phrasing?
If you don’t have answers to those questions, you’re not done yet.
That’s exactly what this tutorial will help you fix. We’re going to walk through FmEval — AWS’s open-source evaluation library — from installation all the way to running real evaluations on your model. By the end, you’ll know how to set up a proper eval pipeline, pick the right metric for your task, and find exactly which samples your model is failing on.
Let’s get into it.
What Is FmEval?
FmEval is an open-source evaluation library by AWS — part of Amazon SageMaker Clarify — that gives you standardized algorithms for testing foundation models across accuracy, toxicity, robustness, and bias.
Here’s a helpful way to think about it: your LLM is a student. FmEval is the teacher with a full grading rubric. Not just “did you answer?” but “did you answer correctly, safely, and consistently?”
The best part — it works with any model. You don’t need a SageMaker account. You don’t need to be on AWS at all. It runs locally, against any endpoint you can hit with Python.
FmEval doesn’t just check if your model responded. It checks if the model responded well, safely, and reliably.
Why You Should Care (Especially for Interviews)
Interviewers for ML/AI roles care deeply about evaluation because it’s where most projects silently fail. Anyone can hook up an LLM and get it talking. Fewer engineers can tell you whether it’s actually reliable.
If you can name specific eval algorithms, explain which metric fits which task, and walk through how you’d set up an eval pipeline — you’re already ahead of most candidates in the room.
Three things worth internalizing before we write any code:
- Evaluation is a first-class engineering concern, not an afterthought
- Different tasks need different metrics — using ROUGE to evaluate toxicity is a category error
- You can run evals without a live model at all, using pre-computed outputs — we’ll cover this shortly
Step 1 — Install FmEval
Let’s start with installation. It’s a single command:
pip install fmeval
Verify it worked:
python -c "import fmeval; print(fmeval.__version__)"
One heads-up: FmEval uses Ray under the hood for parallel processing. It installs automatically, but if you hit environment issues, install it explicitly:
pip install "ray[default]"
If your data lives in S3, you’ll also need AWS credentials:
aws configure
# or via environment variables
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=us-east-1
Local file paths work perfectly fine too — S3 is optional.
Step 2 — Understand the Three Core Concepts
Before writing any real eval code, let’s understand how FmEval is structured. It has three moving parts that work together:
┌──────────────────┐ ┌──────────────────────┐ ┌──────────────────┐
│ DataConfig │ │ │ │ EvalOutput │
│ ─────────────── │────▶│ EvalAlgorithm │────▶│ ──────────────── │
│ Describes your │ │ ──────────────── │ │ Aggregate scores │
│ dataset fields │ │ Runs the eval and │ │ + per-sample │
│ + file location │ │ applies the right │ │ results saved │
└──────────────────┘ │ metrics for task │ │ to disk │
└──────────────────────┘ └──────────────────┘
┌──────────────────┐ ▲
│ ModelRunner │ │
│ ─────────────── │──────────────┘
│ Adaptor for any │
│ model endpoint │
└──────────────────┘
Get comfortable with these three — everything else in FmEval is just configuring them. Let’s go through each one.
DataConfig — Describing Your Dataset
DataConfig is simply a map of your data. You’re telling FmEval: “here’s my file, this field is the question, this field is the expected answer.”
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
data_config = DataConfig(
dataset_name="my_qa_dataset", # just a label — doesn't affect scoring
dataset_uri="data/qa_sample.jsonl", # local path or s3://bucket/file.jsonl
dataset_mime_type=MIME_TYPE_JSONLINES, # JSON Lines is the standard format
model_input_location="question", # the field FmEval sends to your model
target_output_location="answer", # the ground truth / expected answer field
)
Your JSONL file should look like this — one JSON object per line:
{"question": "What is the capital of France?", "answer": "Paris"}
{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}
{"question": "What year did WWII end?", "answer": "1945"}
Pro tip: There’s a field most tutorials don’t mention — model_output_location. If you already have your model’s outputs saved in the dataset, you can skip running a live model entirely:
data_config = DataConfig(
dataset_name="precomputed_eval",
dataset_uri="data/outputs.jsonl",
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location="question",
target_output_location="ground_truth",
model_output_location="model_answer", # your pre-generated outputs go here
)
# No ModelRunner needed — FmEval reads outputs straight from the dataset
results = eval_algo.evaluate(model=None, dataset_config=data_config)
This is more useful than it sounds. Generating outputs is expensive. Evaluating is cheap. Once you have outputs logged, you can re-run evals as many times as you want — no API calls, no extra cost.
ModelRunner — Connecting Your Model
FmEval doesn’t know or care where your model lives. ModelRunner is the adaptor that lets it talk to whatever you’re running.
If you’re on SageMaker JumpStart:
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
model_runner = JumpStartModelRunner(
endpoint_name="my-llama-endpoint",
model_id="meta-textgeneration-llama-2-7b",
output="[0].generation.text", # JMESPath to extract the output text
content_template='{"inputs": "$prompt"}', # how to format the prompt for your model
)
For any custom HTTP endpoint — which is what most real-world setups look like:
from fmeval.model_runners.model_runner import ModelRunner
class MyModelRunner(ModelRunner):
def __init__(self):
super().__init__(
content_template='{"prompt": $prompt}',
output="text"
)
def predict(self, prompt: str) -> tuple:
import requests
response = requests.post(
"http://localhost:8080/generate",
json={"prompt": prompt},
timeout=30
)
output_text = response.json()["text"]
# Always return a tuple: (output_text, log_probability)
# Use None for log_prob if your model doesn't expose it — most eval algorithms don't need it
return output_text, None
model_runner = MyModelRunner()
The predict method always returns a tuple of (output_text, log_probability). Returning None as the second value is completely fine.
EvalAlgorithm — Running the Evaluation
This is the part that actually runs the evaluation. You pick the task type, and FmEval applies the right metrics automatically.
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
eval_algo = QAAccuracy(QAAccuracyConfig())
results = eval_algo.evaluate(
model=model_runner,
dataset_config=data_config,
save=True # saves per-sample scores to /tmp/ — always do this
)
for result in results:
print(f"\nAlgorithm: {result.eval_name}")
for score in result.dataset_scores:
print(f" {score.name}: {score.value:.4f}")
Now that you understand the three building blocks, let’s look at all the algorithms available to you.
Step 3 — Pick the Right Eval Algorithm
This is the part most tutorials skip. FmEval has seven different evaluation algorithms, each designed for a specific type of task. Here’s how to choose:
What are you evaluating?
│
┌──────────┬───────────────┼───────────────┬──────────┬──────────┐
│ │ │ │ │ │
Q&A / Summar- Text Safety Typo / Bias
Factual ization Class. Check Noise Check
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
QAAccuracy Summar- Classif- Toxicity Semantic Prompt
ization ication Robust- Stereo-
Accuracy Accuracy ness typing
Let’s go through each one.
QAAccuracy — For Question Answering Tasks
Use this when your model answers questions that have a known correct answer.
It computes four metrics automatically: Exact Match, Quasi-Exact Match, F1, and Quasi-F1. The “quasi” versions are more lenient — they strip articles, punctuation, and whitespace before comparing. So “The Eiffel Tower.” and “eiffel tower” both pass quasi-exact match, even though they’d fail strict exact match.
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
eval_algo = QAAccuracy(QAAccuracyConfig())
results = eval_algo.evaluate(model=model_runner, dataset_config=data_config)
SummarizationAccuracy — For Summarization Tasks
Use this when your model summarizes documents and you have reference summaries to compare against.
It automatically computes ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore — all four in one shot.
from fmeval.eval_algorithms.summarization_accuracy import (
SummarizationAccuracy,
SummarizationAccuracyConfig
)
# For summarization, model_input is the source document and target_output is the reference summary
data_config_summ = DataConfig(
dataset_name="news_summaries",
dataset_uri="news_data.jsonl",
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location="article", # the full document to summarize
target_output_location="summary", # the reference summary to compare against
)
eval_algo = SummarizationAccuracy(SummarizationAccuracyConfig())
results = eval_algo.evaluate(model=model_runner, dataset_config=data_config_summ)
ClassificationAccuracy — For Classification Tasks
Use this for sentiment analysis, intent detection, topic classification — anything where your model outputs a categorical label.
It computes per-class precision, recall, F1, and balanced accuracy.
from fmeval.eval_algorithms.classification_accuracy import (
ClassificationAccuracy,
ClassificationAccuracyConfig
)
data_config_clf = DataConfig(
dataset_name="sentiment_dataset",
dataset_uri="sentiment.jsonl",
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location="review", # the text to classify
target_output_location="sentiment", # the correct label
)
eval_algo = ClassificationAccuracy(
ClassificationAccuracyConfig(
valid_labels=["positive", "negative", "neutral"] # list all expected output classes
)
)
results = eval_algo.evaluate(model=model_runner, dataset_config=data_config_clf)
Toxicity — For Safety Checks
This one gets skipped until something goes wrong in production. Don’t be that team.
Toxicity checks for six categories: toxicity, severe toxicity, obscene content, threats, insults, and identity-based attacks. All scores are 0–1 where a higher score means more toxic content. You want everything near 0.
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
toxicity_eval = Toxicity(ToxicityConfig())
results = toxicity_eval.evaluate(
model=model_runner,
dataset_config=data_config,
)
for result in results:
for score in result.dataset_scores:
print(f"{score.name}: {score.value:.4f}")
# All scores should be close to 0.0 for a safe, customer-facing product
If you’re building a customer-facing LLM product and haven’t run toxicity evals, you’re shipping a liability. Five lines of code. No excuse to skip it.
FactualKnowledge — Testing What Your Model Knows
This tests whether the base model can correctly answer factual questions from its own training — without any retrieval context. It measures what the model “knows” as a standalone capability.
The target_output_delimiter is handy when a question has more than one valid answer:
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="|"))
# Your data can now handle multiple valid answers like this:
# {"question": "Who wrote Hamlet?", "answer": "Shakespeare|William Shakespeare"}
# FmEval checks if the model output matches ANY of the pipe-delimited options
SemanticRobustness — Testing Noise Tolerance
Here’s what robustness actually means in practice: if a user types “whats the captial of france” with a typo, does your model still get it right? Or does it fall apart?
SemanticRobustness tests this by applying small perturbations to your inputs — typos, random capitalization, whitespace changes — and measures how much your model’s accuracy drops as a result.
A delta score near 0 means the model handles noise well. A delta above 0.2 is a real problem — it means your model will fail normal users who don’t type perfectly.
from fmeval.eval_algorithms.semantic_robustness import (
SemanticRobustness,
SemanticRobustnessConfig,
BUTTER_FINGER, # simulates random typos
RANDOM_UPPER_CASE, # random capitalization changes
WHITESPACE_ADD_REMOVE, # adds or removes whitespace
)
eval_algo = SemanticRobustness(
SemanticRobustnessConfig(
perturbation_type=BUTTER_FINGER,
num_perturbations=5, # number of perturbed versions to test per input
)
)
results = eval_algo.evaluate(model=model_runner, dataset_config=data_config)
# Outputs: delta_exact_match and delta_f1
# Near 0 = robust model. Above 0.2 = brittle model.
PromptStereotyping — Testing for Bias
PromptStereotyping checks whether your model consistently favors stereotyped sentence completions over anti-stereotyped ones.
For example: give the model two sentences — “The nurse said she was tired” and “The nurse said he was tired” — and check which one it assigns higher probability to. If it consistently prefers the stereotyped version, that’s measurable bias.
A score near 0.5 means the model is roughly unbiased. Closer to 1.0 means it’s leaning toward stereotypes.
from fmeval.eval_algorithms.prompt_stereotyping import PromptStereotyping
eval_algo = PromptStereotyping()
# This one uses its own built-in benchmark datasets — no custom DataConfig needed
results = eval_algo.evaluate(model=model_runner)
for result in results:
for score in result.dataset_scores:
print(f"{score.name}: {score.value:.4f}")
# is_biased close to 0.5 = good. Close to 1.0 = biased.
Step 4 — Choose the Right Metric
Now that you know which algorithm to use, let’s talk about the individual metrics they produce and when to use each one.
Your Output Type Metric to Reach For
───────────────────────────────── ──────────────────────────────────────
Fixed short answer Exact Match (strict — "Paris" == "paris")
e.g. "Paris", "Yes", "positive" Quasi-Exact (lenient — strips articles)
Open-ended Q&A F1 Score
partial credit matters (word-level overlap, gives partial credit)
Summarization ROUGE-1 / ROUGE-2 / ROUGE-L
(phrase overlap, ROUGE-L is usually best)
Free-form generation BERTScore
paraphrase is acceptable (meaning similarity, not word matching)
Exact Match
Binary check — does the model’s output exactly equal the ground truth, case-insensitive?
“Paris” matches “paris” ✓ — but “The capital is Paris” does not match “Paris” ✗. Use this for classification labels, yes/no answers, or short factual answers where phrasing is fixed and unambiguous.
Quasi-Exact Match
Same as Exact Match but strips articles (“the”, “a”, “an”), punctuation, and whitespace before comparing. So “The Eiffel Tower.” and “eiffel tower” both normalize to the same string and match.
Use this for factual Q&A where minor phrasing differences shouldn’t count against the model.
F1 Score
Measures word-level overlap between your model’s answer and the correct answer — and gives partial credit.
Here’s the intuition: the correct answer has 10 words and your model got 7 of them right. F1 rewards that. It doesn’t require a perfect match. Use this for open-ended Q&A or extractive tasks.
ROUGE-1, ROUGE-2, ROUGE-L
ROUGE measures n-gram overlap between generated and reference text.
- ROUGE-1 → individual word overlap
- ROUGE-2 → two-word phrase overlap (better for capturing coherence)
- ROUGE-L → longest common subsequence (checks if the overall structure matches)
ROUGE-L is usually the most informative. Use the ROUGE family for summarization tasks.
BERTScore
Instead of matching exact words, BERTScore checks if the meaning matches using BERT embeddings. “Car” and “automobile” score high even though they’re completely different strings.
Think of it this way: F1 is a word checker. BERTScore is a meaning checker. Use it for generation tasks where paraphrasing is acceptable — and it usually is.
Quick interview cheat sheet: Exact Match for strict factual, F1 for extractive Q&A, ROUGE for summarization, BERTScore for generation where paraphrase is fine.
Step 5 — Use evaluate_sample for Fast Iteration
Here’s a trick that’ll save you a ton of time during prompt engineering. Instead of running your full dataset every time you tweak a prompt, use evaluate_sample to test a single prediction instantly:
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
eval_algo = QAAccuracy(QAAccuracyConfig())
# No DataConfig, no ModelRunner — just pass a target and an output directly
scores = eval_algo.evaluate_sample(
target_output="Paris",
model_output="The capital of France is Paris."
)
for score in scores:
print(f"{score.name}: {score.value}")
# exact_match_score: 0.0 <- strict match fails: "Paris" != "The capital of France is Paris."
# quasi_f1_score: 0.4 <- partial credit for word overlap
# f1_score: 0.4
Run this after every prompt change. It’s the fastest feedback loop FmEval offers and takes zero setup.
Step 6 — Put It All Together
Now let’s wire everything up in a complete, working example. Here’s the full eval cycle you want to be in:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Run Eval │────▶│ Check Scores │────▶│ Issues? │
└─────────────┘ └──────────────┘ └──────┬──────┘
▲ │
│ ┌─────────┴──────────┐
│ Yes No
│ │ │
│ ▼ ▼
│ ┌─────────────────┐ ┌──────────────┐
│ │ Inspect Failing │ │ Ship! │
│ │ Samples │ └──────────────┘
│ └────────┬────────┘
│ │
│ ┌────────▼────────┐
└──────────────────────│ Fix: Prompt / │
│ Model / Data │
└─────────────────┘
import json
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.model_runners.model_runner import ModelRunner
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
# Step 1: Set up your dataset config
data_config = DataConfig(
dataset_name="qa_eval",
dataset_uri="qa_test.jsonl",
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location="question",
target_output_location="answer",
)
# Step 2: Create a ModelRunner pointing at a local Ollama instance
class MyModelRunner(ModelRunner):
def __init__(self):
super().__init__('{"prompt": $prompt}', output="response")
def predict(self, prompt: str) -> tuple:
import requests
r = requests.post(
"http://localhost:11434/api/generate",
json={"model": "llama3", "prompt": prompt, "stream": False}
)
return r.json()["response"], None
model_runner = MyModelRunner()
# Step 3: Run the evaluation
eval_algo = QAAccuracy(QAAccuracyConfig())
results = eval_algo.evaluate(
model=model_runner,
dataset_config=data_config,
save=True # saves per-sample results to /tmp/ — always use this
)
# Step 4: Print aggregate scores
print("=== Dataset Scores ===")
for eval_output in results:
for score in eval_output.dataset_scores:
print(f" {score.name:30s} {score.value:.4f}")
# Step 5: Find exactly which samples are failing
failing = []
with open("/tmp/eval_results.jsonl") as f:
for line in f:
sample = json.loads(line)
if sample.get("f1_score", 1.0) < 0.3:
failing.append(sample)
print(f"\n=== {len(failing)} samples with F1 < 0.3 ===")
for s in failing[:5]:
print(f" Q: {s['question']}")
print(f" Expected: {s['answer']}")
print(f" Got: {s['model_output']}\n")
Aggregate scores tell you there’s a problem. Per-sample scores tell you exactly which questions are failing and why. Always use save=True — you’ll thank yourself later.
Bonus — Running Multiple Evals in One Pass
You don’t need to run algorithms one at a time. Run them all together and get a full picture in one shot:
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
from fmeval.eval_algorithms.semantic_robustness import SemanticRobustness, SemanticRobustnessConfig, BUTTER_FINGER
algorithms = [
QAAccuracy(QAAccuracyConfig()),
Toxicity(ToxicityConfig()),
SemanticRobustness(SemanticRobustnessConfig(perturbation_type=BUTTER_FINGER, num_perturbations=3)),
]
all_results = {}
for algo in algorithms:
results = algo.evaluate(model=model_runner, dataset_config=data_config)
for eval_output in results:
for score in eval_output.dataset_scores:
all_results[f"{eval_output.eval_name}/{score.name}"] = round(score.value, 4)
for metric, value in all_results.items():
print(f"{metric}: {value}")
What FmEval Can’t Do
Before we wrap up, it’s worth knowing where FmEval’s limits are so you don’t lean on it for the wrong things:
- It doesn’t evaluate RAG pipelines end-to-end — use Ragas for that
- It doesn’t handle multi-turn conversation eval out of the box
- It doesn’t replace human evaluation for subjective, high-stakes outputs
- SemanticRobustness only covers surface-level noise, not full semantic paraphrases
- It tells you that your model is failing — not always why
FmEval gives you reproducible signals. Your job is to act on them.
What to Take Away From This
If you build one habit from this tutorial, let it be this: define your eval before you tune your model. The metric you optimize for shapes everything downstream.
Four practical takeaways:
-
Use
evaluate_sampleduring prompt engineering — instant feedback, zero setup. -
Log your model outputs and run offline evals — generate once, evaluate many times without burning API budget.
-
Run Toxicity and SemanticRobustness before any production launch — both are five-line evals that can save you from a very bad day.
-
Match the algorithm to the task — QAAccuracy for Q&A, SummarizationAccuracy for summaries, ClassificationAccuracy for labels. Using the wrong one gives you false confidence.
The difference between a prototype and a production-grade LLM feature is usually not the model.
It’s the eval pipeline. Now you know how to build one.