- ✅ Consistency – Ensure your model responses align with requirements across multiple test cases.
- 🛡 Reliability – Catch regressions early by enforcing rules on generated outputs.
- ⚡ Automation – Run evaluations at scale without manually inspecting every response.
- 📈 Comparability – Use metrics to measure improvements between different model versions.
How Metrics Work
When you run a test in Trusys, you define one or more metrics per test case. The system evaluates the model output against these rules. If the metric passes, the test is marked successful; if not, it fails. Each metric comes with the following (a minimal configuration sketch follows the list):
- Name: The type of check (e.g., equals, contains).
- Description: What the check does.
- Value: The expected reference value(s), grading criteria, or other parameters the check needs.
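As an illustration only, a metric definition could be written as a small JSON object like the one below. The field names (name, description, value) simply mirror the three parts listed above; the exact schema Trusys expects may differ.

```json
{
  "name": "contains",
  "description": "Checks if the output contains the expected substring.",
  "value": "hello"
}
```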
Available Metrics
Matching & Validation
Text Matching
These metrics check for textual similarity or presence.

| Metric | Description | Example |
| --- | --- | --- |
| equals | Checks if the output exactly matches the expected value. | 15 → Pass |
| contains | Checks if the output contains the expected substring. | Hi, hello world! → Pass |
| icontains | Case-insensitive substring check. | hello everyone → Pass |
| contains-all | Checks if output contains all provided strings. | apple and banana smoothie → Pass |
| contains-any | Checks if output contains at least one provided string. | The task was completed → Pass |
Negated Assertions
The inverse of text-matching metrics.

| Metric | Description | Example |
| --- | --- | --- |
| not-equals | Passes if output does not equal expected value. | success → Pass |
| not-contains | Fails if output contains forbidden text. | allowed content → Pass |
Numeric Validation

| Metric | Description | Example |
| --- | --- | --- |
| equals-number | Checks if numeric output matches exactly. | 42 → Pass |
| greater-than | Ensures output number > expected. | 15 → Pass |
| less-than | Ensures output number < expected. | 42 → Pass |
Format Validation
Checks structural correctness of outputs.

| Metric | Description | Example |
| --- | --- | --- |
| is-json | Validates that the output is valid JSON. | { "a": 1 } → Pass |
| contains-json | Checks if a JSON structure is embedded inside the output. | |
| is-xml, contains-xml, is-sql | Validate that the output conforms to the XML or SQL format. | |
JSON / Structured Validation

| Metric | Description | Example |
| --- | --- | --- |
| json-equals | Validates JSON equality. | { "status": "ok" } → Pass |
| array-length | Checks array length. | [1,2,3] → Pass |
LLM as a Judge Metrics
- similar – Checks semantic similarity with threshold.
- llm-rubric – Grades output based on a custom rubric via an LLM (a rubric sketch follows this list).
- factuality – Assesses factual correctness relative to ground truth.
- g-eval – Generalized evaluation using a language model.
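For example, an llm-rubric check might carry its grading criteria as free text in the metric's value. The snippet below reuses the same hypothetical schema as the earlier sketch and is illustrative only, not Trusys' confirmed configuration format.

```json
{
  "name": "llm-rubric",
  "value": "The answer must be polite, cite the provided source, and stay under 100 words."
}
```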
Conversational Evaluation
- conversational-g-eval – Checks conversation quality holistically.
- knowledge-retention – Ensures the model retains context across the conversation.
- role-adherence – Validates whether the assistant maintains its defined role.
- conversation-relevancy – Checks if responses are contextually relevant.
RAG Metrics
RAG metrics evaluate different aspects of the retrieval-augmented generation pipeline, from the relevance of retrieved context to the faithfulness of generated answers.

context-faithfulness
Purpose: Ensures the generated answer doesn't hallucinate information beyond what's provided in the retrieved context.
Use Case: Critical for applications where factual accuracy is paramount, such as medical, legal, or financial systems.
Required Fields:
- query: The user's original question
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.

context-recall
Purpose: Measures how much of the retrieved context was actually utilized in generating the answer.
Use Case: Validates that your retrieval system is finding relevant, useful information rather than just tangentially related content.
Required Fields:
- value: The expected answer or ground truth
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.

context-relevance
Purpose: Evaluates the quality of retrieved context by measuring how much of it is actually needed to answer the query.
Use Case: Helps optimize retrieval systems by identifying when too much irrelevant information is being retrieved.
Required Fields:
- query: The user's original question
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.
Agentic Metrics
Agentic metrics evaluate the performance of AI agents that can use tools and complete complex tasks. These metrics assess both the technical correctness of tool usage and the overall task completion effectiveness.

tool-correctness
Purpose: Verifies whether an agent correctly invoked tools with appropriate parameters and usage patterns.
Use Case: Essential for ensuring agents can reliably interact with external systems, APIs, and services.
Required Data:
- expected tools called: List of tools expected to be called
- Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.

task-completion
Purpose: Assesses whether an agent successfully completed its assigned task, regardless of the specific tools used.
Use Case: Measures end-to-end agent effectiveness for complex, multi-step workflows.
Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.

argument-correctness
Purpose: Evaluates whether an agent provided correct and appropriate arguments when invoking tools, ensuring proper parameter usage and data types.
Use Case: Critical for validating that agents can correctly interact with APIs and tools by providing valid, well-formed parameters that match the expected schema.
Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.
Implementation Requirements for RAG and Agentic Metrics
To enable tool-related metrics (tool-correctness, task-completion, argument-correctness) and context-related metrics (context-faithfulness, context-relevance, context-recall), your application must include additional metadata in its responses.

For JSON Responses:
If your application already returns JSON, add a trusys field, as in the sketch below.
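The snippet below is only a sketch of such a response: the trusys object carries the context text used by the RAG metrics and the tool_calls array used by the agentic metrics, as described above, while the surrounding payload, the tool name, and the argument structure are hypothetical.

```json
{
  "answer": "Your refund for order 12345 has been issued.",
  "trusys": {
    "context": "Refund policy: approved refunds are issued to the original payment method within 5 business days.",
    "tool_calls": [
      {
        "name": "issue_refund",
        "arguments": { "order_id": "12345" }
      }
    ]
  }
}
```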
Multimodal Metrics
- image-helpfulness – Evaluates helpfulness of image-based response.
- multimodal-contextual-relevancy – Checks relevancy across text + image inputs.
- multimodal-contextual-recall – Ensures multimodal outputs recall relevant context.
Audio Analysis
- audio-sentiment – Detects sentiment in audio.
- audio-call-category – Classifies type of call.
- audio-language – Detects spoken language.
- audio-interruption-handling – Checks ability to handle interruptions gracefully.
- audio-false-intent-identification – Detects false or misleading intent in audio input.
Code Evaluation
- code-execution – Checks if output code executes without errors.
- code-language – Validates if code is written in the expected language.
- contains-code-language – Ensures output contains code of a specific language.
- code-readability – Grades code readability.
- code-correctness – Checks correctness of code relative to spec.
Content Moderation
- moderation – Flags inappropriate, unsafe, or disallowed content.
Video Metrics
Trusys provides a suite of video evaluation metrics to assess the quality, accuracy, and consistency of AI-generated or model-produced videos. These metrics help benchmark model performance and identify visual, temporal, and factual weaknesses.

- video-visual-quality – Evaluates the visual fidelity of the generated video, including aspects like resolution, clarity, color accuracy, and the absence of artifacts such as blurriness, distortion, or pixelation.
- text-to-video – Measures semantic alignment between the input text prompt and the generated video content.
- video-temporal-consistency – Assesses frame-to-frame continuity to ensure smooth motion, natural transitions, and absence of flickering or scene instability.
- video-factual-consistency – Evaluates the factual and logical accuracy of visual content, ensuring that scenes adhere to real-world physics, object integrity, and causal coherence.
- video-g-eval – An LLM-based holistic evaluation metric that assesses overall video quality across customizable dimensions such as creativity, realism, alignment, and storytelling.

Evaluation Requirements (for video-g-eval):
- Generated video.
- Grading Criteria (mandatory): Text-based rubric defining what aspects the model should evaluate (e.g., "creativity," "consistency," "prompt alignment").
General Metrics (Performance)
- bleu – Measures n-gram overlap with reference text.
- rouge-n – Evaluates recall of n-grams against reference.
- latency – Records response time.
- perplexity – Measures fluency and predictability of text.
Example Test Case
Consider a test case that expects the numeric answer 15 with a response time under 2 seconds (a configuration sketch follows the list):
- ✅ Output: 15 in under 2 seconds → Passes all metrics
- ❌ Output: Sixteen, or a slow response → Fail
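Using the same hypothetical configuration schema as the earlier sketches, such a test case might combine a numeric check with a latency threshold. The millisecond value used for latency is an assumption, not Trusys' confirmed format.

```json
{
  "metrics": [
    { "name": "equals-number", "value": 15 },
    { "name": "latency", "value": 2000 }
  ]
}
```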
This page covers all metric groups available in Trusys.ai for validating and benchmarking AI outputs.