Metrics in Trusys.ai are evaluation checks you can apply to your model outputs. They help you validate whether a response meets your expectations by comparing the actual output with the expected output under specific rules. Think of metrics as assertions for AI responses: they tell you whether the system is behaving correctly according to the criteria you define.

Why Metrics Are Important
  • Consistency – Ensure your model responses align with requirements across multiple test cases.
  • Reliability – Catch regressions early by enforcing rules on generated outputs.
  • Automation – Run evaluations at scale without manually inspecting every response.
  • Comparability – Use metrics to measure improvements between different model versions.

How Metrics Work

When you run a test in Trusys, you define one or more metrics per test case. The system evaluates the model output against these rules. If the metric passes, the test is marked successful; if not, it fails. Each metric comes with:
  • Name: The type of check (e.g., equals, contains).
  • Description: What the check does.
  • Value: The expected reference value(s) or grading criteria for the check.

Available Metrics

Matching & Validation

Text Matching

These metrics check for textual similarity or presence.

equals

Checks if the output exactly matches the expected value.
{ "metric": "equals", "value": "15" }
✅ Output: 15 → Pass

contains

Checks if the output contains the expected substring.
{ "metric": "contains", "value": "hello world" }
✅ Output: Hi, hello world! → Pass

icontains

Case-insensitive substring check.
{ "metric": "icontains", "value": "Hello" }
✅ Output: hello everyone → Pass

contains-all

Checks if output contains all provided strings.
{ "metric": "contains-all", "value": ["apple", "banana"] }
✅ Output: apple and banana smoothie → Pass

contains-any

Checks if output contains at least one provided string.
{ "metric": "contains-any", "value": ["success", "completed"] }
✅ Output: The task was completed → Pass

Negated Assertions

The inverse of text-matching metrics.

not-equals

Passes if output does not equal expected value.
{ "metric": "not-equals", "value": "error" }
✅ Output: success → Pass

not-contains

Fails if output contains forbidden text.
{ "metric": "not-contains", "value": "forbidden" }
✅ Output: allowed content → Pass

Numeric Validation

equals-number

Checks if numeric output matches exactly.
{ "metric": "equals-number", "value": 42 }
✅ Output: 42 → Pass

greater-than

Ensures output number > expected.
{ "metric": "greater-than", "value": 10 }
✅ Output: 15 → Pass

less-than

Ensures output number < expected.
{ "metric": "less-than", "value": 100 }
✅ Output: 42 → Pass

Format Validation

Checks structural correctness of outputs.

is-json

Validates that the output is valid JSON.
{ "metric": "is-json" }
✅ Output: { "a": 1 } → Pass

contains-json

Checks if a JSON structure is embedded inside the output.
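For example, assuming it takes the same value-less form as is-json:
{ "metric": "contains-json" }
✅ Output: The result is {"a": 1} → Pass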

is-xml, contains-xml, is-sql

Validate whether the output conforms to XML or SQL formats, or contains embedded XML.
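For example, assuming these follow the same pattern as the other format checks:
{ "metric": "is-xml" }
✅ Output: <note><to>Alice</to></note> → Pass
{ "metric": "is-sql" }
✅ Output: SELECT id FROM users; → Pass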

JSON / Structured Validation

json-equals

Validates JSON equality.
{ "metric": "json-equals", "value": {"status": "ok"} }
✅ Output: { "status": "ok" } → Pass

array-length

Checks array length.
{ "metric": "array-length", "value": 3 }
✅ Output: [1,2,3] → Pass

LLM as a Judge Metrics

similar

Checks semantic similarity with threshold.
{ "metric": "similar", "value": "Paris is capital of France", "threshold": 0.8 }

llm-rubric

Grades output based on custom rubric via an LLM.
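An illustrative configuration, assuming the rubric text is supplied as the value:
{ "metric": "llm-rubric", "value": "The response must be polite and mention the refund policy" }      // value format assumed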

factuality

Assesses factual correctness relative to ground truth.
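An illustrative configuration, assuming the ground-truth statement is supplied as the value:
{ "metric": "factuality", "value": "The Eiffel Tower is in Paris, France" }      // value format assumed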

g-eval

Generalized evaluation using a language model.
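An illustrative configuration, assuming the grading criteria are passed as the value:
{ "metric": "g-eval", "value": "Grade the answer on clarity, completeness, and correctness" }      // value format assumed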

Conversational Evaluation

conversational-g-eval

Checks conversation quality holistically.
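An illustrative configuration, assuming it accepts a text rubric as the value in the same way as g-eval (not an exact schema):
{ "metric": "conversational-g-eval", "value": "Assess whether the assistant stays helpful and on-topic across turns" }      // value format assumed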

knowledge-retention

Ensures model retains context across conversation.

role-adherence

Validates if assistant maintains defined role.

conversation-relevancy

Checks if responses are contextually relevant.

RAG Metrics

RAG metrics evaluate different aspects of the retrieval-augmented generation pipeline, from the relevance of retrieved context to the faithfulness of generated answers.

context-faithfulness

Purpose: Ensures the generated answer doesn’t hallucinate information beyond what’s provided in the retrieved context.
Use Case: Critical for applications where factual accuracy is paramount, such as medical, legal, or financial systems.
Required Fields
  • query: The user’s original question
  • context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.
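A minimal metric entry sketch; the query and retrieved context are supplied by your application through the trusys field rather than in the metric itself:
{ "metric": "context-faithfulness" }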

context-recall

Purpose: Measures how much of the retrieved context was actually utilized in generating the answer.
Use Case: Validates that your retrieval system is finding relevant, useful information rather than just tangentially related content.
Required Fields
  • value: The expected answer or ground truth
  • context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.
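An illustrative entry, assuming the expected ground-truth answer is passed as the value while the retrieved context comes from the trusys field:
{ "metric": "context-recall", "value": "Paris is the capital of France" }      // value format assumed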

context-relevance

Purpose: Evaluates the quality of retrieved context by measuring how much of it is actually needed to answer the query.
Use Case: Helps optimize retrieval systems by identifying when too much irrelevant information is being retrieved.
Required Fields
  • query: The user’s original question
  • context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.

Agentic Metrics

Agentic metrics evaluate the performance of AI agents that can use tools and complete complex tasks. These metrics assess both the technical correctness of tool usage and the overall task completion effectiveness.

tool-correctness

Purpose: Verifies whether an agent correctly invoked tools with appropriate parameters and usage patterns.
Use Case: Essential for ensuring agents can reliably interact with external systems, APIs, and services.
Required Data
  • expected tools called: List of tools expected to be called. Tool invocation details must be provided in the trusys.tool_calls array field (see Implementation Requirements below).
Example Scenario
{
  "trusys": {
    "tool_calls": [{
      "name": "weather_api",
      "description": "Get current weather for a location",
      "arguments": {"city": "San Francisco", "units": "celsius"},
      "output": ["Current weather in San Francisco: 18°C, partly cloudy"]
    }]
  }
}
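The matching metric entry might list the expected tool names as the value (illustrative; confirm the exact field layout for your setup):
{ "metric": "tool-correctness", "value": ["weather_api"] }      // value format assumed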

task-completion

Purpose: Assesses whether an agent successfully completed its assigned task, regardless of the specific tools used.
Use Case: Measures end-to-end agent effectiveness for complex, multi-step workflows.
Tool invocation details must be provided in the trusys.tool_calls array field (see Implementation Requirements below).
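A minimal metric entry sketch; evidence of what the agent actually did is read from the trusys.tool_calls data supplied by your application rather than from the metric configuration itself:
{ "metric": "task-completion" }      // exact configuration fields may vary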

argument-correctness

Purpose: Evaluates whether an agent provided correct and appropriate arguments when invoking tools, ensuring proper parameter usage and data types.
Use Case: Critical for validating that agents can correctly interact with APIs and tools by providing valid, well-formed parameters that match the expected schema.
Tool invocation details must be provided in the trusys.tool_calls array field (see Implementation Requirements below).

Implementation Requirements for RAG and Agentic Metrics

To enable tool-related metrics (tool-correctness, task-completion, argument-correctness) and context-related metrics (context-faithfulness, context-relevance, context-recall), your application must include additional metadata in its responses.
For JSON Responses: If your application already returns JSON, add a trusys field:
"trusys": {                                                         // Required
    "tool_calls": [
        {
            "name": "weather_tool",                                                     // Required 
            "description": "Provides weather information for a given location.",        //Required for Task completion & Argument Correctness
            "arguments": {"city": "New York", "state": "US"},                           //Required for Task completion & Argument Correctness
            "output": ["The current weather in New York is sunny with a temperature of 75°F."]      // Required for Task completion
        }
    ],
    "query": "Which cities are near new york?",                                        // Required for context-faithfulness and context-relevance
    "context": "New jersey Connecticut Philadelphia."                                  // Required for context-recall, context-faithfulness and context-relevance
}
For Non-JSON Responses: If your application returns plain text or other formats, wrap the response in JSON:
{
    "output": "<original_output>",                                                                   // Required
    "trusys": {
        "tool_calls": [
            {
                "name": "weather_tool",                                                              // Required
                "description": "Provides weather information for a given location.",                 // Required for task-completion & argument-correctness
                "arguments": {"city": "New York", "state": "US"},                                    // Required for task-completion & argument-correctness
                "output": ["The current weather in New York is sunny with a temperature of 75°F."]   // Required for task-completion
            }
        ],
        "query": "Which cities are near New York?",                                                  // Required for context-faithfulness and context-relevance
        "context": "New Jersey, Connecticut, Philadelphia."                                          // Required for context-recall, context-faithfulness and context-relevance
    }
}

Multimodal Metrics

image-helpfulness

Evaluates helpfulness of image-based response.

multimodal-contextual-relevancy

Checks relevancy across text + image inputs.

multimodal-contextual-recall

Ensures multimodal outputs recall relevant context.

Audio Analysis

audio-sentiment

Detects sentiment in audio.

audio-call-category

Classifies type of call.

audio-language

Detects spoken language.

audio-interruption-handling

Checks ability to handle interruptions gracefully.

audio-false-intent-identification

Detects false or misleading intent in audio input.

Code Evaluation

code-execution

Checks if output code executes without errors.

code-language

Validates if code is written in expected language.
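An illustrative configuration, assuming the expected language name is supplied as the value:
{ "metric": "code-language", "value": "python" }      // value format assumed
✅ Output: def add(a, b): return a + b → Pass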

contains-code-language

Ensures output contains code of a specific language.

code-readability

Grades code readability.

code-correctness

Checks correctness of code relative to spec.

Content Moderation

moderation

Flags inappropriate, unsafe, or disallowed content.

Video Metrics

Trusys provides a suite of video evaluation metrics to assess the quality, accuracy, and consistency of AI-generated or model-produced videos. These metrics help benchmark model performance and identify visual, temporal, and factual weaknesses.

video-visual-quality

Evaluates the visual fidelity of the generated video, including aspects like resolution, clarity, color accuracy, and the absence of artifacts such as blurriness, distortion, or pixelation.

text-to-video

Measures semantic alignment between the input text prompt and the generated video content.

video-temporal-consistency

Assesses frame-to-frame continuity to ensure smooth motion, natural transitions, and absence of flickering or scene instability.

video-factual-consistency

Evaluates the factual and logical accuracy of visual content — ensuring that scenes adhere to real-world physics, object integrity, and causal coherence.

video-g-eval

An LLM-based holistic evaluation metric that assesses overall video quality across customizable dimensions such as creativity, realism, alignment, and storytelling.
Evaluation Requirements:
  • Generated video.
  • Grading Criteria (mandatory): Text-based rubric defining what aspects the model should evaluate (e.g., "creativity," "consistency," "prompt alignment").
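For example, the grading criteria could be supplied as a text rubric (illustrative only, not an exact schema):
{ "metric": "video-g-eval", "value": "Evaluate creativity, realism, and alignment with the prompt" }      // value format assumed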

General Metrics (Performance)

bleu

Measures n-gram overlap with reference text.
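An illustrative configuration, assuming a reference text as the value and a minimum acceptable score as the threshold (exact fields may vary):
{ "metric": "bleu", "value": "The cat sat on the mat", "threshold": 0.5 }      // field semantics assumed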

rouge-n

Evaluates recall of n-grams against reference.

latency

Records response time.

perplexity

Measures fluency and predictability of text.

Example Test Case

{
  "prompt": "What is 5 + 10?",
  "value": {
    "metrics": [
      { "metric": "equals", "expcted_output": "15" },
      { "metric": "greater-than", "expcted_output": 10 },
      { "metric": "latency", "threshold": 2000 }
    ]
  }
}
✅ Output: 15 in under 2 seconds → Passes all metrics
❌ Output: Sixteen, or a slow response → Fail
This page covers all metric groups available in Trusys.ai for validating and benchmarking AI outputs.