Metrics in Trusys.ai are evaluation checks you can apply to your model outputs. They help you validate whether a response meets your expectations by comparing the actual output with the expected output under specific rules. Think of metrics as assertions for AI responses: they tell you if the system is behaving correctly according to the criteria you define.

Why Metrics Are Important
  • ✅ Consistency – Ensure your model responses align with requirements across multiple test cases.
  • 🛑 Reliability – Catch regressions early by enforcing rules on generated outputs.
  • ⚡ Automation – Run evaluations at scale without manually inspecting every response.
  • 📈 Comparability – Use metrics to measure improvements between different model versions.

How Metrics Work

When you run a test in Trusys, you define one or more metrics per test case. The system evaluates the model output against these rules. If the metric passes, the test is marked successful; if not, it fails. Each metric comes with:
  • Name: The type of check (e.g., equals, contains).
  • Description: What the check does.
  • Value: The expected reference value(s), threshold, or grading criteria.
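Putting these parts together, a metric is typically expressed as a small JSON object inside a test case, mirroring the snippets shown throughout this page (exact fields vary per metric):

```json
{ "metric": "similar", "value": "Paris is the capital of France", "threshold": 0.8 }
```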

LLM as a Judge Metrics

similar

Checks semantic similarity against a reference, with a configurable threshold.
{ "metric": "similar", "value": "Paris is the capital of France", "threshold": 0.8 }

llm-rubric

Grades output based on custom rubric via an LLM.
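A hypothetical configuration, following the pattern of the other snippets on this page (the rubric text is illustrative; check the exact field names against the schema reference):

```json
{ "metric": "llm-rubric", "value": "Response must be polite, answer the question directly, and stay under 100 words" }
```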

factuality

Assesses factual correctness relative to ground truth.
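A sketch of a possible configuration, assuming factuality takes its ground truth in the value field like the other reference-based metrics on this page:

```json
{ "metric": "factuality", "value": "The Eiffel Tower is located in Paris, France" }
```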

g-eval

Generalized evaluation using a language model.
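A hypothetical example, assuming g-eval accepts grading criteria in value and a pass threshold like the similar metric above:

```json
{ "metric": "g-eval", "value": "Evaluate the response for coherence, relevance, and completeness", "threshold": 0.7 }
```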

Conversational Evaluation

conversational-g-eval

Checks conversation quality holistically.

knowledge-retention

Ensures model retains context across conversation.

role-adherence

Validates whether the assistant maintains its defined role.

conversation-relevancy

Checks if responses are contextually relevant.

RAG Metrics

RAG metrics evaluate different aspects of the retrieval-augmented generation pipeline, from the relevance of retrieved context to the faithfulness of generated answers.

context-faithfulness

Purpose: Ensures the generated answer doesn't hallucinate information beyond what's provided in the retrieved context.
Use Case: Critical for applications where factual accuracy is paramount, such as medical, legal, or financial systems.
Required Fields
  • query: The user's original question
  • context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field

context-recall

Purpose: Measures how much of the retrieved context was actually utilized in generating the answer.
Use Case: Validates that your retrieval system is finding relevant, useful information rather than just tangentially related content.
Required Fields
  • value: The expected answer or ground truth
  • context: Retrieved text (provided via the trusys field) or user input.
See Implementation Requirements below to learn how to provide context information in the trusys field

context-relevance

Purpose: Evaluates the quality of retrieved context by measuring how much of it is actually needed to answer the query.
Use Case: Helps optimize retrieval systems by identifying when too much irrelevant information is being retrieved.
Required Fields
  • query: The user's original question
  • context: Retrieved text (provided via the trusys field) or user input.
See Implementation Requirements below to learn how to provide context information in the trusys field

hallucination

Purpose: Measures whether the generated output introduces information not supported by the provided context. This helps ensure that the system does not fabricate facts, claims, or details that could mislead end users.
Use Case: Crucial for domains where factual precision is required, such as regulatory, financial, legal, healthcare, or enterprise settings, where hallucinated information can introduce operational or compliance risks.
Required Fields
  • context: The retrieved text or user-provided information the model is expected to rely on. Provided via the trusys.context field. See Implementation Requirements below to learn how to provide context information in the trusys field

Implementation Requirements for RAG Metrics

To enable the context-related RAG metrics (context-faithfulness, context-relevance, context-recall, hallucination), your application must include additional metadata in its responses.
For JSON Responses: If your application already returns JSON, add a trusys field:
"trusys": {                                                         // Required
    "query": "Which cities are near New York?",                                        // Required for context-faithfulness and context-relevance
    "context": "New Jersey, Connecticut, Philadelphia."                                  // Required for context-recall, context-faithfulness, context-relevance and hallucination
}
For Non-JSON Responses: If your application returns plain text or other formats, wrap the response in JSON:
{
    "output": "<original_output>",                                                          // Required
    "trusys": {
        "query": "Which cities are near New York?",                                        // Required for context-faithfulness and context-relevance
        "context": "New Jersey, Connecticut, Philadelphia."                                  // Required for context-recall, context-faithfulness, context-relevance and hallucination
    }
}

Agentic Metrics

Agentic metrics evaluate the performance of AI agents that can use tools and complete complex tasks. These metrics assess both the technical correctness of tool usage and the overall task completion effectiveness.

tool-correctness

Purpose: Verifies whether an agent correctly invoked tools with appropriate parameters and usage patterns.
Use Case: Essential for ensuring agents can reliably interact with external systems, APIs, and services.
Required Data
  • Expected tools called: List of tools expected to be called
  • Actual tool invocation data must be provided in trusys.tool_calls (array of tool call objects). See Implementation below.
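A hypothetical metric entry, assuming the expected tool list is supplied via the value field, following the pattern of the other snippets on this page (the actual invocations come from trusys.tool_calls as described above):

```json
{ "metric": "tool-correctness", "value": ["weather_tool"] }
```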

task-completion

Purpose: Assesses whether an agent successfully completed its assigned task, regardless of the specific tools used.
Use Case: Measures end-to-end agent effectiveness for complex, multi-step workflows.
Required Data
  • All tool invocations must be captured in trusys.tool_calls, including: name, description, arguments and output. See Implementation below.

argument-correctness

Purpose: Evaluates whether an agent provided correct and appropriate arguments when invoking tools, ensuring proper parameter usage and data types.
Use Case: Critical for validating that agents can correctly interact with APIs and tools by providing valid, well-formed parameters that match the expected schema.
Required Data
  • All tool invocations must be captured in trusys.tool_calls, including: name, description and arguments. See Implementation below.

plan-quality

Purpose: Evaluates the quality and completeness of the agent's plan for solving the task.
Use Case: Ensures the agent creates a logical, structured, and effective plan before execution.
Required Data
  • prompt
  • plan (must be provided in trusys.plan)
  • agent/tool calls
See Implementation below.

plan-adherence

Purpose: Measures how closely the agent follows the defined plan during execution.
Use Case: Detects deviations between planned steps and actual execution.
Required Data
  • prompt
  • plan (required)
  • agent/tool calls
See Implementation below.

goal-fulfillment

Purpose: Determines whether the agent successfully achieved all user goals.
Use Case: Validates end-to-end success beyond just tool execution.
Required Data
  • prompt
  • output
  • plan (optional but recommended)
  • agent/tool calls
See Implementation below.

execution-efficiency

Purpose: Measures how efficiently the agent completes a task (e.g., avoiding redundant steps or unnecessary tool calls).
Use Case: Helps optimize agent workflows and reduce cost and latency.
Required Data
  • prompt
  • output
  • plan (optional)
  • agent/tool calls
See Implementation below.

reasoning-coherence

Purpose: Evaluates whether the agent's reasoning steps are logically consistent and aligned with its plan.
Use Case: Ensures the agent follows a coherent thought process across steps.
Required Data
  • prompt
  • output
  • plan (optional)
  • agent/tool calls
See Implementation below.

agent-llm-rubric

Purpose: Evaluates agent behavior using custom user-defined criteria.
Use Case: Define rules such as:
  • "If tool X is called, do not call agent Y"
  • "Agent must validate input before tool execution"
Required Data
  • prompt
  • output
  • plan (optional)
  • Custom grading criteria
  • agent/tool calls
See Implementation below.

tool-selection-quality

Purpose: Determines whether the agent selected the correct tools and used appropriate arguments.
Use Case: Ensures optimal tool choice and prevents misuse of tools.
Required Data
  • prompt
  • agent/tool calls
See Implementation below.

Implementation Requirements for Agentic Metrics

To enable agentic metrics (tool-correctness, task-completion, argument-correctness), your application must include additional metadata in its responses.
For JSON Responses: If your application already returns JSON, add a trusys field:
"trusys": {                                                         
    "tool_calls": [
        {
            "name": "weather_tool",                                                     // Required 
            "description": "Provides weather information for a given location.",        // Required for Task completion & Argument Correctness
            "arguments": {"city": "New York", "state": "US"},                           // Required for Task completion & Argument Correctness
            "output": ["The current weather in New York is sunny with a temperature of 75°F."]      // Required for Task completion
        }
    ],
    "plan": [
      "Identify user intent (weather query)",
      "Select weather tool",
      "Call weather API with city parameter",
      "Return formatted response"
    ],
    "agent_calls": [
      {
        "name": "weather_agent",
        "description": "Handles weather-related queries",
        "input": "Get weather for New York",
        "output": "Sunny, 75°F"
      }
    ]
}
For Non-JSON Responses: If your application returns plain text or other formats, wrap the response in JSON:
{
    "output": "<original_output>",                                                          // Required
    "trusys": {
        "tool_calls": [
            {
                "name": "weather_tool",                                                     // Required
                "description": "Provides weather information for a given location.",        // Required for Task completion & Argument Correctness
                "arguments": {"city": "New York", "state": "US"},                           // Required for Task completion & Argument Correctness
                "output": ["The current weather in New York is sunny with a temperature of 75°F."]      // Required for Task completion
            }
        ],
        "plan": [
            "Identify user intent (weather query)",
            "Select weather tool",
            "Call weather API with city parameter",
            "Return formatted response"
        ],
        "agent_calls": [
            {
                "name": "weather_agent",
                "description": "Handles weather-related queries",
                "input": "Get weather for New York",
                "output": "Sunny, 75°F"
            }
        ]
    }
}

Multimodal Metrics

image-helpfulness

Evaluates helpfulness of image-based response.

multimodal-contextual-relevancy

Checks relevancy across text + image inputs.

multimodal-contextual-recall

Ensures multimodal outputs recall relevant context.

Audio Analysis

audio-sentiment

Analyzes the emotional tone of the speaker in the audio (e.g., positive, negative, neutral). Use Case: Helps understand user satisfaction, detect frustration, and improve conversational experience in voice interactions.

audio-call-category

Classifies the type or purpose of the call (e.g., support, sales, complaint, inquiry). Use Case: Enables better routing, analytics, and performance tracking across different call types.

audio-language

Identifies the primary language spoken in the audio. Use Case: Supports multilingual applications by ensuring correct language detection and response handling.

audio-interruption-handling

Evaluates how well the system manages interruptions during a conversation (e.g., user speaking over the agent or vice versa). Use Case: Ensures smooth conversational flow and natural turn-taking in real-time voice interactions.

audio-false-intent-identification

Detects whether the system incorrectly interprets or assigns user intent from the audio input. Use Case: Helps identify misclassification of user intent, reducing errors in downstream actions and improving overall accuracy.

Verifies whether proper user consent is obtained during the interaction. Use Case: Critical for regulated industries (e.g., finance, healthcare) where explicit consent is required before proceeding.

audio-drop-out-rates

Measures the frequency of dropped calls, silences, or incomplete interactions. Use Case: Helps identify system reliability issues or poor user experience due to interruptions.

audio-agent-response-time

Calculates how quickly the AI agent responds during a conversation. Use Case: Ensures low latency in voice interactions, improving user experience in real-time systems like support calls.

audio-talk-to-listen-ratio

Evaluates the ratio of agent speaking time vs user speaking time. Use Case: Useful for analyzing conversational balance, ensuring the agent is not overly dominant or unresponsive.

audio-conversation-coherence

Assesses whether the conversation flows logically across multiple turns. Use Case: Detects breakdowns in dialogue continuity, context loss, or irrelevant responses.

audio-disclaimer-check

Checks whether required disclaimers are present in the conversation. Use Case: Ensures compliance with legal or policy requirements (e.g., "This call is being recorded").

audio-language-switch

Detects transitions between languages during a conversation. Use Case: Important for multilingual applications to ensure smooth language handling and avoid unintended switches.

Code Evaluation

code-execution

Checks if output code executes without errors.
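Assuming code-execution takes no reference value, in line with parameterless checks such as is-json elsewhere on this page, a minimal entry might look like:

```json
{ "metric": "code-execution" }
```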

code-language

Validates if code is written in expected language.
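A hypothetical configuration, assuming the expected language is passed via the value field:

```json
{ "metric": "code-language", "value": "python" }
```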

contains-code-language

Ensures output contains code of a specific language.

code-readability

Grades code readability.

code-correctness

Checks correctness of code relative to spec.

Content Moderation

moderation

Flags inappropriate, unsafe, or disallowed content.
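Assuming moderation runs without a reference value, like other parameterless checks on this page, a minimal entry might be:

```json
{ "metric": "moderation" }
```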

Video Metrics

Trusys provides a suite of video evaluation metrics to assess the quality, accuracy, and consistency of AI-generated or model-produced videos. These metrics help benchmark model performance and identify visual, temporal, and factual weaknesses.

video-visual-quality

Evaluates the visual fidelity of the generated video, including aspects like resolution, clarity, color accuracy, and the absence of artifacts such as blurriness, distortion, or pixelation.

text-to-video

Measures semantic alignment between the input text prompt and the generated video content.

video-temporal-consistency

Assesses frame-to-frame continuity to ensure smooth motion, natural transitions, and absence of flickering or scene instability.

video-factual-consistency

Evaluates the factual and logical accuracy of visual content, ensuring that scenes adhere to real-world physics, object integrity, and causal coherence.

video-g-eval

An LLM-based holistic evaluation metric that assesses overall video quality across customizable dimensions such as creativity, realism, alignment, and storytelling.
Evaluation Requirements:
  • Generated video.
  • Grading Criteria (mandatory): Text-based rubric defining what aspects the model should evaluate (e.g., "creativity," "consistency," "prompt alignment").

General Metrics (Performance)

bleu

Measures n-gram overlap with reference text.
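A sketch of a possible configuration, assuming bleu takes the reference text in value and a minimum score in threshold, like the similar metric earlier on this page:

```json
{ "metric": "bleu", "value": "The cat sat on the mat.", "threshold": 0.5 }
```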

rouge-n

Evaluates recall of n-grams against reference.
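Similarly, a hypothetical rouge-n entry, assuming the same value/threshold pattern:

```json
{ "metric": "rouge-n", "value": "The quick brown fox jumps over the lazy dog.", "threshold": 0.4 }
```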

latency

Records response time.
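For example (this form also appears in the example test case later on this page; the threshold is in milliseconds):

```json
{ "metric": "latency", "threshold": 2000 }
```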

perplexity

Measures fluency and predictability of text.

Example Test Case

{
  "prompt": "What is 5 + 10?",
  "value": {
    "metrics": [
      { "metric": "equals", "value": "15" },
      { "metric": "greater-than", "value": 10 },
      { "metric": "latency", "threshold": 2000 }
    ]
  }
}
✅ Output: 15 in under 2 seconds → Passes all metrics
❌ Output: Sixteen, or a slow response → Fails

Matching & Validation

Text Matching

These metrics check for textual similarity or presence.

equals

Checks if the output exactly matches the expected value.
{ "metric": "equals", "value": "15" }
✅ Output: 15 → Pass

contains

Checks if the output contains the expected substring.
{ "metric": "contains", "value": "hello world" }
✅ Output: Hi, hello world! → Pass

icontains

Case-insensitive substring check.
{ "metric": "icontains", "value": "Hello" }
✅ Output: hello everyone → Pass

contains-all

Checks if output contains all provided strings.
{ "metric": "contains-all", "value": ["apple", "banana"] }
✅ Output: apple and banana smoothie → Pass

contains-any

Checks if output contains at least one provided string.
{ "metric": "contains-any", "value": ["success", "completed"] }
✅ Output: The task was completed → Pass

Negated Assertions

The inverse of text-matching metrics.

not-equals

Passes if output does not equal expected value.
{ "metric": "not-equals", "value": "error" }
✅ Output: success → Pass

not-contains

Fails if output contains forbidden text.
{ "metric": "not-contains", "value": "forbidden" }
✅ Output: allowed content → Pass

Numeric Validation

equals-number

Checks if numeric output matches exactly.
{ "metric": "equals-number", "value": 42 }
✅ Output: 42 → Pass

greater-than

Ensures output number > expected.
{ "metric": "greater-than", "value": 10 }
✅ Output: 15 → Pass

less-than

Ensures output number < expected.
{ "metric": "less-than", "value": 100 }
✅ Output: 42 → Pass

Format Validation

Checks structural correctness of outputs.

is-json

Validates that the output is valid JSON.
{ "metric": "is-json" }
✅ Output: { "a": 1 } → Pass

contains-json

Checks if JSON structure is embedded inside output.
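Following the parameterless form of is-json above, a minimal entry would presumably be:

```json
{ "metric": "contains-json" }
```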

is-xml, contains-xml, is-sql

Validate whether the output conforms to XML or SQL formats.
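These presumably follow the same parameterless pattern as is-json, e.g.:

```json
{ "metric": "is-sql" }
```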

JSON / Structured Validation

json-equals

Validates JSON equality.
{ "metric": "json-equals", "value": {"status": "ok"} }
✅ Output: { "status": "ok" } → Pass

array-length

Checks array length.
{ "metric": "array-length", "value": 3 }
✅ Output: [1,2,3] → Pass

This page covers all metric groups available in Trusys.ai for validating and benchmarking AI outputs.