- ✅ Consistency – Ensure your model responses align with requirements across multiple test cases.
- 🛡 Reliability – Catch regressions early by enforcing rules on generated outputs.
- ⚡ Automation – Run evaluations at scale without manually inspecting every response.
- 📈 Comparability – Use metrics to measure improvements between different model versions.
How Metrics Work
When you run a test in Trusys, you define one or more metrics per test case. The system evaluates the model output against these rules. If the metric passes, the test is marked successful; if not, it fails. Each metric comes with the following (a minimal configuration sketch follows the list):
- Name: The type of check (e.g., equals, contains).
- Description: What the check does.
- Value: The expected reference value(s), grading criteria, or other parameters the check needs.
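As an illustration only, a metric definition could be written as a small JSON object like the one below. The field names (name, description, value) simply mirror the three parts listed above; the exact schema Trusys expects may differ.

```json
{
  "name": "contains",
  "description": "Checks if the output contains the expected substring.",
  "value": "hello"
}
```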
Available Metrics
Matching & Validation
Text Matching
These metrics check for textual similarity or presence.

| Metric | Description | Example |
| --- | --- | --- |
| equals | Checks if the output exactly matches the expected value. | 15 → Pass |
| contains | Checks if the output contains the expected substring. | Hi, hello world! → Pass |
| icontains | Case-insensitive substring check. | hello everyone → Pass |
| contains-all | Checks if output contains all provided strings. | apple and banana smoothie → Pass |
| contains-any | Checks if output contains at least one provided string. | The task was completed → Pass |
Negated Assertions
The inverse of text-matching metrics.

| Metric | Description | Example |
| --- | --- | --- |
| not-equals | Passes if output does not equal expected value. | success → Pass |
| not-contains | Fails if output contains forbidden text. | allowed content → Pass |
Numeric Validation

| Metric | Description | Example |
| --- | --- | --- |
| equals-number | Checks if numeric output matches exactly. | 42 → Pass |
| greater-than | Ensures output number > expected. | 15 → Pass |
| less-than | Ensures output number < expected. | 42 → Pass |
Format Validation
Checks structural correctness of outputs.

| Metric | Description | Example |
| --- | --- | --- |
| is-json | Validates that the output is valid JSON. | { "a": 1 } → Pass |
| contains-json | Checks if a JSON structure is embedded inside the output. | |
| is-xml, contains-xml, is-sql | Validate that the output conforms to the XML or SQL format. | |
JSON / Structured Validation

| Metric | Description | Example |
| --- | --- | --- |
| json-equals | Validates JSON equality. | { "status": "ok" } → Pass |
| array-length | Checks array length. | [1,2,3] → Pass |
LLM as a Judge Metrics
- similar – Checks semantic similarity with threshold.
- llm-rubric – Grades output based on a custom rubric via an LLM (a rubric sketch follows this list).
- factuality – Assesses factual correctness relative to ground truth.
- g-eval – Generalized evaluation using a language model.
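For example, an llm-rubric check might carry its grading criteria as free text in the metric's value. The snippet below reuses the same hypothetical schema as the earlier sketch and is illustrative only, not Trusys' confirmed configuration format.

```json
{
  "name": "llm-rubric",
  "value": "The answer must be polite, cite the provided source, and stay under 100 words."
}
```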
Conversational Evaluation
- conversational-g-eval – Checks conversation quality holistically.
- knowledge-retention – Ensures the model retains context across the conversation.
- role-adherence – Validates whether the assistant maintains its defined role.
- conversation-relevancy – Checks if responses are contextually relevant.
RAG Metrics
RAG metrics evaluate different aspects of the retrieval-augmented generation pipeline, from the relevance of retrieved context to the faithfulness of generated answers.

context-faithfulness
Purpose: Ensures the generated answer doesn't hallucinate information beyond what's provided in the retrieved context.
Use Case: Critical for applications where factual accuracy is paramount, such as medical, legal, or financial systems.
Required Fields:
- query: The user's original question
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.

context-recall
Purpose: Measures how much of the retrieved context was actually utilized in generating the answer.
Use Case: Validates that your retrieval system is finding relevant, useful information rather than just tangentially related content.
Required Fields:
- value: The expected answer or ground truth
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.

context-relevance
Purpose: Evaluates the quality of retrieved context by measuring how much of it is actually needed to answer the query.
Use Case: Helps optimize retrieval systems by identifying when too much irrelevant information is being retrieved.
Required Fields:
- query: The user's original question
- context: Retrieved text (provided via the trusys field) or user input. See Implementation Requirements below to learn how to provide context information in the trusys field.
Agentic Metrics
Agentic metrics evaluate the performance of AI agents that can use tools and complete complex tasks. These metrics assess both the technical correctness of tool usage and the overall task completion effectiveness.

tool-correctness
Purpose: Verifies whether an agent correctly invoked tools with appropriate parameters and usage patterns.
Use Case: Essential for ensuring agents can reliably interact with external systems, APIs, and services.
Required Data:
- expected tools called: List of tools expected to be called
- Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.

task-completion
Purpose: Assesses whether an agent successfully completed its assigned task, regardless of the specific tools used.
Use Case: Measures end-to-end agent effectiveness for complex, multi-step workflows.
Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.

argument-correctness
Purpose: Evaluates whether an agent provided correct and appropriate arguments when invoking tools, ensuring proper parameter usage and data types.
Use Case: Critical for validating that agents can correctly interact with APIs and tools by providing valid, well-formed parameters that match the expected schema.
Tool invocation details must be provided in the trusys.tool_calls array field; see Implementation Requirements below.
Implementation Requirements for RAG and Agentic Metrics
To enable tool-related metrics (tool-correctness, task-completion, argument-correctness) and context-related metrics (context-faithfulness, context-relevance, context-recall), your application must include additional metadata in its responses.

For JSON Responses:
If your application already returns JSON, add a trusys field, as in the sketch below.
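The snippet below is only a sketch of such a response: the trusys object carries the context text used by the RAG metrics and the tool_calls array used by the agentic metrics, as described above, while the surrounding payload, the tool name, and the argument structure are hypothetical.

```json
{
  "answer": "Your refund for order 12345 has been issued.",
  "trusys": {
    "context": "Refund policy: approved refunds are issued to the original payment method within 5 business days.",
    "tool_calls": [
      {
        "name": "issue_refund",
        "arguments": { "order_id": "12345" }
      }
    ]
  }
}
```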
Multimodal Metrics
- image-helpfulness – Evaluates helpfulness of image-based response.
- multimodal-contextual-relevancy – Checks relevancy across text + image inputs.
- multimodal-contextual-recall – Ensures multimodal outputs recall relevant context.
Audio Analysis
- audio-sentiment – Detects sentiment in audio.
- audio-call-category – Classifies type of call.
- audio-language – Detects spoken language.
- audio-interruption-handling – Checks ability to handle interruptions gracefully.
- audio-false-intent-identification – Detects false or misleading intent in audio input.
Code Evaluation
- code-execution – Checks if output code executes without errors.
- code-language – Validates if code is written in the expected language.
- contains-code-language – Ensures output contains code of a specific language.
- code-readability – Grades code readability.
- code-correctness – Checks correctness of code relative to spec.
Content Moderation
- moderation – Flags inappropriate, unsafe, or disallowed content.
Video Metrics
Trusys provides a suite of video evaluation metrics to assess the quality, accuracy, and consistency of AI-generated or model-produced videos. These metrics help benchmark model performance and identify visual, temporal, and factual weaknesses.

- video-visual-quality – Evaluates the visual fidelity of the generated video, including aspects like resolution, clarity, color accuracy, and the absence of artifacts such as blurriness, distortion, or pixelation.
- text-to-video – Measures semantic alignment between the input text prompt and the generated video content.
- video-temporal-consistency – Assesses frame-to-frame continuity to ensure smooth motion, natural transitions, and absence of flickering or scene instability.
- video-factual-consistency – Evaluates the factual and logical accuracy of visual content, ensuring that scenes adhere to real-world physics, object integrity, and causal coherence.
- video-g-eval – An LLM-based holistic evaluation metric that assesses overall video quality across customizable dimensions such as creativity, realism, alignment, and storytelling.

Evaluation Requirements (for video-g-eval):
- Generated video.
- Grading Criteria (mandatory): Text-based rubric defining what aspects the model should evaluate (e.g., "creativity," "consistency," "prompt alignment").
General Metrics (Performance)
- bleu – Measures n-gram overlap with reference text.
- rouge-n – Evaluates recall of n-grams against reference.
- latency – Records response time.
- perplexity – Measures fluency and predictability of text.
Example Test Case
Consider a test case that expects the numeric answer 15 with a response time under 2 seconds (a configuration sketch follows the list):
- ✅ Output: 15 in under 2 seconds → Passes all metrics
- ❌ Output: Sixteen, or a slow response → Fail
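Using the same hypothetical configuration schema as the earlier sketches, such a test case might combine a numeric check with a latency threshold. The millisecond value used for latency is an assumption, not Trusys' confirmed format.

```json
{
  "metrics": [
    { "name": "equals-number", "value": 15 },
    { "name": "latency", "value": 2000 }
  ]
}
```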
This page covers all metric groups available in Trusys.ai for validating and benchmarking AI outputs.