Functional evaluation is a critical aspect of ensuring your AI applications perform as expected and deliver accurate, reliable, and safe outputs. Trusys provides a robust Prompt Library feature that allows you to systematically test and evaluate your AI models using a variety of prompts and predefined metrics.
Key Steps
  • Connect your AI application or LLM model
  • Create a prompt library with datasets and variables
  • Run functional evaluation using the prompt library
A Functional Evaluation runs the prompts in a prompt library, or the records in a dataset, against your connected AI applications or models and scores each response against the defined metrics. This helps you assess accuracy, reliability, and safety before deployment.
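Conceptually, a functional evaluation is a loop over prompt-response pairs with scoring against a threshold. The sketch below is illustrative only: `call_application`, the toy metric, and the sample prompts are placeholders and are not part of the Trusys API.

```python
# Conceptual sketch of a functional evaluation loop.
# `call_application` and the metric below are placeholders, not the Trusys API.

def call_application(prompt: str) -> str:
    """Stand-in for sending a prompt to a connected AI application."""
    return f"Echoed response for: {prompt}"

def exact_match_metric(response: str, expected: str) -> float:
    """Toy metric: 1.0 if the response contains the expected text, else 0.0."""
    return 1.0 if expected.lower() in response.lower() else 0.0

prompt_library = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Summarize the refund policy in one sentence.", "expected": "refund"},
]

results = []
for case in prompt_library:
    response = call_application(case["prompt"])
    score = exact_match_metric(response, case["expected"])
    results.append({"prompt": case["prompt"], "score": score, "passed": score >= 1.0})

pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")
```

In practice, Trusys handles this loop for you: it sends each prompt to the selected application, collects the responses, and reports per-metric pass/fail results.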

Run a New Evaluation

To initiate a new evaluation, navigate to the Test Run section and follow these steps:
1. Select an Application – Choose the AI application or LLM model you wish to evaluate from your list of connected applications. This is the target for your test run.

2. Select a Prompt Library – Select the Prompt Library that contains the prompts you want to use for this evaluation. The prompts within this library will serve as the inputs for your chosen application.

3. Run Evaluation – Review the selected application(s) and prompt libraries. Once confirmed, click Run Evaluation to initiate the test run. Trusys will then execute the prompts, collect the responses, and evaluate them against the defined metrics.

Evaluation Run List

The list view shows:
  • Status (Pending, Running, Completed, Failed)
  • Application evaluated
  • Prompt Library used
  • Start Time of the run
  • Total Prompts executed
  • Passed/Failed Metrics Count
This list helps you monitor the progress of ongoing evaluations and quickly identify completed or failed test runs.
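If you work with run data outside the UI (for example, from an export), the columns above map naturally onto a simple record. The dataclass below is a hypothetical illustration of that shape, not the Trusys data model; all field names and example values are made up.

```python
# Hypothetical record for an evaluation run, mirroring the columns listed above.
# Field names and values are illustrative; this is not the Trusys schema.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class RunStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"

@dataclass
class EvaluationRun:
    status: RunStatus
    application: str        # application evaluated
    prompt_library: str     # prompt library used
    start_time: datetime
    total_prompts: int
    passed_metrics: int
    failed_metrics: int

run = EvaluationRun(
    status=RunStatus.COMPLETED,
    application="support-bot-v2",
    prompt_library="Refund Policy Prompts",
    start_time=datetime(2024, 5, 1, 10, 30),
    total_prompts=25,
    passed_metrics=22,
    failed_metrics=3,
)
print(f"{run.application}: {run.passed_metrics} passed, {run.failed_metrics} failed")
```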

Evaluation Run Details

Click a run to see detailed results:
  • Overall Summary – Pass/fail rate, metric performance
  • Prompt-by-Prompt Analysis
    • Input Prompt: The exact prompt that was sent to the AI application.
    • AI Response: The response received from your AI application.
    • Metric Results: The individual scores for each metric applied to that prompt-response pair, along with whether each passed or failed its expected value.
    • Variable Values: If variables were used, the specific values that were substituted for that test case (illustrated in the sketch after this list).
  • Metric Report – Average scores, distributions, pass/fail counts
  • Comparison View (optional) – Compare multiple runs across applications
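To make the prompt-by-prompt fields concrete, the snippet below shows how a templated prompt with variables might expand into a single test case and a result record. The template, metric names, and thresholds are hypothetical examples, not Trusys identifiers.

```python
# Illustrative only: how a prompt template with variables expands into a concrete
# test case, and what a per-prompt result record might contain. None of these
# names come from the Trusys API; they are placeholders for the fields above.

prompt_template = "Summarize the {document_type} for a {audience} in two sentences."
variable_values = {"document_type": "privacy policy", "audience": "new customer"}

input_prompt = prompt_template.format(**variable_values)

prompt_result = {
    "input_prompt": input_prompt,
    "ai_response": "We collect only the data needed to provide the service ...",
    "metric_results": {
        "relevance": {"score": 0.91, "threshold": 0.80, "passed": True},
        "toxicity": {"score": 0.02, "threshold": 0.10, "passed": True},
    },
    "variable_values": variable_values,
}

print(input_prompt)
print("All metrics passed:", all(m["passed"] for m in prompt_result["metric_results"].values()))
```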
Use both prompt libraries (for targeted tests) and datasets (for comprehensive benchmarking) to get the most reliable evaluation results.