Trusys enables you to evaluate your AI applications with structured prompt libraries, datasets, and functional evaluations. This helps you assess accuracy, reliability, and safety before deployment. A Functional Evaluation runs prompts or datasets against your connected AI applications or models, scoring responses based on defined metrics.
Key Steps
- Connect your AI application or LLM model
- Create a prompt library with datasets and variables
- Run functional evaluation using the prompt library
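For orientation, the sketch below shows how a prompt library with variables and a dataset of test cases relate conceptually. It is a hypothetical illustration in plain Python, not the Trusys API or its file format; all names and values are invented for the example.

```python
# Hypothetical illustration only -- this is not the Trusys API or file format.
# It sketches the relationship between a prompt library, its variables,
# and a dataset of test cases that supplies values for those variables.

prompt_library = {
    "name": "customer-support-prompts",          # assumed example name
    "prompts": [
        {
            "id": "refund-policy",
            "template": "A customer asks: {question}\nAnswer using the {region} refund policy.",
            "variables": ["question", "region"],  # placeholders filled per test case
        },
    ],
}

dataset = [
    # Each row supplies one set of variable values, i.e. one test case.
    {"question": "Can I return a used item?", "region": "EU"},
    {"question": "How long do refunds take?", "region": "US"},
]

# Rendering the prompt template once per test case:
template = prompt_library["prompts"][0]["template"]
for case in dataset:
    print(template.format(**case))
```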
Run a New Evaluation
To initiate a new evaluation, navigate to the "Test Run" section and follow these steps:
Select an Application
Choose the AI application or LLM model you wish to evaluate from your list of connected applications. This is the target for your test run.
Select a Prompt Library
Select the Prompt Library that contains the prompts you want to use for this evaluation. The prompts within this library will serve as the inputs for your chosen application.
Run Evaluation
Review the selected application(s) and prompt libraries. Once confirmed, click Run Evaluation to initiate the test run. Trusys will then execute the prompts, collect responses, and evaluate against the defined metrics.
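Conceptually, a test run renders each prompt, sends it to the target application, and scores the response with the configured metrics. The following is a simplified, hypothetical sketch of that loop; `call_application`, `exact_match_metric`, and the test cases are stand-ins, not Trusys internals.

```python
# Hypothetical sketch of what a functional evaluation does conceptually.
# None of these names come from Trusys; call_application and the metric
# are stand-ins for your connected application and your configured metrics.

def call_application(prompt: str) -> str:
    """Stand-in for the connected AI application or LLM endpoint."""
    return "stub response for: " + prompt

def exact_match_metric(response: str, expected: str) -> bool:
    """A trivial example metric: pass if the response contains the expected text."""
    return expected.lower() in response.lower()

test_cases = [
    {"prompt": "What is the refund window?", "expected": "30 days"},
    {"prompt": "Do you ship internationally?", "expected": "yes"},
]

results = []
for case in test_cases:
    response = call_application(case["prompt"])
    passed = exact_match_metric(response, case["expected"])
    results.append({"prompt": case["prompt"], "response": response, "passed": passed})

pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")
```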
Evaluation Run List
The list view shows the following for each run (sketched as a record after the list):
- Status (Pending, Running, Completed, Failed)
- Application evaluated
- Prompt Library used
- Start Time of the run
- Total Prompts executed
- Passed/Failed Metrics Count
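These fields could be pictured as a simple record per run, as in the hypothetical sketch below; the field names, types, and example values are assumptions for illustration, not a Trusys schema.

```python
# Hypothetical record for one evaluation run, mirroring the list-view fields.
# Field names, types, and example values are assumptions, not a Trusys schema.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class RunStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"

@dataclass
class EvaluationRun:
    status: RunStatus
    application: str        # application evaluated
    prompt_library: str     # prompt library used
    start_time: datetime
    total_prompts: int
    passed_metrics: int
    failed_metrics: int

run = EvaluationRun(
    status=RunStatus.COMPLETED,
    application="support-bot-v2",               # assumed example name
    prompt_library="customer-support-prompts",  # assumed example name
    start_time=datetime(2024, 5, 1, 14, 30),
    total_prompts=50,
    passed_metrics=46,
    failed_metrics=4,
)
print(run.status.value, run.passed_metrics, "metrics passed")
```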
Evaluation Run Details
Click a run to see detailed results:
- Overall Summary: Pass/fail rate and metric performance
- Prompt-by-Prompt Analysis
  - Input Prompt: The exact prompt that was sent to the AI application.
  - AI Response: The response received from your AI application.
  - Metric Results: The individual scores for each metric applied to that specific prompt-response pair, along with whether it passed or failed its expected value.
  - Variable Values: If variables were used, the specific values that were substituted for that test case.
- Metric Report: Average scores, distributions, and pass/fail counts (a simple aggregation sketch follows this list)
- Comparison View (optional): Compare multiple runs across applications
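To give a rough sense of how per-prompt metric results roll up into the Metric Report, here is a hypothetical aggregation sketch; the result records and metric names are invented for the example and do not reflect Trusys output formats.

```python
# Hypothetical aggregation of per-prompt metric results into a metric report.
# The result records and metric names below are illustrative assumptions.
from collections import defaultdict
from statistics import mean

prompt_results = [
    {"prompt_id": "p1", "metric": "relevance", "score": 0.92, "passed": True},
    {"prompt_id": "p1", "metric": "toxicity",  "score": 0.01, "passed": True},
    {"prompt_id": "p2", "metric": "relevance", "score": 0.55, "passed": False},
    {"prompt_id": "p2", "metric": "toxicity",  "score": 0.02, "passed": True},
]

# Group per-prompt results by metric, then report average score and pass counts.
by_metric = defaultdict(list)
for r in prompt_results:
    by_metric[r["metric"]].append(r)

for metric, rows in by_metric.items():
    avg = mean(row["score"] for row in rows)
    passed = sum(row["passed"] for row in rows)
    print(f"{metric}: avg={avg:.2f}, passed={passed}/{len(rows)}")
```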
Use both prompt libraries (for targeted tests) and datasets (for comprehensive benchmarking) to get the most reliable evaluation results.