Optimization Playground - trusys Documentation

The Optimization Playground is a Trusys tool for automatically improving your AI application’s performance. Instead of manually tweaking prompts or parameters, it systematically tests different configurations and finds the settings that score best on your chosen evaluation metric.

The Optimization Playground offers three optimization approaches:

Prompt Optimization - Automatically improve your system prompt while keeping the structure fixed
Example Optimization - Find the optimal number and combination of few-shot examples to include in your system prompt
Parameter Optimization - Find the best LLM settings (temperature, top_p, etc.) for your specific task

Each approach systematically tests different variations and measures their performance using your chosen metric.

Experimentation Tab

The Experimentation tab is where you set up and run optimization experiments. Select your optimization type and follow the steps below.

1. Prompt Optimization

Optimize your system prompt to improve performance against your evaluation metric.

Setup Steps:

Select Optimization Type - Choose “Prompt Optimization”
Select Application
- Choose your LLM model (e.g., GPT-4, Claude, Ollama, etc.)
- This is the model that will be tested with different system prompt variations
Enter System Prompt
- Provide your current system prompt
- This serves as the baseline for optimization
- Trusys will generate variations of this prompt to test
Select Prompt Library
- Choose the prompt library containing your test cases
- All prompts in the library will be tested against each system prompt variation
- Ensure your library includes diverse, representative test cases
Select Metric
- Choose the evaluation metric to measure performance
- Options include accuracy, relevance, coherence, and custom metrics
- The metric score determines which system prompt is best
Configure Optimization Settings
- Number of Trials: Default 20-30 (higher = more thorough, longer time)
- Parallel Threads: 4-8 threads for evaluation
- Local Refinement: Fraction of trials spent fine-tuning around the best result found so far (default 0.3, i.e. 30% of trials)
Run Optimization
- Click Run to start the optimization process
- Trusys creates multiple trials, each with a slightly different system prompt
- Progress is shown as trials complete

How It Works:

Each trial tests a unique variation of your system prompt
All prompts in your prompt library are evaluated with that system prompt variation
An average metric score is calculated across all prompt-response pairs
Once all trials complete, the system prompt with the highest average metric score is selected as the optimal version

Example Workflow: If you run 10 trials with a prompt library of 50 test cases:

Trial 1: System prompt variation A × 50 test cases = average score of 0.82
Trial 2: System prompt variation B × 50 test cases = average score of 0.85
Trial 3: System prompt variation C × 50 test cases = average score of 0.83
… (continues for all 10 trials)
Best Result: Trial 2’s system prompt with score 0.85
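The selection logic described above can be sketched in a few lines of Python. This is a minimal illustration, not Trusys's implementation: `run_prompt_trials` and `toy_score` are hypothetical names, and `toy_score` stands in for "call the model, then apply the metric".

```python
from statistics import mean
from typing import Callable


def run_prompt_trials(
    variations: list[str],
    test_cases: list[str],
    score: Callable[[str, str], float],
) -> tuple[str, float, list[float]]:
    """Score every variation on every test case and return the best one."""
    averages = []
    for system_prompt in variations:
        # One trial = one system prompt variation run against the whole library.
        averages.append(mean([score(system_prompt, case) for case in test_cases]))
    best_index = max(range(len(variations)), key=averages.__getitem__)
    return variations[best_index], averages[best_index], averages


def toy_score(system_prompt: str, test_case: str) -> float:
    # Hypothetical stand-in for a real model call plus metric evaluation.
    return 1.0 if "concise" in system_prompt else 0.5


variations = [
    "You are a helpful assistant.",
    "You are a concise, helpful assistant.",
]
test_cases = ["What is 2 + 2?", "Name the capital of France."]
best_prompt, best_score, all_scores = run_prompt_trials(variations, test_cases, toy_score)
print(best_prompt, best_score)
```

Because each trial's score is an average over the whole library, a small or unrepresentative library can make one variation look better than it really is.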

2. Example Optimization

Find the optimal number and selection of few-shot examples to include in your system prompt.

Setup Steps:

Select Optimization Type - Choose “Example Optimization”
Select Application
- Choose your LLM model
- This model will evaluate prompts with different few-shot examples
Enter System Prompt
- Provide your base system prompt (without examples)
- Examples will be dynamically added to this prompt during optimization
Select Dataset
- Choose a dataset containing question-answer pairs
- This dataset provides both the examples to include and the test prompts
- The dataset is split: some entries become few-shot examples, others become test cases
Select Metric
- Choose the evaluation metric (accuracy, relevance, etc.)
- The metric determines which example configuration is best
Configure Optimization Settings
- Number of Trials: Default 10-15
- Min Examples: Minimum few-shot examples to include (default: 2)
- Max Examples: Maximum few-shot examples to include (default: 8)
- Parallel Threads: Number of parallel evaluation threads
Run Optimization
- Click Run to start the optimization
- Trusys creates trials with different numbers and combinations of examples

How It Works:

Each trial selects a specific subset of examples from your dataset to use as few-shot examples
The remaining dataset items are used as test prompts
The system prompt + selected examples are sent with each test prompt
The model’s responses are evaluated against the expected answers for the test prompts using your metric
An average metric score is calculated for that example configuration
Once all trials complete, the example configuration with the highest average metric score is selected

Example Workflow: If you run 8 trials with a dataset of 100 items:

Trial 1: Add 2 examples → test remaining 98 items → average score 0.79
Trial 2: Add 4 examples → test remaining 96 items → average score 0.83
Trial 3: Add 6 examples → test remaining 94 items → average score 0.85
Trial 4: Add 8 examples → test remaining 92 items → average score 0.82
… (more trials testing different combinations)
Best Result: Trial 3 with 6 examples and score 0.85
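As a rough sketch of what a single trial does with the dataset, the snippet below splits question-answer pairs into few-shot examples and held-out test items, then appends the examples to the base system prompt. The helper names, the `Q:`/`A:` layout, and the random sampling strategy are assumptions made for illustration; the way Trusys actually selects and formats examples is not specified here.

```python
import random


def build_trial(dataset, num_examples, rng):
    """Split the dataset into few-shot examples and held-out test items."""
    chosen = set(rng.sample(range(len(dataset)), num_examples))
    examples = [item for i, item in enumerate(dataset) if i in chosen]
    test_items = [item for i, item in enumerate(dataset) if i not in chosen]
    return examples, test_items


def render_prompt(system_prompt, examples):
    """Append the chosen question-answer pairs to the base system prompt."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{system_prompt}\n\nExamples:\n{shots}"


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dataset = [(f"question {i}", f"answer {i}") for i in range(100)]
min_examples, max_examples = 2, 8  # the defaults listed in the setup steps

k = rng.randint(min_examples, max_examples)
examples, test_items = build_trial(dataset, k, rng)
prompt = render_prompt("Answer briefly.", examples)
print(k, len(test_items))
```

Keeping the few-shot examples out of the test items matters: scoring the model on questions whose answers already appear in its prompt would inflate the metric.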

3. Parameter Optimization

Automatically find the best LLM settings (temperature, top_p, etc.) for your specific task.

Setup Steps:

Select Optimization Type - Choose “Parameter Optimization”
Select Application
- Choose your LLM model (OpenAI, Anthropic, Google, Azure, Ollama, etc.)
- This is the model whose parameters will be optimized
Select Prompt Library or Dataset
- Prompt Library: Use your predefined test cases with assertions
- Dataset: Use a dataset with question-answer pairs
- All items will be tested with each parameter combination
Select Metric
- Choose how to measure performance (accuracy, relevance, coherence, etc.)
- The metric score determines which parameter settings are best
Define Parameters to Optimize
- Select which LLM parameters to tune from the available options:
  - Temperature (0.0 to 2.0): Controls randomness in responses
    - Low (0.0-0.5): More focused, deterministic (good for factual tasks)
    - High (0.8-2.0): More creative, varied (good for creative tasks)
  - Top P (0.0 to 1.0): Controls diversity through probability mass
    - Low (0.1-0.5): More focused, predictable
    - High (0.8-1.0): More diverse, creative
  - Frequency Penalty (-2.0 to 2.0): Penalizes tokens in proportion to how often they have already appeared
    - Negative: Encourages repetition
    - Positive: Reduces repetition
  - Presence Penalty (-2.0 to 2.0): Penalizes any token that has already appeared, regardless of how often
    - Negative: Encourages reusing tokens
    - Positive: Encourages new tokens
  - Max Tokens (1 to 4000+): Caps response length
    - Lower: Shorter, more concise
    - Higher: Longer, more detailed
Set Parameter Ranges
- For each parameter, define the search range
- Example: Temperature 0.0 to 1.5, Top P 0.5 to 1.0
- Use quantization (step sizes) to reduce search space
- Tips:
  - Start conservative with narrower ranges
  - Center ranges around your current parameter values
  - Use step increments (e.g., temperature in 0.1 increments)
Configure Optimization Settings
- Number of Trials: Default 20-30; increase with the number of parameters
  - 2 parameters: 20-30 trials
  - 3-4 parameters: 30-50 trials
  - 5+ parameters: 50-100 trials
- Parallel Threads: 4-8 (balance with API rate limits)
- Local Refinement Ratio: 0.3 (30% of trials)
  - Higher (0.4-0.5): More refinement near best region
  - Lower (0.1-0.2): More broad exploration
Run Optimization
- Click Run to start
- Baseline trial runs with current settings
- Systematic trials test different parameter combinations
- Best parameter combination is highlighted when complete
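To see why quantization and the trial count matter, the sketch below builds the example ranges from the setup steps (Temperature 0.0 to 1.5, Top P 0.5 to 1.0) and counts the resulting combinations. The step sizes and the `quantized_values` helper are illustrative assumptions, not Trusys settings.

```python
from decimal import Decimal


def quantized_values(low: str, high: str, step: str) -> list[float]:
    """List every value from low to high (inclusive) in fixed steps."""
    # Decimal avoids float drift such as 0.1 * 3 == 0.30000000000000004.
    lo, hi, st = Decimal(low), Decimal(high), Decimal(step)
    count = int((hi - lo) / st) + 1
    return [float(lo + i * st) for i in range(count)]


search_space = {
    "temperature": quantized_values("0.0", "1.5", "0.1"),
    "top_p": quantized_values("0.5", "1.0", "0.05"),
}

combinations = 1
for values in search_space.values():
    combinations *= len(values)

print(len(search_space["temperature"]), len(search_space["top_p"]), combinations)
```

With these steps there are 16 temperature values and 11 Top P values, so 176 combinations in total. A run of 20-30 trials therefore samples the space rather than covering it, which is why narrower ranges and coarser steps make each trial count for more.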

How It Works:

A baseline trial establishes your starting performance
The optimizer systematically explores different parameter combinations
Each combination is tested against all prompts using your metric
An average metric score is calculated for each parameter set
The optimizer balances exploration (trying diverse settings) with exploitation (refining promising configurations)
Parameter importance is calculated to show which parameters had the most impact
The best parameter configuration is returned with detailed analysis
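One simple way to picture the balance between exploration and exploitation is a two-phase search: a baseline trial, then broad random sampling, then small perturbations around the best configuration so far. The sketch below is an assumption-laden illustration of that idea, not the optimizer Trusys uses; `optimize`, `toy_objective`, and the 10% perturbation width are all hypothetical.

```python
import random


def optimize(objective, ranges, baseline, num_trials=20,
             local_refinement_ratio=0.3, seed=0):
    """Baseline trial, then random exploration, then local refinement."""
    rng = random.Random(seed)
    num_local = round(num_trials * local_refinement_ratio)
    num_global = num_trials - num_local - 1  # one trial is the baseline

    # Baseline: score the current settings first.
    history = [(objective(baseline), dict(baseline))]

    # Exploration: sample uniformly across each full range.
    for _ in range(num_global):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        history.append((objective(params), params))

    # Exploitation: take small steps around the best configuration so far.
    for _ in range(num_local):
        best_params = max(history, key=lambda entry: entry[0])[1]
        params = {}
        for name, (lo, hi) in ranges.items():
            step = rng.gauss(0.0, 0.1 * (hi - lo))
            params[name] = min(hi, max(lo, best_params[name] + step))
        history.append((objective(params), params))

    return max(history, key=lambda entry: entry[0]), history


def toy_objective(params):
    # Hypothetical metric that peaks at temperature 0.4 and top_p 0.9.
    return 1.0 - (params["temperature"] - 0.4) ** 2 - (params["top_p"] - 0.9) ** 2


ranges = {"temperature": (0.0, 1.5), "top_p": (0.5, 1.0)}
baseline = {"temperature": 1.0, "top_p": 1.0}
(best_score, best_params), history = optimize(toy_objective, ranges, baseline)
print(round(best_score, 3), best_params)
```

In this sketch, raising `local_refinement_ratio` moves trials from the exploration loop to the refinement loop, which mirrors the trade-off described for the Local Refinement Ratio setting.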

Common Parameter Combinations:

For Factual/Accurate Responses:

Temperature: 0.3-0.5
Top P: 0.8-0.9
Frequency Penalty: 0.3-0.5
Max Tokens: 1000-2000

For Creative/Varied Responses:

Temperature: 1.0-1.5
Top P: 0.9-1.0
Frequency Penalty: 0.8-1.2
Max Tokens: 2000-3000

For Concise Responses:

Temperature: 0.5-0.8
Top P: 0.8-0.95
Frequency Penalty: 0.3-0.5
Max Tokens: 500-1000
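If you keep these combinations in code, one option is to store them as search ranges and derive a starting configuration from the midpoints. The dictionary layout, the key names, and the `starting_point` helper below are illustrative assumptions; only the numeric ranges come from the lists above.

```python
# Ranges copied from the three combinations above; key names are illustrative.
PRESETS = {
    "factual": {
        "temperature": (0.3, 0.5),
        "top_p": (0.8, 0.9),
        "frequency_penalty": (0.3, 0.5),
        "max_tokens": (1000, 2000),
    },
    "creative": {
        "temperature": (1.0, 1.5),
        "top_p": (0.9, 1.0),
        "frequency_penalty": (0.8, 1.2),
        "max_tokens": (2000, 3000),
    },
    "concise": {
        "temperature": (0.5, 0.8),
        "top_p": (0.8, 0.95),
        "frequency_penalty": (0.3, 0.5),
        "max_tokens": (500, 1000),
    },
}


def starting_point(preset_name):
    """Return the midpoint of each range as a single starting configuration."""
    config = {}
    for name, (low, high) in PRESETS[preset_name].items():
        midpoint = (low + high) / 2
        # Token limits must be whole numbers; other values are rounded for display.
        config[name] = int(midpoint) if name == "max_tokens" else round(midpoint, 3)
    return config


factual = starting_point("factual")
print(factual)
```

The ranges themselves can serve as the search range for a Parameter Optimization run, with the midpoint as the baseline.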

Run Experiments Tab

The Run Experiments tab displays all your completed optimization experiments with detailed results.

View All Experiments

This section shows a list of all optimization runs. For each experiment, you can see:

Experiment Name - Name of the optimization run
Type - Prompt, Example, or Parameter Optimization
Application - Which LLM model was used
Status - Completed, Running, or Failed
Date Created - When the experiment was run
Number of Trials - How many variations were tested
Best Trial - The trial with the highest metric score

View Trial Details

Click on any experiment to see detailed information about each trial.

Trial Information:

Trial Number - Sequential trial identifier
Configuration - System prompt, examples, or parameter values used
Metric Score - The average score for that trial’s configuration
Status - Pass/Fail status

Response Details:

User Prompt - The test prompt sent to the model
Model Response - The output generated by the model
Metric Score - The evaluation score for that specific response
Reasoning - Explanation of the score

Analyze Results

Compare Configurations:

View side-by-side comparison of different trials
See how metric scores change across variations
Identify which changes had the most impact

Identify Patterns:

Look at which configurations performed best
Understand what characteristics correlate with higher scores
For parameter optimization, see parameter importance rankings
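How Trusys computes parameter importance is not described here. As a rough way to build intuition for what such a ranking expresses, the sketch below ranks parameters by the absolute Pearson correlation between each parameter's value and the trial score. The trial data is made up for illustration, and this measure only captures linear effects.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_parameters(trials):
    """trials: list of (params dict, score). Returns (name, importance), highest first."""
    scores = [score for _, score in trials]
    importance = {
        name: abs(pearson([params[name] for params, _ in trials], scores))
        for name in trials[0][0]
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Made-up trial results, for illustration only.
trials = [
    ({"temperature": 0.2, "top_p": 0.9}, 0.86),
    ({"temperature": 0.6, "top_p": 0.8}, 0.82),
    ({"temperature": 1.0, "top_p": 0.9}, 0.78),
    ({"temperature": 1.4, "top_p": 0.8}, 0.74),
]
ranking = rank_parameters(trials)
print(ranking)
```

In this toy data the score falls steadily as temperature rises, so temperature ranks first, while Top P shows only a weak relationship with the score.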

Export and Share:

Download detailed results for reporting
Share optimization findings with your team
Use results to inform future engineering

Best Practices

Prompt Optimization

Start with a Strong Baseline - Your initial system prompt should be reasonably good
Use Representative Test Cases - Your prompt library should reflect real-world usage
Choose the Right Metric - Select a metric that directly measures success for your task
Review Top Variations - Look at top 3-5 system prompts, not just the #1
Validate on New Data - Test the best system prompt on a fresh dataset

Example Optimization

Use Quality Examples - Your dataset examples should be high-quality and representative
Use Diverse Examples - Include a variety of question types and difficulty levels
Monitor Score Progression - Watch how metric scores change as more examples are added
Test Generalization - Validate optimal examples on other datasets
Balance Example Count - More examples aren’t always better; find the sweet spot

Parameter Optimization

Start Simple - Begin with 2-3 key parameters (temperature, top_p)
Add Gradually - Add more parameters once you understand their impact
Use Domain Knowledge - Set ranges based on your task type
Monitor Convergence - If scores plateau early, you can stop
Check Generalization - Test best parameters on a different dataset
Document Settings - Save optimal parameters in your application config

General Best Practices

Start with One Optimization Type - Master one approach before combining them
Use Consistent Metrics - Use the same metric across multiple optimization runs for comparability
Track Baseline Performance - Always know your starting point to measure improvement
Iterate - Run optimization periodically as your needs change or new data becomes available
Combine Approaches - For maximum improvement, try combining prompt optimization with parameter tuning

Troubleshooting

Optimization Takes Too Long?

Reduce number of trials (start with 15-20)
Use fewer parameters or examples
Reduce dataset/library size
Increase parallel threads (if API limits allow)
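A back-of-the-envelope estimate shows how these levers interact. The sketch assumes one model call per test item per trial, calls spread evenly across threads, and no rate limiting; the 2-second latency is a placeholder, and model-graded metrics would add further calls.

```python
from math import ceil


def estimate_run(num_trials, items_per_trial, parallel_threads, seconds_per_call):
    """Return (total model calls, rough wall-clock seconds) for one experiment."""
    total_calls = num_trials * items_per_trial
    # Calls run in batches of `parallel_threads`; each batch takes one call's latency.
    batches = ceil(total_calls / parallel_threads)
    return total_calls, batches * seconds_per_call


calls, seconds = estimate_run(
    num_trials=20, items_per_trial=50, parallel_threads=4, seconds_per_call=2.0
)
print(calls, seconds)

# Doubling the threads roughly halves the wall-clock time, if rate limits allow.
_, faster = estimate_run(20, 50, 8, 2.0)
print(faster)
```

Under these assumptions, 20 trials over 50 test cases is 1,000 model calls, about 500 seconds on 4 threads and 250 seconds on 8.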

Scores Show Little Improvement?

Your baseline may already be well-optimized
Try wider search ranges
Switch to a different optimization type
Review if the task needs a different approach

Large Score Variations?

Some parameters may conflict
Increase number of trials for stability
Review your dataset/library for consistency
Increase samples tested per trial
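To judge whether a gap between two trials is real or just noise, you can compare it with the standard error of each trial's average score. The sketch below uses made-up per-response scores and a rough rule of thumb (a gap under about two combined standard errors is probably noise); it is not a Trusys feature.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return the mean and the standard error of the mean for one trial."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Made-up per-response metric scores for two trials.
trial_a = [0.9, 0.7, 0.8, 0.6, 1.0, 0.8]
trial_b = [0.8, 0.9, 0.7, 0.9, 0.8, 1.0]

mean_a, se_a = summarize(trial_a)
mean_b, se_b = summarize(trial_b)

gap = abs(mean_a - mean_b)
combined_se = sqrt(se_a ** 2 + se_b ** 2)
likely_noise = gap < 2 * combined_se

print(round(mean_a, 3), round(mean_b, 3), round(combined_se, 3), likely_noise)
```

The standard error shrinks with the square root of the number of samples, which is why testing more samples per trial makes scores more stable.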

Optimization Fails?

Check API connectivity and rate limits
Verify parameter ranges are valid for your model
Confirm metric is calculating correctly

​Experimentation Tab

​1. Prompt Optimization

​2. Example Optimization

​3. Parameter Optimization

​Run Experiments Tab

​View All Experiments

​View Trial Details

​Analyze Results

​Best Practices

​Prompt Optimization

​Example Optimization

​Parameter Optimization

​General Best Practices

​Troubleshooting

Experimentation Tab

1. Prompt Optimization

2. Example Optimization

3. Parameter Optimization

Run Experiments Tab

View All Experiments

View Trial Details

Analyze Results

Best Practices

Prompt Optimization

Example Optimization

Parameter Optimization

General Best Practices

Troubleshooting