The Optimization Playground is a comprehensive tool within Trusys for automatically improving your AI application’s performance. Instead of manually tweaking prompts or parameters, the Optimization Playground systematically tests different configurations and finds the settings that perform best against your chosen evaluation metrics.
The Optimization Playground offers three optimization approaches:
  1. Prompt Optimization - Automatically improve your system prompt while keeping the structure fixed
  2. Example Optimization - Find the optimal number and combination of few-shot examples to include in your system prompt
  3. Parameter Optimization - Find the best LLM settings (temperature, top_p, etc.) for your specific task
Each approach systematically tests different variations and measures their performance using your chosen metric.

Experimentation Tab

The Experimentation tab is where you set up and run optimization experiments. Select your optimization type and follow the steps below.

1. Prompt Optimization

Optimize your system prompt to improve performance against your evaluation metric.
Setup Steps:
  1. Select Optimization Type - Choose “Prompt Optimization”
  2. Select Application
    • Choose your LLM model (e.g., GPT-4, Claude, Ollama, etc.)
    • This is the model that will be tested with different system prompt variations
  3. Enter System Prompt
    • Provide your current system prompt
    • This serves as the baseline for optimization
    • Trusys will generate variations of this prompt to test
  4. Select Prompt Library
    • Choose the prompt library containing your test cases
    • All prompts in the library will be tested against each system prompt variation
    • Ensure your library includes diverse, representative test cases
  5. Select Metric
    • Choose the evaluation metric to measure performance
    • Options include accuracy, relevance, coherence, and custom metrics
    • The metric score determines which system prompt is best
  6. Configure Optimization Settings
    • Number of Trials: Default 20-30 (higher = more thorough, longer time)
    • Parallel Threads: 4-8 threads for evaluation
    • Local Refinement: Fraction of trials spent fine-tuning around the best results found so far (default 0.3, i.e. 30% of trials)
  7. Run Optimization
    • Click Run to start the optimization process
    • Trusys creates multiple trials, each with a slightly different system prompt
    • Progress is shown as trials complete
How It Works:
  • Each trial tests a unique variation of your system prompt
  • All prompts in your prompt library are evaluated with that system prompt variation
  • An average metric score is calculated across all prompt-response pairs
  • Once all trials complete, the system prompt with the highest average metric score is selected as the optimal version
Example Workflow: If you run 10 trials with a prompt library of 50 test cases:
  • Trial 1: System prompt variation A × 50 test cases = average score of 0.82
  • Trial 2: System prompt variation B × 50 test cases = average score of 0.85
  • Trial 3: System prompt variation C × 50 test cases = average score of 0.83
  • … (continues for all 10 trials)
  • Best Result: Trial 2’s system prompt with score 0.85
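The selection rule described above (average the metric over every test case, then keep the highest-scoring variation) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Trusys code: the function names, the variation names, and the per-test-case scores are all made up for the example.

```python
from statistics import mean

def average_score(scores):
    """Average metric score across all prompt-response pairs in one trial."""
    return mean(scores)

def select_best_trial(trials):
    """Return (name, average) for the trial with the highest average score.

    `trials` maps a hypothetical variation name to its per-test-case scores.
    """
    averages = {name: average_score(scores) for name, scores in trials.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Illustrative scores only; a real run would have one score per library prompt.
trials = {
    "variation_A": [0.80, 0.84, 0.82],
    "variation_B": [0.85, 0.90, 0.80],
    "variation_C": [0.83, 0.83, 0.83],
}
best_name, best_avg = select_best_trial(trials)
print(best_name, round(best_avg, 2))
```

Because only the average is compared, two variations with the same average but very different per-case scores rank equally, which is one reason to review individual responses as well.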

2. Example Optimization

Find the optimal number and selection of few-shot examples to include in your system prompt.
Setup Steps:
  1. Select Optimization Type - Choose “Example Optimization”
  2. Select Application
    • Choose your LLM model
    • This model will evaluate prompts with different few-shot examples
  3. Enter System Prompt
    • Provide your base system prompt (without examples)
    • Examples will be dynamically added to this prompt during optimization
  4. Select Dataset
    • Choose a dataset containing question-answer pairs
    • This dataset provides both the examples to include and the test prompts
    • The dataset is split: some entries become few-shot examples, others become test cases
  5. Select Metric
    • Choose the evaluation metric (accuracy, relevance, etc.)
    • The metric determines which example configuration is best
  6. Configure Optimization Settings
    • Number of Trials: Default 10-15
    • Min Examples: Minimum few-shot examples to include (default: 2)
    • Max Examples: Maximum few-shot examples to include (default: 8)
    • Parallel Threads: Number of parallel evaluation threads
  7. Run Optimization
    • Click Run to start the optimization
    • Trusys creates trials with different numbers and combinations of examples
How It Works:
  • Each trial selects a specific subset of examples from your dataset to use as few-shot examples
  • The remaining dataset items are used as test prompts
  • The system prompt + selected examples are sent with each test prompt
  • The model’s response to each test prompt is scored using your metric
  • An average metric score is calculated for that example configuration
  • Once all trials complete, the example configuration with the highest average metric score is selected
Example Workflow: If you run 8 trials with a dataset of 100 items:
  • Trial 1: Add 2 examples → test remaining 98 items → average score 0.79
  • Trial 2: Add 4 examples → test remaining 96 items → average score 0.83
  • Trial 3: Add 6 examples → test remaining 94 items → average score 0.85
  • Trial 4: Add 8 examples → test remaining 92 items → average score 0.82
  • … (more trials testing different combinations)
  • Best Result: Trial 3 with 6 examples and score 0.85
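The dataset split behind each trial can be sketched as below. This is a minimal illustration under stated assumptions: random selection, the `Q:`/`A:` layout, and the helper names `split_dataset` and `build_prompt` are hypothetical, and Trusys may choose and format examples differently.

```python
import random

def split_dataset(dataset, num_examples, seed=0):
    """Split question-answer pairs into few-shot examples and test cases."""
    rng = random.Random(seed)
    shuffled = dataset[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:num_examples], shuffled[num_examples:]

def build_prompt(system_prompt, examples):
    """Append the selected few-shot examples to the base system prompt."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{system_prompt}\n\nExamples:\n{shots}"

# Placeholder dataset of 100 question-answer pairs.
dataset = [(f"question {i}", f"answer {i}") for i in range(100)]
examples, test_cases = split_dataset(dataset, num_examples=6)
prompt = build_prompt("You are a helpful assistant.", examples)
print(len(examples), len(test_cases))
```

Keeping the examples and the test cases disjoint matters: an item that appears in the prompt and is also used for scoring would inflate that trial's average.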

3. Parameter Optimization

Automatically find the best LLM settings (temperature, top_p, etc.) for your specific task.
Setup Steps:
  1. Select Optimization Type - Choose “Parameter Optimization”
  2. Select Application
    • Choose your LLM model (OpenAI, Anthropic, Google, Azure, Ollama, etc.)
    • This is the model whose parameters will be optimized
  3. Select Prompt Library or Dataset
    • Prompt Library: Use your predefined test cases with assertions
    • Dataset: Use a dataset with question-answer pairs
    • All items will be tested with each parameter combination
  4. Select Metric
    • Choose how to measure performance (accuracy, relevance, coherence, etc.)
    • The metric score determines which parameter settings are best
  5. Define Parameters to Optimize
    • Select which LLM parameters to tune from available options:
    Temperature (0.0 to 2.0)
    • Controls randomness in responses
    • Low (0.0-0.5): More focused, deterministic (good for factual tasks)
    • High (0.8-2.0): More creative, varied (good for creative tasks)
    Top P (0.0 to 1.0)
    • Controls diversity through probability mass
    • Low (0.1-0.5): More focused, predictable
    • High (0.8-1.0): More diverse, creative
    Frequency Penalty (-2.0 to 2.0)
    • Penalizes tokens in proportion to how often they have already appeared
    • Negative: Encourages repetition
    • Positive: Reduces repetition
    Presence Penalty (-2.0 to 2.0)
    • Penalizes any token that has already appeared, regardless of how often
    • Negative: Encourages reusing tokens
    • Positive: Encourages new tokens
    Max Tokens (1 to 4000+)
    • Controls response length
    • Lower: Shorter, more concise
    • Higher: Longer, more detailed
  6. Set Parameter Ranges
    • For each parameter, define the search range
    • Example: Temperature 0.0 to 1.5, Top P 0.5 to 1.0
    • Use quantization (step sizes) to reduce search space
    Tips:
    • Start conservative with narrower ranges
    • Center ranges around your current parameter values
    • Use step increments (e.g., temperature in 0.1 increments)
  7. Configure Optimization Settings
    • Number of Trials: Default 20-30; increase with the number of parameters being tuned
      • 2 parameters: 20-30 trials
      • 3-4 parameters: 30-50 trials
      • 5+ parameters: 50-100 trials
    • Parallel Threads: 4-8 (balance with API rate limits)
    • Local Refinement Ratio: 0.3 (30% of trials)
      • Higher (0.4-0.5): More refinement near best region
      • Lower (0.1-0.2): More broad exploration
  8. Run Optimization
    • Click Run to start
    • Baseline trial runs with current settings
    • Systematic trials test different parameter combinations
    • Best parameter combination is highlighted when complete
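To see why quantization in step 6 reduces the search space, it helps to count the combinations a given set of ranges and step sizes produces. The sketch below uses the example ranges from step 6; the 0.05 step for Top P and the helper name `quantized_values` are assumptions made for illustration.

```python
from itertools import product

def quantized_values(low, high, step):
    """Values from low to high (inclusive) in increments of step."""
    count = int(round((high - low) / step)) + 1
    # Round each value to suppress floating-point noise such as 0.30000000000000004.
    return [round(low + i * step, 10) for i in range(count)]

search_space = {
    "temperature": quantized_values(0.0, 1.5, 0.1),
    "top_p": quantized_values(0.5, 1.0, 0.05),   # step size is an assumed value
}
combinations = list(product(*search_space.values()))
print(len(search_space["temperature"]), len(search_space["top_p"]), len(combinations))
```

Comparing the number of combinations with your trial budget shows how much of the grid a run can actually cover, which is why narrower ranges or coarser steps are suggested as a starting point.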
How It Works:
  1. A baseline trial establishes your starting performance
  2. The optimizer systematically explores different parameter combinations
  3. Each combination is tested against all prompts using your metric
  4. An average metric score is calculated for each parameter set
  5. The optimizer balances exploration (trying diverse settings) with exploitation (refining promising configurations)
  6. Parameter importance is calculated to show which parameters had the most impact
  7. The best parameter configuration is returned with detailed analysis
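The balance between exploration and refinement can be illustrated with a deliberately simple search: one baseline trial, uniform random sampling for the exploration share of the budget, and small perturbations around the best result for the refinement share. This is not the Trusys optimizer, whose algorithm is not described here. `toy_score` stands in for the average metric score of a real evaluation, and every name and constant below is hypothetical.

```python
import random

def toy_score(params):
    """Stand-in for an average metric score; peaks at temperature 0.4, top_p 0.9."""
    return 1.0 - abs(params["temperature"] - 0.4) - abs(params["top_p"] - 0.9)

def clamp(value, low, high):
    return max(low, min(high, value))

def optimize(ranges, baseline, num_trials=30, refinement_ratio=0.3, seed=0):
    rng = random.Random(seed)
    refine_trials = round(num_trials * refinement_ratio)
    explore_trials = num_trials - refine_trials - 1   # one trial is the baseline

    # Baseline trial: establishes the starting performance.
    best_params, best_score = dict(baseline), toy_score(baseline)

    # Exploration: sample uniformly across the full ranges.
    for _ in range(explore_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
        score = toy_score(params)
        if score > best_score:
            best_params, best_score = params, score

    # Local refinement: perturb the best configuration by up to 5% of each range.
    for _ in range(refine_trials):
        params = {
            k: clamp(best_params[k] + rng.uniform(-0.05, 0.05) * (hi - lo), lo, hi)
            for k, (lo, hi) in ranges.items()
        }
        score = toy_score(params)
        if score > best_score:
            best_params, best_score = params, score

    return best_params, best_score

ranges = {"temperature": (0.0, 1.5), "top_p": (0.5, 1.0)}
baseline = {"temperature": 1.0, "top_p": 1.0}
best_params, best_score = optimize(ranges, baseline)
print(best_params, best_score)
```

With the default 30 trials and a ratio of 0.3, this sketch spends 9 trials on refinement and 20 on exploration after the baseline. Raising the ratio shifts budget toward the neighborhood of the current best, at the risk of missing a better region elsewhere.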
Common Parameter Combinations:
For Factual/Accurate Responses:
Temperature: 0.3-0.5
Top P: 0.8-0.9
Frequency Penalty: 0.3-0.5
Max Tokens: 1000-2000
For Creative/Varied Responses:
Temperature: 1.0-1.5
Top P: 0.9-1.0
Frequency Penalty: 0.8-1.2
Max Tokens: 2000-3000
For Concise Responses:
Temperature: 0.5-0.8
Top P: 0.8-0.95
Frequency Penalty: 0.3-0.5
Max Tokens: 500-1000
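Before using one of these combinations as a search range, it can be worth checking that it stays inside the limits listed under "Define Parameters to Optimize". The sketch below encodes both sets of numbers from this guide; the preset keys, the snake_case parameter names, and the `invalid_parameters` helper are hypothetical, and a specific model may accept narrower ranges than those shown.

```python
# Valid ranges as listed in this guide (an upper bound of None means no fixed limit).
VALID_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "max_tokens": (1, None),
}

# Suggested starting ranges from this guide, keyed by hypothetical preset names.
PRESETS = {
    "factual": {"temperature": (0.3, 0.5), "top_p": (0.8, 0.9),
                "frequency_penalty": (0.3, 0.5), "max_tokens": (1000, 2000)},
    "creative": {"temperature": (1.0, 1.5), "top_p": (0.9, 1.0),
                 "frequency_penalty": (0.8, 1.2), "max_tokens": (2000, 3000)},
    "concise": {"temperature": (0.5, 0.8), "top_p": (0.8, 0.95),
                "frequency_penalty": (0.3, 0.5), "max_tokens": (500, 1000)},
}

def invalid_parameters(search_ranges):
    """Return names of parameters whose search range falls outside VALID_RANGES."""
    invalid = []
    for name, (low, high) in search_ranges.items():
        valid_low, valid_high = VALID_RANGES[name]
        too_low = low < valid_low
        too_high = valid_high is not None and high > valid_high
        if too_low or too_high or low > high:
            invalid.append(name)
    return invalid

for preset_name, search_ranges in PRESETS.items():
    print(preset_name, invalid_parameters(search_ranges))
```

A check like this catches a mistyped bound (for example a Top P above 1.0) before any trials are spent on it.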

Run Experiments Tab

The Run Experiments tab displays all your optimization experiments, including running and failed ones, with detailed results.

View All Experiments

This section shows a list of all optimization runs. For each experiment, you can see:
  • Experiment Name - Name of the optimization run
  • Type - Prompt, Example, or Parameter Optimization
  • Application - Which LLM model was used
  • Status - Completed, Running, or Failed
  • Date Created - When the experiment was run
  • Number of Trials - How many variations were tested
  • Best Trial - The trial with the highest metric score

View Trial Details

Click on any experiment to see detailed information about each trial:
Trial Information:
  • Trial Number - Sequential trial identifier
  • Configuration - System prompt, examples, or parameter values used
  • Metric Score - The average score for that trial’s configuration
  • Status - Pass/Fail status
Response Details:
  • User Prompt - The test prompt sent to the model
  • Model Response - The output generated by the model
  • Metric Score - The evaluation score for that specific response
  • Reasoning - Explanation of the score

Analyze Results

Compare Configurations:
  • View side-by-side comparison of different trials
  • See how metric scores change across variations
  • Identify which changes had the most impact
Identify Patterns:
  • Look at which configurations performed best
  • Understand what characteristics correlate with higher scores
  • For parameter optimization, see parameter importance rankings
Export and Share:
  • Download detailed results for reporting
  • Share optimization findings with your team
  • Use results to inform future engineering

Best Practices

Prompt Optimization

  1. Start with a Strong Baseline - Your initial system prompt should be reasonably good
  2. Use Representative Test Cases - Your prompt library should reflect real-world usage
  3. Choose the Right Metric - Select a metric that directly measures success for your task
  4. Review Top Variations - Look at the top 3-5 system prompts, not just the best one
  5. Validate on New Data - Test the best system prompt on a fresh dataset

Example Optimization

  1. Use Quality Examples - Your dataset examples should be high-quality and representative
  2. Use Diverse Examples - Include a variety of question types and difficulty levels
  3. Monitor Score Progression - Watch how metric scores change as more examples are added
  4. Test Generalization - Validate optimal examples on other datasets
  5. Balance Example Count - More examples aren’t always better; find the sweet spot

Parameter Optimization

  1. Start Simple - Begin with 2-3 key parameters (temperature, top_p)
  2. Add Gradually - Add more parameters once you understand their impact
  3. Use Domain Knowledge - Set ranges based on your task type
  4. Monitor Convergence - If scores plateau early, you can stop
  5. Check Generalization - Test best parameters on a different dataset
  6. Document Settings - Save optimal parameters in your application config

General Best Practices

  1. Start with One Optimization Type - Master one approach before combining them
  2. Use Consistent Metrics - Use the same metric across multiple optimization runs for comparability
  3. Track Baseline Performance - Always know your starting point to measure improvement
  4. Iterate - Run optimization periodically as your needs change or new data becomes available
  5. Combine Approaches - For maximum improvement, try combining prompt optimization with parameter tuning

Troubleshooting

Optimization Takes Too Long?
  • Reduce number of trials (start with 15-20)
  • Use fewer parameters or examples
  • Reduce dataset/library size
  • Increase parallel threads (if API limits allow)
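A rough estimate shows how each of these levers affects run time. The sketch assumes one model call per test case per trial, calls that all take about the same time, and perfectly parallel threads; it ignores the time spent computing the metric. The function name and the 2-second latency are hypothetical.

```python
import math

def estimate_run(num_trials, num_test_cases, parallel_threads, seconds_per_call):
    """Return (total model calls, approximate wall-clock seconds)."""
    total_calls = num_trials * num_test_cases
    # Calls are processed in batches of `parallel_threads` at a time.
    batches = math.ceil(total_calls / parallel_threads)
    return total_calls, batches * seconds_per_call

# 30 trials x 50 test cases on 4 threads, assuming 2 seconds per call.
calls, seconds = estimate_run(30, 50, 4, 2.0)
print(calls, seconds)

# Fewer trials and more threads, same library and latency.
fewer_calls, fewer_seconds = estimate_run(20, 50, 8, 2.0)
print(fewer_calls, fewer_seconds)
```

Under these assumptions, cutting trials from 30 to 20 and doubling the threads reduces the estimate from 750 to 250 seconds, though real gains depend on API rate limits.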
Scores Show Little Improvement?
  • Your baseline may already be well-optimized
  • Try wider search ranges
  • Switch to a different optimization type
  • Review whether the task needs a different approach
Large Score Variations?
  • Some parameters may conflict
  • Increase number of trials for stability
  • Review your dataset/library for consistency
  • Increase samples tested per trial
Optimization Fails?
  • Check API connectivity and rate limits
  • Verify parameter ranges are valid for your model
  • Confirm metric is calculating correctly