- Prompt Optimization - Automatically improve your system prompt while keeping the structure fixed
- Example Optimization - Find the optimal number and combination of few-shot examples to include in your system prompt
- Parameter Optimization - Find the best LLM settings (temperature, top_p, etc.) for your specific task
Experimentation Tab
The Experimentation tab is where you set up and run optimization experiments. Select your optimization type and follow the steps below.
1. Prompt Optimization
Optimize your system prompt to improve performance against your evaluation metric.
Setup Steps:
- Select Optimization Type - Choose “Prompt Optimization”
- Select Application
  - Choose your LLM model (e.g., GPT-4, Claude, or Ollama)
  - This is the model that will be tested with different system prompt variations
- Enter System Prompt
  - Provide your current system prompt
  - This serves as the baseline for optimization
  - Trusys will generate variations of this prompt to test
- Select Prompt Library
  - Choose the prompt library containing your test cases
  - All prompts in the library will be tested against each system prompt variation
  - Ensure your library includes diverse, representative test cases
- Select Metric
  - Choose the evaluation metric to measure performance
  - Options include accuracy, relevance, coherence, and custom metrics
  - The metric score determines which system prompt is best
- Configure Optimization Settings
  - Number of Trials: Default 20-30 (more trials search more thoroughly but take longer)
  - Parallel Threads: 4-8 threads for evaluation
  - Local Refinement: Fraction of trials spent fine-tuning near the best result so far (default 0.3, i.e., 30%)
- Run Optimization
  - Click Run to start the optimization process
  - Trusys creates multiple trials, each with a slightly different system prompt
  - Progress is shown as trials complete
How It Works:
- Each trial tests a unique variation of your system prompt
- All prompts in your prompt library are evaluated with that system prompt variation
- An average metric score is calculated across all prompt-response pairs
- Once all trials complete, the system prompt with the highest average metric score is selected as the optimal version
Example:
- Trial 1: System prompt variation A × 50 test cases = average score of 0.82
- Trial 2: System prompt variation B × 50 test cases = average score of 0.85
- Trial 3: System prompt variation C × 50 test cases = average score of 0.83
- … (continues for all 10 trials)
- Best Result: Trial 2’s system prompt with score 0.85
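The selection rule described above (score every test case under each system prompt variation, average the scores, keep the highest average) can be sketched in a few lines of Python. This is an illustration only, not Trusys code: `pick_best_prompt`, `fake_model`, and `fake_metric` are hypothetical names, and the toy model and metric stand in for a real LLM call and a real evaluation metric.

```python
from statistics import mean
from typing import Callable


def pick_best_prompt(
    variations: dict[str, str],
    test_prompts: list[str],
    run_model: Callable[[str, str], str],
    score_response: Callable[[str, str], float],
) -> tuple[str, float, dict[str, float]]:
    """Return (best variation name, its average score, all averages)."""
    averages = {}
    for name, system_prompt in variations.items():
        # One "trial": every test prompt is run with this system prompt.
        scores = [
            score_response(user_prompt, run_model(system_prompt, user_prompt))
            for user_prompt in test_prompts
        ]
        averages[name] = mean(scores)
    best = max(averages, key=averages.get)
    return best, averages[best], averages


# Hypothetical stand-ins for a real model call and a real metric.
def fake_model(system_prompt: str, user_prompt: str) -> str:
    return f"{system_prompt} | {user_prompt}"


def fake_metric(user_prompt: str, response: str) -> float:
    return 1.0 if "concise" in response else 0.5


variations = {
    "A": "You are helpful.",
    "B": "You are helpful and concise.",
}
best, best_avg, averages = pick_best_prompt(
    variations, ["q1", "q2"], fake_model, fake_metric
)
print(best, best_avg)
```

Returning all averages alongside the winner makes it easy to inspect the runners-up rather than only the top variation.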
2. Example Optimization
Find the optimal number and selection of few-shot examples to include in your system prompt.
Setup Steps:
- Select Optimization Type - Choose “Example Optimization”
- Select Application
  - Choose your LLM model
  - This model will evaluate prompts with different few-shot examples
- Enter System Prompt
  - Provide your base system prompt (without examples)
  - Examples will be dynamically added to this prompt during optimization
- Select Dataset
  - Choose a dataset containing question-answer pairs
  - This dataset provides both the examples to include and the test prompts
  - The dataset is split: some entries become few-shot examples, others become test cases
- Select Metric
  - Choose the evaluation metric (accuracy, relevance, etc.)
  - The metric determines which example configuration is best
- Configure Optimization Settings
  - Number of Trials: Default 10-15
  - Min Examples: Minimum few-shot examples to include (default: 2)
  - Max Examples: Maximum few-shot examples to include (default: 8)
  - Parallel Threads: Number of parallel evaluation threads
- Run Optimization
  - Click Run to start the optimization
  - Trusys creates trials with different numbers and combinations of examples
How It Works:
- Each trial selects a specific subset of examples from your dataset to use as few-shot examples
- The remaining dataset items are used as test prompts
- The system prompt + selected examples are sent with each test prompt
- The model’s responses to the test prompts are scored using your metric
- An average metric score is calculated for that example configuration
- Once all trials complete, the example configuration with the highest average metric score is selected
Example:
- Trial 1: Add 2 examples → test remaining 98 items → average score 0.79
- Trial 2: Add 4 examples → test remaining 96 items → average score 0.83
- Trial 3: Add 6 examples → test remaining 94 items → average score 0.85
- Trial 4: Add 8 examples → test remaining 92 items → average score 0.82
- … (more trials testing different combinations)
- Best Result: Trial 3 with 6 examples and score 0.85
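To make the dataset split concrete, here is a minimal sketch of how a single trial could divide a 100-item dataset into few-shot examples and held-out test items, then append the examples to the base system prompt. The function names, the `question`/`answer` keys, and the "Q:/A:" layout are assumptions made for illustration; how Trusys actually selects and formats examples is not described here.

```python
import random


def build_trial(dataset, num_examples, seed):
    """Split the dataset into few-shot examples and held-out test items."""
    rng = random.Random(seed)  # seeded so a trial can be reproduced
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    chosen = set(indices[:num_examples])
    examples = [dataset[i] for i in sorted(chosen)]
    test_items = [item for i, item in enumerate(dataset) if i not in chosen]
    return examples, test_items


def render_prompt(base_prompt, examples):
    """Append the selected examples to the base system prompt."""
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    return base_prompt + "\n\nExamples:\n" + "\n\n".join(blocks)


# Toy dataset of 100 question-answer pairs.
dataset = [
    {"question": f"What is {i} + {i}?", "answer": str(2 * i)}
    for i in range(100)
]

examples, test_items = build_trial(dataset, num_examples=6, seed=0)
prompt = render_prompt("Answer arithmetic questions.", examples)
print(len(examples), len(test_items))  # 6 94
```

Keeping the few-shot examples out of the test items matters: scoring the model on questions whose answers already appear in its prompt would inflate the metric.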
3. Parameter Optimization
Automatically find the best LLM settings (temperature, top_p, etc.) for your specific task.
Setup Steps:
- Select Optimization Type - Choose “Parameter Optimization”
- Select Application
  - Choose your LLM model (OpenAI, Anthropic, Google, Azure, Ollama, etc.)
  - This is the model whose parameters will be optimized
- Select Prompt Library or Dataset
  - Prompt Library: Use your predefined test cases with assertions
  - Dataset: Use a dataset with question-answer pairs
  - All items will be tested with each parameter combination
- Select Metric
  - Choose how to measure performance (accuracy, relevance, coherence, etc.)
  - The metric score determines which parameter settings are best
- Define Parameters to Optimize
  - Select which LLM parameters to tune from available options:
  - Temperature - Controls randomness in responses
    - Low (0.0-0.5): More focused, deterministic (good for factual tasks)
    - High (0.8-2.0): More creative, varied (good for creative tasks)
  - Top P - Controls diversity through probability mass
    - Low (0.1-0.5): More focused, predictable
    - High (0.8-1.0): More diverse, creative
  - Frequency Penalty - Penalizes repetitive content
    - Negative: Encourages repetition
    - Positive: Reduces repetition
  - Presence Penalty - Penalizes token reuse
    - Negative: Encourages reusing tokens
    - Positive: Encourages new tokens
  - Max Tokens - Controls response length
    - Lower: Shorter, more concise
    - Higher: Longer, more detailed
- Set Parameter Ranges
  - For each parameter, define the search range
  - Example: Temperature 0.0 to 1.5, Top P 0.5 to 1.0
  - Use quantization (step sizes) to reduce the search space
  - Start conservatively with narrower ranges
  - Center ranges around your current parameter values
  - Use step increments (e.g., temperature in 0.1 increments)
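The effect of quantization on the search space is easy to work out by hand. The sketch below uses the example ranges above (Temperature 0.0 to 1.5, Top P 0.5 to 1.0) with an assumed step of 0.1 for both; `quantized_values` is a hypothetical helper, not part of Trusys.

```python
from decimal import Decimal


def quantized_values(low, high, step):
    """Inclusive list of values from low to high in increments of step.

    Decimal arithmetic avoids float drift such as 0.1 * 3 = 0.30000000000000004.
    """
    low_d, high_d, step_d = Decimal(str(low)), Decimal(str(high)), Decimal(str(step))
    count = int((high_d - low_d) / step_d) + 1
    return [float(low_d + i * step_d) for i in range(count)]


search_space = {
    "temperature": quantized_values(0.0, 1.5, 0.1),  # 16 values
    "top_p": quantized_values(0.5, 1.0, 0.1),        # 6 values
}

combinations = 1
for values in search_space.values():
    combinations *= len(values)

print(combinations)  # 16 * 6 = 96 distinct settings
```

With a 0.1 step the two-parameter space has 96 distinct settings; halving the step to 0.05 would give 31 × 11 = 341, which is why coarser steps and narrower ranges make a fixed trial budget go further.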
- Configure Optimization Settings
  - Number of Trials: Default 20-30; increase with the number of parameters being tuned
    - 2 parameters: 20-30 trials
    - 3-4 parameters: 30-50 trials
    - 5+ parameters: 50-100 trials
  - Parallel Threads: 4-8 (balance with API rate limits)
  - Local Refinement Ratio: 0.3 (30% of trials)
    - Higher (0.4-0.5): More refinement near the best region
    - Lower (0.1-0.2): Broader exploration
- Run Optimization
  - Click Run to start
  - Baseline trial runs with current settings
  - Systematic trials test different parameter combinations
  - Best parameter combination is highlighted when complete
How It Works:
- A baseline trial establishes your starting performance
- The optimizer systematically explores different parameter combinations
- Each combination is tested against all prompts using your metric
- An average metric score is calculated for each parameter set
- The optimizer balances exploration (trying diverse settings) with exploitation (refining promising configurations)
- Parameter importance is calculated to show which parameters had the most impact
- The best parameter configuration is returned with detailed analysis
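The balance between exploration and exploitation can be illustrated with a deliberately simple two-phase search: a baseline trial, then random sampling across the whole space, then small steps around the best result so far. This is one possible scheme written under stated assumptions, not the algorithm Trusys uses (which is not specified here). The `optimize` function, its `refine_ratio` argument (modeled on the Local Refinement Ratio setting), and `toy_objective` are all hypothetical, and parameter importance is not computed.

```python
import random


def nearest_index(values, target):
    """Index of the grid value closest to target."""
    return min(range(len(values)), key=lambda k: abs(values[k] - target))


def optimize(objective, space, baseline, n_trials=30, refine_ratio=0.3, seed=0):
    rng = random.Random(seed)
    # Baseline trial: establishes the starting performance.
    history = [(dict(baseline), objective(baseline))]
    n_refine = round(n_trials * refine_ratio)
    n_explore = n_trials - n_refine

    # Exploration: sample settings from across the whole search space.
    for _ in range(n_explore):
        params = {name: rng.choice(values) for name, values in space.items()}
        history.append((params, objective(params)))

    # Exploitation: move at most one grid step away from the best setting so far.
    for _ in range(n_refine):
        best_so_far = max(history, key=lambda t: t[1])[0]
        params = {}
        for name, values in space.items():
            i = nearest_index(values, best_so_far[name])
            j = min(max(i + rng.choice([-1, 0, 1]), 0), len(values) - 1)
            params[name] = values[j]
        history.append((params, objective(params)))

    best_params, best_score = max(history, key=lambda t: t[1])
    return best_params, best_score, history


# Toy objective standing in for "average metric score over all test prompts".
def toy_objective(params):
    return 1.0 - abs(params["temperature"] - 0.3) - abs(params["top_p"] - 0.9)


space = {
    "temperature": [round(0.1 * i, 1) for i in range(16)],   # 0.0 .. 1.5
    "top_p": [round(0.5 + 0.1 * i, 1) for i in range(6)],    # 0.5 .. 1.0
}
baseline = {"temperature": 1.0, "top_p": 1.0}

best_params, best_score, history = optimize(toy_objective, space, baseline)
print(best_params, round(best_score, 2), len(history))
```

Because the baseline is recorded as the first entry in the history, the returned best score can never be worse than the starting point, which mirrors the role of the baseline trial described above.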
Run Experiments Tab
The Run Experiments tab displays all your completed optimization experiments with detailed results.
View All Experiments
This section shows a list of all optimization runs. For each experiment, you can see:
- Experiment Name - Name of the optimization run
- Type - Prompt, Example, or Parameter Optimization
- Application - Which LLM model was used
- Status - Completed, Running, or Failed
- Date Created - When the experiment was run
- Number of Trials - How many variations were tested
- Best Trial - The trial with the highest metric score
View Trial Details
Click on any experiment to see detailed information about each trial:
Trial Information:
- Trial Number - Sequential trial identifier
- Configuration - System prompt, examples, or parameter values used
- Metric Score - The average score for that trial’s configuration
- Status - Pass/Fail status
Prompt-Level Details:
- User Prompt - The test prompt sent to the model
- Model Response - The output generated by the model
- Metric Score - The evaluation score for that specific response
- Reasoning - Explanation of the score
Analyze Results
Compare Configurations:
- View side-by-side comparison of different trials
- See how metric scores change across variations
- Identify which changes had the most impact
Identify Patterns:
- Look at which configurations performed best
- Understand what characteristics correlate with higher scores
- For parameter optimization, see parameter importance rankings
Export and Share:
- Download detailed results for reporting
- Share optimization findings with your team
- Use results to inform future engineering
Best Practices
Prompt Optimization
- Start with a Strong Baseline - Your initial system prompt should be reasonably good
- Use Representative Test Cases - Your prompt library should reflect real-world usage
- Choose the Right Metric - Select a metric that directly measures success for your task
- Review Top Variations - Look at the top 3-5 system prompts, not just the best one
- Validate on New Data - Test the best system prompt on a fresh dataset
Example Optimization
- Use Quality Examples - Your dataset examples should be high-quality and representative
- Use Diverse Examples - Include a variety of question types and difficulty levels
- Monitor Score Progression - Watch how metric scores change as more examples are added
- Test Generalization - Validate optimal examples on other datasets
- Balance Example Count - More examples aren’t always better; find the sweet spot
Parameter Optimization
- Start Simple - Begin with 2-3 key parameters (temperature, top_p)
- Add Gradually - Add more parameters once you understand their impact
- Use Domain Knowledge - Set ranges based on your task type
- Monitor Convergence - If scores plateau early, you can stop
- Check Generalization - Test best parameters on a different dataset
- Document Settings - Save optimal parameters in your application config
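One way to make "Monitor Convergence" concrete is a plateau check over the per-trial scores. The sketch below is an assumption-laden illustration: `has_plateaued`, `patience`, and `min_delta` are hypothetical names and thresholds, not Trusys settings, and the scores are invented sample data.

```python
def has_plateaued(scores, patience=5, min_delta=0.005):
    """True when the last `patience` trials did not beat the earlier best
    score by at least `min_delta`."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_delta


# Invented per-trial average scores, in the order the trials completed.
scores = [0.70, 0.78, 0.82, 0.85, 0.84, 0.85, 0.83, 0.85, 0.84]

print(has_plateaued(scores[:6]))  # False: scores were still improving
print(has_plateaued(scores))      # True: no gain in the last 5 trials
```

A larger `patience` makes the check more conservative, which is useful when scores are noisy from trial to trial.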
General Best Practices
- Start with One Optimization Type - Master one approach before combining them
- Use Consistent Metrics - Use the same metric across multiple optimization runs for comparability
- Track Baseline Performance - Always know your starting point to measure improvement
- Iterate - Run optimization periodically as your needs change or new data becomes available
- Combine Approaches - For maximum improvement, try combining prompt optimization with parameter tuning
Troubleshooting
Optimization Takes Too Long?
- Reduce number of trials (start with 15-20)
- Use fewer parameters or examples
- Reduce dataset/library size
- Increase parallel threads (if API limits allow)
No Improvement Found?
- Your baseline may already be well-optimized
- Try wider search ranges
- Switch to a different optimization type
- Review if the task needs a different approach
Inconsistent Results?
- Some parameters may conflict
- Increase number of trials for stability
- Review your dataset/library for consistency
- Increase samples tested per trial
Trials Failing?
- Check API connectivity and rate limits
- Verify parameter ranges are valid for your model
- Confirm metric is calculating correctly
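Checking parameter ranges against what the model accepts can be done before an experiment is started. The following is a small sketch, assuming you supply the allowed bounds yourself; `find_invalid_ranges` is a hypothetical helper, and the bounds shown are placeholders that should be replaced with the limits from your provider's documentation, since they differ between models.

```python
def find_invalid_ranges(search_ranges, allowed_bounds):
    """Return a list of human-readable problems; an empty list means all ranges fit."""
    problems = []
    for name, (low, high) in search_ranges.items():
        if name not in allowed_bounds:
            problems.append(f"{name}: no known bounds for this model")
            continue
        min_allowed, max_allowed = allowed_bounds[name]
        if low > high:
            problems.append(f"{name}: lower bound {low} exceeds upper bound {high}")
        if low < min_allowed or high > max_allowed:
            problems.append(
                f"{name}: range [{low}, {high}] falls outside "
                f"allowed [{min_allowed}, {max_allowed}]"
            )
    return problems


# Placeholder bounds for an imaginary model; confirm real limits with your provider.
allowed = {"temperature": (0.0, 1.0), "top_p": (0.0, 1.0)}
ranges = {"temperature": (0.0, 1.5), "top_p": (0.5, 1.0)}

problems = find_invalid_ranges(ranges, allowed)
for problem in problems:
    print(problem)
```

Here the temperature range is flagged because its upper end exceeds the placeholder limit, while the Top P range passes.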