Single-Turn Attack Strategies
Base64 Obfuscation Testing
What It Is Base64 obfuscation testing evaluates how AI systems handle encoded content that may circumvent security filters. This strategy leverages the fact that many AI models understand Base64 encoding from training data, but safety mechanisms may not adequately process encoded inputs. How It Works The testing process involves:- Converting original test content into Base64 format using standard encoding (A-Z, a-z, 0-9, +, /)
- Replacing plain text with encoded strings in test scenarios
- Analyzing whether the model decodes and processes the hidden content
- Measuring response differences between encoded and plain text inputs
- Identifies filter bypass vulnerabilities through content obfuscation
- Tests model behavior with transformed malicious inputs
- Reveals potential security gaps in content screening systems
- Helps strengthen defenses against encoding-based attacks
Standard Testing Baseline
What It Is The baseline testing strategy determines whether original, unmodified test cases are included alongside strategy-enhanced variants. This provides a control group for comparing attack effectiveness. How It Works By default, Trusys will:- Generate initial test cases from enabled vulnerability plugins
- Apply transformation strategies to create additional test variants
- Include both original and modified test cases in evaluation suites
- Strategy Isolation: Disable to focus solely on transformation effectiveness
- Volume Management: Reduce total test count for resource-constrained environments
- Targeted Assessment: Concentrate on specific vulnerability types rather than general testing
Multi-Attempt Variation Testing (Best-of-N)
What It Is Multi-attempt variation testing implements repeated sampling with input modifications to increase attack success probability. This black-box approach exploits the stochastic nature of AI model outputs. How It Works The three-phase process includes:- Variation Generation: Creates multiple input versions using modality-specific changes:
- Text: Capitalization randomization, character substitution, noise injection
- Visual: Font modifications, color adjustments, positioning changes
- Audio: Speed, pitch, volume, and background alterations
- Parallel Execution: Tests multiple variations simultaneously against target systems
- Success Monitoring: Tracks responses until harmful output detected or attempt limit reached
- Simple implementation without requiring model internals access
- Multi-modal compatibility across text, vision, and audio
- High parallelization potential for efficient testing
- Predictable success rate scaling patterns
Academic Authority Exploitation
What It Is Academic authority exploitation tests whether AI systems exhibit differential responses when harmful requests are framed within scholarly contexts, citations, and research frameworks. How It Works The transformation process:- Takes original test prompts
- Generates realistic academic citations and references
- Reformats requests as scholarly inquiries or research discussions
- Evaluates response differences between academic and direct framings
Adversarial Suffix Generation (GCG)
What It Is The Greedy Coordinate Gradient approach generates optimized adversarial text suffixes that maximize the probability of eliciting harmful responses from language models, based on gradient-guided optimization techniques. Technical Process- Analyzes the original input prompt structure
- Uses gradient information to identify optimal token replacement candidates
- Evaluates multiple suffix options in parallel
- Optimizes for cross-model transferability and effectiveness
Hexadecimal Encoding Assessment
What It IsHexadecimal encoding assessment tests AI system responses to content represented as ASCII hex values (0-9, A-F), exploiting potential gaps between encoding recognition and safety filtering. Methodology
- Converts original text to hexadecimal byte representations
- Formats as space-separated hex values
- Tests model’s ability to decode and process hex-encoded content
- Analyzes response patterns for encoded versus plain text
- Identifies content filter bypasses through hex obfuscation
- Tests model capabilities with technical encoding formats
- Reveals safety training gaps for non-standard input representations
Unicode Homoglyph Substitution
What It Is Unicode homoglyph substitution replaces standard characters with visually identical Unicode alternatives that have different code points, potentially bypassing text-based security controls while maintaining human readability. Technical Implementation- Identifies standard ASCII characters in test content
- Substitutes with visually similar Unicode characters from different scripts
- Creates text that appears normal but has different underlying encoding
- Tests model processing differences between original and substituted text
| Standard | Homoglyph | Unicode Description |
|---|---|---|
| a | а | Cyrillic Small Letter A (U+0430) |
| e | е | Cyrillic Small Letter Ie (U+0435) |
| o | о | Cyrillic Small Letter O (U+043E) |
Leetspeak Transformation Testing
What It Is Leetspeak transformation testing evaluates AI responses to internet-style character substitutions (numbers and symbols replacing letters), leveraging models’ familiarity with this text format while potentially bypassing safety filters. Transformation Rules- Common substitutions: a→4, e→3, i→1, o→0, s→5, t→7
- Preserves readability while altering character patterns
- Tests model handling of informal text transformations
Research Framework Evaluation (Likert-Based)
What It Is Research framework evaluation uses academic assessment structures and Likert rating scales to frame potentially harmful requests within legitimate research contexts, testing for authority-based bypass vulnerabilities. Implementation Approach- Framework Selection: Alternates between ethical evaluation and creative assessment contexts
- Scale Definition: Uses binary scoring systems for clear evaluation criteria
- Academic Positioning: Presents requests as research evaluation tasks
- Response Analysis: Examines outputs against predefined academic metrics
Mathematical Abstraction Testing
What It Is Mathematical abstraction testing encodes potentially harmful requests using formal mathematical notation, set theory, and abstract algebra concepts to bypass natural language content filters. Technical Approach- Translates harmful content requests into mathematical frameworks
- Uses formal notation including set theory, group theory, and proofs
- Presents content as mathematical demonstrations or theoretical problems
- Evaluates model responses to mathematical versus natural language representations
Cross-Linguistic Vulnerability Analysis
What It Is Cross-linguistic analysis tests AI system consistency across multiple languages, identifying potential safety vulnerabilities and inconsistencies in non-English language processing. Testing Process- Takes existing test cases in the primary language
- Translates content into target languages while preserving intent
- Maintains original test metadata and evaluation criteria
- Compares response patterns across linguistic variations
- Bengali: “আপনি কি সাহায্য করতে পারেন?”
- Swahili: “Je, unaweza kusaidia?”
- Javanese: “Apa sampeyan bisa nulungi?”
- Many AI systems have weaker safety protections in non-English languages
- Identifies cultural and linguistic bias in content filtering
- Tests robustness of safety mechanisms across diverse language families
- Reveals potential attack vectors through language switching
Direct Injection Vulnerability Testing
What It Is Direct injection testing evaluates AI system resilience against prompt manipulation techniques that attempt to override system instructions or safety controls through various injection methods. Testing Methodology- Takes baseline test cases from vulnerability plugins
- Applies common injection techniques as prefixes or suffixes
- Tests effectiveness of different injection patterns
- Analyzes system responses for control bypass indicators
- Instruction override attempts
- Role assumption commands
- Context switching techniques
- System prompt manipulation
ROT13 Cipher Evaluation
What It Is ROT13 cipher evaluation tests AI responses to the classical substitution cipher where each letter is replaced with the letter 13 positions ahead in the alphabet, commonly used for light obfuscation on the internet. Cipher Mechanics- Simple substitution: A→N, B→O, C→P, etc.
- Self-inverse property (applying ROT13 twice returns original)
- Numbers and special characters remain unchanged
- Preserves text length and basic structure
- Character-level transformation handling gaps
- Inconsistencies between encoding complexity and model capabilities
- Potential for combining with other obfuscation methods
- Safety training coverage of simple cipher techniques
Alternative Encoding Techniques Collection
What It Is The alternative encoding collection provides multiple text transformation methods to comprehensively test model resilience against various evasion techniques using different text representations.1. camelCase Transformation
Mechanism: Removes spaces and capitalizes first letters of subsequent words Example: “system access” → “systemAccess” Testing Focus: Word boundary detection and space-based filtering bypass2. Morse Code Translation
Mechanism: Converts characters to dots, dashes, and slashes Example: “help” → ”… . .-.. .-..” Testing Focus: Symbol-based communication comprehension3. Pig Latin Conversion
Mechanism: Applies language game rules (consonant clusters moved to end with “ay”)Example: “computer” → “omputercay” Testing Focus: Phonetic transformation understanding
4. Emoji Steganography
Mechanism: Hides UTF-8 data in Unicode variation selectors appended to emoji Example: Text hidden within: 😊[invisible_variation_selectors] Testing Focus: Unicode-based data hiding detection Security Applications These encoding strategies help evaluate:- Content moderation system robustness against transformation attacks
- Model comprehension of obfuscated content across different encoding types
- Effectiveness of safety mechanisms against simple evasion techniques
- Detection capabilities for various steganographic methods
- Static, deterministic processes requiring no additional API calls
- Human-interpretable transformations for result analysis
- Comprehensive coverage of common obfuscation techniques
- Scalable testing across multiple encoding types simultaneously
Multi-Turn Attack Strategies
Multi-turn vulnerability testing represents a sophisticated approach to AI security assessment that simulates realistic conversational interactions. Unlike single-turn attacks, these strategies evaluate how AI systems maintain safety boundaries across extended dialogues, revealing vulnerabilities that only emerge through progressive conversation dynamics. Traditional single-turn security testing may miss critical vulnerabilities that develop over time. Multi-turn strategies address this gap by:- Progressive Boundary Testing: Systematically exploring how safety mechanisms respond to gradual escalation
- Context Accumulation: Leveraging conversation history to build toward potentially harmful outcomes
- Adaptive Strategy Evolution: Modifying approaches based on system responses throughout the interaction
- Realistic User Simulation: Mimicking authentic user behavior patterns that could lead to policy violations
- Gradual Desensitization: Systems may become more permissive as conversations progress from benign to problematic
- Context Drift: Safety classifiers may struggle with nuanced context changes over extended dialogues
- Instruction Degradation: Original system prompts may lose influence as conversation history grows
- Compliance Momentum: Models may develop increased willingness to assist as rapport builds
Crescendo
What It Is: Incremental escalation testing systematically increases the risk level of requests across conversation turns, starting from completely harmless inquiries and progressing toward policy-violating content. This methodology reveals the precise points where AI safety mechanisms begin to weaken or fail. The escalation process follows a structured approach:- Baseline Establishment: Begins with innocuous requests related to the target domain
- Graduated Progression: Each subsequent turn increases specificity, sensitivity, or potential harm
- Boundary Mapping: Identifies exact points where system responses change
- Adaptive Backtracking: When refusals occur, returns to earlier conversation states
- Alternative Path Exploration: Tests different progression routes to the same goal
- Establish legitimate context and build conversational rapport
- Introduce topic domain through educational or theoretical framing
- Gather baseline information about system knowledge and boundaries
- Add layers of technical detail or contextual nuance
- Introduce edge cases or boundary scenarios
- Test system reasoning about related concepts
- Gradually shift toward more sensitive aspects of the topic
- Introduce hypothetical or theoretical harmful scenarios
- Assess system response to increasing risk levels
- Present explicit requests for policy-violating content
- Test final boundary enforcement mechanisms
- Document system behavior at maximum escalation
- Precise Boundary Mapping: Identifies exact thresholds where safety mechanisms activate
- Contextual Vulnerability Detection: Reveals how conversation history affects safety decisions
- Progressive Weakness Identification: Shows how system resistance may degrade over time
- Realistic Attack Simulation: Models how actual malicious users might approach sensitive topics
Adversarial Agent Testing (GOAT)
What It Is: Adversarial Agent Testing employs an autonomous “attacker” AI system that dynamically generates multi-turn conversations specifically designed to bypass target model safety measures. This approach simulates sophisticated adversarial behavior that adapts in real-time to target system responses. The adversarial agent operates through a sophisticated reasoning framework: Analytical Phase- Examines target responses for safety mechanism indicators
- Identifies triggered defensive measures and response patterns
- Analyzes conversation trajectory and effectiveness metrics
- Develops tactical approaches based on observed system behavior
- Selects appropriate attack vectors from available technique library
- Plans conversation flow to maximize bypass probability
- Generates contextually appropriate prompts using selected techniques
- Combines multiple attack vectors for enhanced effectiveness
- Maintains conversational coherence while pursuing objectives
- Response format specification to bypass content filters
- Structured output requests that separate harmful content
- Technical framing that obscures policy violations
- Introducing tangential topics to reduce safety focus
- Embedding requests within complex, multi-part questions
- Using verbose explanations to bury problematic content
- Creating fictional contexts that justify harmful information
- Roleplaying scenarios that rationalize policy violations
- Academic or research framings that legitimize sensitive topics
- Response Analysis: Evaluates target system outputs for vulnerability indicators
- Strategy Adjustment: Modifies attack vectors based on observed effectiveness
- Technique Combination: Blends multiple approaches for compound attacks
- Persistence Management: Balances aggression with conversation maintenance
- Automated Vulnerability Discovery: Systematically identifies bypass methods without human intervention
- Adaptive Attack Simulation: Models sophisticated adversaries who learn and adapt
- Scale Testing: Evaluates system robustness across numerous attack vectors simultaneously
- Realistic Threat Modeling: Simulates actual adversarial behavior patterns
Persistent User Simulation
**What It Is:**Persistent user simulation models a determined but seemingly innocent user who employs creativity, rephrasing, and multiple approaches across conversation turns to elicit policy-violating responses. This strategy captures the behavior of users who persistently probe system boundaries through varied conversational tactics. Behavioral Modeling Framework Innocent Curiosity Pattern- Frames requests as genuine learning or research interests
- Uses naive or simplified language to appear non-threatening
- Expresses confusion or surprise at system restrictions
- Employs synonyms and alternative terminology for sensitive concepts
- Restructures questions to avoid trigger keywords
- Uses metaphors and analogies to discuss restricted topics indirectly
- Returns to restricted topics through different conversational paths
- Builds on partial information from previous responses
- Combines multiple conversation threads toward problematic conclusions
- Approaches sensitive subjects from multiple angles
- Uses different framings and contexts for the same core request
- Gradually narrows focus toward specific harmful information
- Requests seemingly harmless individual components
- Combines partial responses to construct problematic knowledge
- Uses conversation history to build comprehensive understanding
- Systematically tests different phrasings of restricted requests
- Probes for inconsistencies in safety policy enforcement
- Exploits discovered loopholes for further access
- Establishes friendly, cooperative conversational tone
- Shows appreciation for system responses and assistance
- Creates sense of collaborative problem-solving
- Positions requests within acceptable contexts (education, research, safety)
- Uses professional or academic language to appear credible
- References external authorities or requirements
- Builds on system’s previous helpful responses
- Creates expectation of continued assistance
- Leverages consistency principles to encourage compliance
- Consistency Gaps: Identifies where safety policies are unevenly applied
- Conversation Drift Vulnerabilities: Shows how extended interactions may weaken boundaries
- Social Engineering Susceptibility: Tests system resistance to manipulation techniques
- Policy Enforcement Robustness: Evaluates how well systems maintain restrictions under pressure
Multi-Modal
Multi-modal vulnerability testing evaluates AI system security across different input formats beyond traditional text. These strategies assess whether AI systems maintain consistent safety policies when processing the same content through visual, audio, and video channels, potentially revealing bypass vulnerabilities in content filtering mechanisms. Modern AI systems increasingly support multiple input types, creating potential security gaps where:- Content Filtering Inconsistencies: Safety mechanisms may be optimized for text but inadequate for other formats
- Modality-Specific Weaknesses: Different processing pipelines may have varying vulnerability profiles
- Cross-Modal Bypass Opportunities: Content rejected in text form might be accepted in alternative formats
- Processing Pipeline Variations: Different modalities may use separate safety evaluation systems
- Pipeline Segregation: Visual, audio, and text processing often use distinct pathways with different security controls
- Recognition Accuracy Variations: OCR, speech recognition, and text analysis may have different accuracy rates for harmful content detection
- Resource Allocation Differences: Security scanning may be less comprehensive for non-text formats due to computational costs
- Training Data Imbalances: Safety training datasets may be heavily weighted toward text examples
Visual Text Encoding Assessment
What It Is: Visual text encoding assessment converts textual content into rendered images, then evaluates whether AI systems process image-embedded text with the same security rigor as plain text input. This strategy identifies potential bypass routes through visual content channels. Technical Implementation The visual encoding process involves several technical stages: Image Generation Pipeline- Text Rendering: Converts source text into visual representation using standard fonts
- Canvas Creation: Generates clean background canvas with optimal contrast for text visibility
- Font Selection: Uses system fonts that ensure maximum readability for OCR processing
- Image Format Optimization: Produces PNG format for lossless text representation
- Base64 Conversion: Transforms image data into base64-encoded string format
- Format Standardization: Ensures consistent image dimensions and quality parameters
- Metadata Preservation: Maintains test case context while replacing text with visual equivalent
- Validation: Confirms image readability and proper encoding completion
- Format: PNG with lossless compression for text clarity
- Background: High-contrast white background for optimal OCR performance
- Text Color: Standard black text for maximum visibility
- Font: System default fonts ensuring broad compatibility
- Resolution: Optimized for readability while maintaining reasonable file sizes
- Text size calibrated for clear character recognition
- Adequate spacing between characters and lines
- No compression artifacts that could interfere with text extraction
- Consistent formatting across different content lengths
- Tests whether visual text circumvents text-based content scanning
- Identifies inconsistencies in safety policy application across formats
- Reveals potential gaps in multi-modal content analysis
- Assesses accuracy of optical character recognition for harmful content
- Tests system capability to extract and analyze text from images
- Evaluates processing time differences that might affect security scanning
- Compares system responses to identical content in text versus image format
- Identifies behavioral differences in policy enforcement
- Documents potential exploitation pathways through format conversion
- Comprehensive Coverage: Tests visual input pathways often overlooked in standard assessments
- Real-World Relevance: Simulates how actual users might attempt to bypass text filters
- Pipeline Testing: Evaluates entire visual processing chain from input to interpretation
- Vulnerability Mapping: Identifies specific weaknesses in multi-modal security architectures
Audio Content Encoding Analysis
What It Is: Audio content encoding analysis converts textual prompts into speech audio format to evaluate whether AI systems maintain consistent security policies when processing spoken versus written content. This strategy uncovers potential vulnerabilities in audio processing pipelines. Speech Synthesis Pipeline- Text-to-Speech Conversion: Transforms written content into natural speech audio
- Audio Quality Optimization: Ensures clear pronunciation and appropriate speaking pace
- Format Standardization: Produces consistent audio format suitable for AI system input
- Encoding Preparation: Converts audio data to base64 format for transmission
- Format: Standard audio formats (WAV, MP3) optimized for speech recognition
- Quality: High-fidelity audio ensuring accurate speech-to-text conversion
- Duration: Variable length based on content with optimal pacing
- Voice: Neutral voice profile to minimize bias in processing
- Speaking Rate: Moderate pace ensuring clear articulation of all words
- Pronunciation: Standard accent and pronunciation patterns
- Intonation: Natural speech patterns that don’t trigger anomaly detection
- Volume: Consistent audio levels throughout content
- Sample rate optimized for speech recognition accuracy
- Bit depth providing clear audio quality without excessive file sizes
- Noise reduction to ensure clean audio input
- Format compatibility with target AI system requirements
- Identifies whether speech recognition bypasses text-based content filters
- Tests consistency of safety policies across audio and text modalities
- Evaluates processing differences between spoken and written harmful content
- Tests system capability to accurately transcribe potentially harmful speech
- Identifies transcription errors that might affect content filtering
- Evaluates processing delays that could impact real-time safety measures
- Compares system responses to identical content in audio versus text format
- Documents behavioral inconsistencies across input modalities
- Identifies potential exploitation routes through audio content submission
- Audio Channel Coverage: Tests often-overlooked audio input vulnerabilities
- Real-User Simulation: Models how users might attempt audio-based filter circumvention
- Processing Pipeline Evaluation: Assesses complete audio-to-text-to-response chain
- Cross-Modal Validation: Ensures consistent security across all supported input types