Evaluation: Selector

The Selector evaluation is a specialized tool for forced-choice testing. It presents the AI with a list of options and requires it to choose the one that best satisfies the prompt's criteria.

How it Works

  1. Source Binding: You select a Dataset or Folder (Persona/Business) as the source of options.

  2. Permutation: The system takes N items from that source (e.g., 3 options) and injects them into the User Prompt.

  3. Choice Extraction: The AI is instructed (via an automatic system suffix) to return its choice in a specific format (JSON or a clear label like "Option A").

  4. Leaderboard Tracking: The system tracks which specific items from your source are selected most often across all scenarios and models.
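The leaderboard step above is essentially a frequency count of winning items across every run. A minimal sketch (the model names and item labels below are hypothetical, not taken from the tool):

```python
from collections import Counter

# Hypothetical (model, chosen_item) results collected across many runs.
runs = [
    ("model-x", "Headline B"),
    ("model-x", "Headline A"),
    ("model-y", "Headline B"),
    ("model-y", "Headline B"),
]

# Leaderboard: how often each source item is selected,
# aggregated across all scenarios and models.
leaderboard = Counter(choice for _, choice in runs)
print(leaderboard.most_common())  # "Headline B" leads with 3 wins
```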

The Permutation Algorithm: The Selector evaluation doesn't just shuffle items; it uses a systematic multi-pass permutation algorithm to ensure rigorous testing.
  • Selection (nPk): The system takes your source items (n) and generates every possible subset of size k (the "Options per Run").

  • Order Sensitivity: Unlike a "Combination," the Selector uses Ordered Permutations. This means the set [Option A, Option B] and [Option B, Option A] are treated as two distinct scenarios.

  • Position Bias Detection: By presenting the same options in different positions across multiple runs, the system can detect if a model has a "Position Bias" (e.g., a tendency to always pick the first option regardless of content).

  • Combinatorial Growth: Note that as n and k increase, the number of runs grows rapidly following the formula: P(n, k) = n! / (n - k)!

    • Example: 5 items with 3 options per run results in 60 unique scenarios per model.
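The scenario count above follows directly from ordered permutations, which Python's standard library can reproduce. A quick sketch verifying the 5-items, 3-options example:

```python
from itertools import permutations

items = ["A", "B", "C", "D", "E"]  # n = 5 source items
k = 3                              # "Options per Run"

# Ordered permutations: (A, B, C) and (C, B, A) are distinct scenarios,
# which is what enables position-bias detection.
scenarios = list(permutations(items, k))
print(len(scenarios))  # P(5, 3) = 5! / (5 - 3)! = 60
```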


Configuration

Options per Run

  • Range: 2 to 10.

  • Function: Controls how many items from your source are shown to the AI in a single request.

  • Tip: High counts (e.g., 10) test the AI's ability to handle long context, while low counts (e.g., 2) are better for simple A/B testing.

Listing Style

Letters (A, B, C): Conventional choice labels.

Best For: Numeric data, currency, or dates. A label like "Option 1: 500" can blur the boundary between the label and the value; "Option A: 500" keeps a clear semantic separation between the two.

Numbers (1, 2, 3): Useful for very long lists.

Best For: Long text descriptions, names, or abstract concepts. Numbers provide a familiar sequential structure for content that doesn't itself contain digits, so there is no risk of label/value confusion.

Auto-Formatting: The system automatically generates the markdown list for you, so you don't need to manually map variables like {{option_a}}.
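The auto-formatting described above can be pictured as a small labeling routine. This is an illustrative sketch of the behavior, not the system's actual implementation; the function name and signature are assumptions:

```python
from string import ascii_uppercase

def format_options(options, style="letters"):
    """Render options as a labeled list, in either Letters or Numbers style.
    (Hypothetical sketch of the auto-formatter's behavior.)"""
    lines = []
    for i, opt in enumerate(options):
        label = ascii_uppercase[i] if style == "letters" else str(i + 1)
        lines.append(f"Option {label}: {opt}")
    return "\n".join(lines)

print(format_options(["500", "750"], style="letters"))
# Option A: 500
# Option B: 750
```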

Data Configuration (Folders)

If your source is a Folder (e.g., a Persona group):

  • Full Profile: Injects the entire record for each option.

  • Section Selection: Injects only specific fields (e.g., just the "Bio" and "Pain Points"), keeping the prompt focused and saving tokens.
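Section Selection amounts to filtering a record down to the chosen fields before injection. A minimal sketch, using made-up persona fields ("Bio", "Pain Points", "Income") for illustration:

```python
def inject_profile(record, sections=None):
    """Return the full record (Full Profile) or only the selected
    sections, keeping the prompt focused and saving tokens.
    Illustrative sketch; field names are hypothetical."""
    if sections is None:  # Full Profile mode
        return record
    return {k: v for k, v in record.items() if k in sections}

persona = {"Bio": "Retired teacher", "Pain Points": "Small print", "Income": "Medium"}
print(inject_profile(persona, sections=["Bio", "Pain Points"]))
```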


Extraction Logic

The system uses a multi-stage approach to find the AI's choice:

  1. JSON Search: Looks for a JSON object with an option or choice key.

  2. Label Match: Scans for patterns like "Option A", "Choice: 1", or simply "A".

  3. Direct Match: If the AI output is just a single character/digit, it is matched directly to the option index.
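The three stages above can be sketched as a fallback chain. This is an assumed approximation of the extraction logic, not the tool's actual code:

```python
import json
import re

def extract_choice(output, labels=("A", "B", "C")):
    """Multi-stage choice extraction sketch (assumed behavior)."""
    # 1. JSON search: look for an object with an "option" or "choice" key.
    match = re.search(r"\{[^{}]*\}", output)
    if match:
        try:
            obj = json.loads(match.group())
            for key in ("option", "choice"):
                if key in obj:
                    return str(obj[key])
        except json.JSONDecodeError:
            pass
    # 2. Label match: patterns like "Option A" or "Choice: 1".
    m = re.search(r"(?:Option|Choice:?)\s*([A-Z0-9])", output, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # 3. Direct match: output is just a bare label character/digit.
    if output.strip() in labels:
        return output.strip()
    return None

print(extract_choice('I pick {"choice": "B"}'))  # B
```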


Use Cases

📈 Marketing & Business

  1. Headline A/B Testing: Present 5 variations of a landing page headline to see which one a "Skeptic" persona finds most trustworthy.

  2. Product Recommendation: Give the AI a customer's purchase history and ask it to pick the best next product from a catalog.

  3. Brand Voice Consistency: Show 4 variations of a tweet and ask which one best adheres to the specific brand voice defined in a Business profile.

  4. Pricing Strategy: Present various price points and ask a "Frugal Persona" which one feels most like a "steal" vs "fair value."

  5. Target Audience Segmenting: Provide 3 audience profiles and ask which one would most likely respond to a specific discount offer.


🛡️ Content & Quality

  1. Content Moderation: Present 3 user comments and ask the AI to select the one that violates a specific safety policy.

  2. Sentiment Labeling: Give 5 reviews and ask the AI to pick the most "Positive" or "Urgent" one.

  3. Summary Quality: Paste a long article and provide 3 AI-generated summaries; ask a "Professional Editor" persona to pick the most accurate one.

  4. Legal Document Classification: Present 4 contract clauses and ask the AI to categorize them into "High Risk" vs "Low Risk."

  5. Code Review Assistant: Show 3 ways to fix a bug and ask the AI to pick the most efficient/performant solution.


🎓 Specialized Domains

  1. Medical Triage Simulation: Present 3 patient descriptions and ask the AI to prioritize the most critical case for a Roleplay scenario.

  2. Educational Level Assessment: Provide 3 explanations of Quantum Physics and ask which one is most appropriate for a "Middle School Student" level.

  3. Conflict Resolution: Present 4 potential responses to a workplace conflict and ask which one is most likely to de-escalate the situation.

  4. Financial Risk Assessment: Provide 3 loan applications and ask the AI to identify the one with the highest potential for default based on specific heuristics.

  5. Creative Writing Plot Choice: Provide 3 narrative "hooks" and ask a "Fantasy Fan" persona which one they would be most excited to read further.

