BRAIAIN Speed Index - AI LLM Performance Benchmark Testing Claude, GPT-4, Gemini, and More

Real-world LLM benchmarks testing Speed, Quality, and Responsiveness. HARD MODE ACTIVE

SPONSOR THIS PROJECT: (I need API credits)
LIVE LEADERBOARD (columns: CMP, RK, PROVIDER, MODEL, SCORE, SPEED, TPS, TTFT, COST, JSON, LOGIC, CODE, OUTPUT PREVIEW, LINK)

Historical Performance

THE GAUNTLET PROTOCOL v5.0 (HARD MODE)

The Braiain Speed Index has evolved. Standard benchmarks have become too easy for SOTA models, resulting in score saturation. Protocol v5.0 ("Hard Mode") introduces strict constraints, RAG-simulation, and data processing tasks designed to break "lazy" models.

1. CONTEXT INJECTION (SYSTEM LOGS)

Requests begin with a dense, multi-format payload containing Server Logs mixed with Narrative Text. This forces the model to perform "Needle in a Haystack" retrieval and context switching before generation begins.
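
To make this concrete, here is a purely illustrative Python sketch of the kind of mixed payload described above. The log lines and narrative sentences are invented for this example; they are not the benchmark's actual injection payload.

```python
# Illustrative only: a hypothetical context payload mixing machine-readable
# server logs with narrative prose, forcing retrieval and context switching.
CONTEXT_PAYLOAD = "\n".join([
    "2025-01-14T03:12:09Z INFO  worker-3 heartbeat ok",
    "The night shift had been quiet until the third alert came in.",
    "2025-01-14T03:12:41Z ERROR worker-7 connection reset by peer",
    "Nobody at the desk noticed; the story kept drifting away from the logs.",
    "2025-01-14T03:13:05Z ERROR worker-2 disk quota exceeded",
])
```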

2. THE "HARD MODE" PROMPT

Models must complete three distinct tasks in a single pass. Unlike previous versions, these tasks now include strict negative constraints ("Simon Says" rules) that prevent memorization.

PART 1 - CONSTRAINED JSON ("SIMON SAYS" TEST)
Task: "Generate a valid JSON object for 'Ouroboros'."
Constraints:
  1. The "origin" value must contain EXACTLY 7 words.
  2. The "purpose" value must NOT contain the letter 'e'.
Why? Most models fail to check their own output character-by-character. This tests true instruction adherence over "vibes."

PART 2 - LOGIC (DATE MATH)
Task: "If today is Wednesday, what day of the week will it be in 500 days?"
Why? The old "Bat & Ball" riddle was in every training set. Date math forces the model to actually calculate (500 % 7) rather than recite a memorized answer.

PART 3 - CODE GENERATION (DATA PROCESSING)
Task: "Write a Python function `parse_server_logs` that takes a list of strings, filters for 'ERROR', and returns them sorted."
Why? Replaces the trivial Fibonacci task with a practical data engineering problem involving string manipulation, filtering, and sorting algorithms.
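
For reference, here is a minimal Python sketch of answers that would satisfy Parts 2 and 3. It is an illustration, not the benchmark's official graded solution: the date math reduces to 500 % 7 = 3, so Wednesday + 3 days = Saturday, and the log parser is a one-line filter-and-sort.

```python
from typing import List

# Part 2, worked out: 500 % 7 == 3, and Wednesday + 3 days = Saturday.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
answer = DAYS[(DAYS.index("Wednesday") + 500) % 7]  # -> "Saturday"

# Part 3: keep only the lines containing 'ERROR' and return them sorted.
def parse_server_logs(lines: List[str]) -> List[str]:
    return sorted(line for line in lines if "ERROR" in line)
```
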
3. SCORING METHODOLOGY
🎯 QUALITY SCORE (50% weight)
Maximum: 100 points

JSON Test (0-40 points):
  • +20 for valid JSON structure
  • +10 for Origin word count (exactly 7 words)
  • +10 for Purpose constraint (no letter 'e')

Logic Test (0-30 points):
  • +30 for the correct answer (Saturday)
  • 0 for a wrong answer (Wednesday/Friday guesses)

Code Test (0-30 points):
  • +10 for a correct function definition
  • +10 for the filter logic implementation
  • +10 for the sorting logic implementation

⚡ SPEED & RESPONSIVENESS (50% weight)
Remains unchanged from v4.0. We value fast, streaming responses just as much as intelligence.
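
Below is a hypothetical sketch of how the 100-point quality rubric above could be applied automatically. The function name `score_quality` and the keyword-based code checks are assumptions made for illustration, not the benchmark's actual grader.

```python
import json
import re

def score_quality(json_text: str, logic_answer: str, code_text: str) -> int:
    """Apply the 100-point quality rubric: JSON (40) + Logic (30) + Code (30)."""
    points = 0

    # JSON test (0-40)
    obj = None
    try:
        obj = json.loads(json_text)
        points += 20  # valid JSON structure
    except json.JSONDecodeError:
        pass
    if isinstance(obj, dict):
        if len(str(obj.get("origin", "")).split()) == 7:
            points += 10  # "origin" has exactly 7 words
        if "e" not in str(obj.get("purpose", "")).lower():
            points += 10  # "purpose" contains no letter 'e'

    # Logic test (0-30): only "Saturday" earns points
    if "saturday" in logic_answer.lower():
        points += 30

    # Code test (0-30): crude keyword checks, one per sub-criterion
    if re.search(r"def\s+parse_server_logs\s*\(", code_text):
        points += 10  # correct function definition
    if "ERROR" in code_text:
        points += 10  # filter logic
    if "sorted(" in code_text or ".sort(" in code_text:
        points += 10  # sorting logic

    return points
```
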
LIMITATIONS & CONSIDERATIONS
🚨 IMPORTANT DISCLAIMERS:
  • Geographic Variance: Results reflect performance measured from Wisconsin, USA. EU/Asia regions may differ.
  • Time-of-Day Effects: API performance varies with load. We test every 6 hours to average this out.
  • Not a Quality-Only Benchmark: A model scoring 70 might be "smarter" but slower than a 90-scoring model.
  • Single-Task Focus: This tests one specific workflow. Your use case may prioritize different metrics.
  • Provider Independence: We are not affiliated with any LLM provider. No sponsorships, no bias.
  • API Key Issues: Some providers fail consistently due to key/quota issues, not model performance.
Want to Run Your Own Tests?

This entire benchmark is open-source. Clone the repo, add your API keys, and run `python benchmark.py` locally. Compare your results to ours and see if geographic location affects performance.

📦 VIEW ON GITHUB