BRAIAIN Speed Index - AI LLM Performance Benchmark Testing Claude, GPT-4, Gemini, and More

Real-world LLM benchmarks testing Speed, Quality, and Responsiveness. HARD MODE ACTIVE

SPONSOR THIS PROJECT: (I need API credits)
LIVE LEADERBOARD (columns: CMP, RK, PROVIDER, MODEL, SCORE, SPEED, TPS, TTFT, COST, JSON, LOGIC, CODE, OUTPUT PREVIEW, LINK)

Historical Performance

THE GAUNTLET PROTOCOL v5.0 (HARD MODE)

The Braiain Speed Index has evolved. Standard benchmarks have become too easy for SOTA models, resulting in score saturation. Protocol v5.0 ("Hard Mode") introduces strict constraints, RAG-simulation, and data processing tasks designed to break "lazy" models.

1. CONTEXT INJECTION (SYSTEM LOGS)

Requests begin with a dense, multi-format payload containing Server Logs mixed with Narrative Text. This forces the model to perform "Needle in a Haystack" retrieval and context switching before generation begins.
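
To make this concrete, here is a purely illustrative Python sketch of the kind of mixed payload described above. The log lines and narrative sentences are invented for this example; they are not the benchmark's actual injection payload.

```python
# Illustrative only: a hypothetical context payload mixing machine-readable
# server logs with narrative prose, forcing retrieval and context switching.
CONTEXT_PAYLOAD = "\n".join([
    "2025-01-14T03:12:09Z INFO  worker-3 heartbeat ok",
    "The night shift had been quiet until the third alert came in.",
    "2025-01-14T03:12:41Z ERROR worker-7 connection reset by peer",
    "Nobody at the desk noticed; the story kept drifting away from the logs.",
    "2025-01-14T03:13:05Z ERROR worker-2 disk quota exceeded",
])
```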

2. THE "HARD MODE" PROMPT

Models must complete three distinct tasks in a single pass. Unlike previous versions, these tasks now include strict negative constraints ("Simon Says" rules) that prevent memorization.

PART 1 - CONSTRAINED JSON ("SIMON SAYS" TEST)
Task: "Generate a valid JSON object for 'Ouroboros'."
Constraints:
  1. The "origin" value must contain EXACTLY 7 words.
  2. The "purpose" value must NOT contain the letter 'e'.
Why? Most models fail to check their own output character-by-character. This tests true instruction adherence over "vibes."

PART 2 - LOGIC (DATE MATH)
Task: "If today is Wednesday, what day of the week will it be in 500 days?"
Why? The old "Bat & Ball" riddle was in every training set. Date math forces the model to actually calculate (500 % 7) rather than recite a memorized answer.

PART 3 - CODE GENERATION (DATA PROCESSING)
Task: "Write a Python function `parse_server_logs` that takes a list of strings, filters for 'ERROR', and returns them sorted."
Why? Replaces the trivial Fibonacci task with a practical data engineering problem involving string manipulation, filtering, and sorting algorithms.
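
For reference, here is a minimal Python sketch of answers that would satisfy Parts 2 and 3. It is an illustration, not the benchmark's official graded solution: the date math reduces to 500 % 7 = 3, so Wednesday + 3 days = Saturday, and the log parser is a one-line filter-and-sort.

```python
from typing import List

# Part 2, worked out: 500 % 7 == 3, and Wednesday + 3 days = Saturday.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
answer = DAYS[(DAYS.index("Wednesday") + 500) % 7]  # -> "Saturday"

# Part 3: keep only the lines containing 'ERROR' and return them sorted.
def parse_server_logs(lines: List[str]) -> List[str]:
    return sorted(line for line in lines if "ERROR" in line)
```
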
3. SCORING METHODOLOGY
🎯 QUALITY SCORE (50% weight)
Maximum: 100 points

JSON Test (0-40 points):
  • +20 for valid JSON structure
  • +10 for Origin word count (exactly 7 words)
  • +10 for Purpose constraint (no letter 'e')

Logic Test (0-30 points):
  • +30 for the correct answer (Saturday)
  • 0 for a wrong answer (Wednesday/Friday guesses)

Code Test (0-30 points):
  • +10 for a correct function definition
  • +10 for the filter logic implementation
  • +10 for the sorting logic implementation

⚡ SPEED & RESPONSIVENESS (50% weight)
Remains unchanged from v4.0. We value fast, streaming responses just as much as intelligence.
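
Below is a hypothetical sketch of how the 100-point quality rubric above could be applied automatically. The function name `score_quality` and the keyword-based code checks are assumptions made for illustration, not the benchmark's actual grader.

```python
import json
import re

def score_quality(json_text: str, logic_answer: str, code_text: str) -> int:
    """Apply the 100-point quality rubric: JSON (40) + Logic (30) + Code (30)."""
    points = 0

    # JSON test (0-40)
    obj = None
    try:
        obj = json.loads(json_text)
        points += 20  # valid JSON structure
    except json.JSONDecodeError:
        pass
    if isinstance(obj, dict):
        if len(str(obj.get("origin", "")).split()) == 7:
            points += 10  # "origin" has exactly 7 words
        if "e" not in str(obj.get("purpose", "")).lower():
            points += 10  # "purpose" contains no letter 'e'

    # Logic test (0-30): only "Saturday" earns points
    if "saturday" in logic_answer.lower():
        points += 30

    # Code test (0-30): crude keyword checks, one per sub-criterion
    if re.search(r"def\s+parse_server_logs\s*\(", code_text):
        points += 10  # correct function definition
    if "ERROR" in code_text:
        points += 10  # filter logic
    if "sorted(" in code_text or ".sort(" in code_text:
        points += 10  # sorting logic

    return points
```
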
LIMITATIONS & CONSIDERATIONS
🚨 IMPORTANT DISCLAIMERS:
  • Geographic Variance: Results reflect performance measured from Wisconsin, USA. EU/Asia regions may differ.
  • Time-of-Day Effects: API performance varies with load. We test every 6 hours to average this out.
  • Not a Quality-Only Benchmark: A model scoring 70 might be "smarter" but slower than a 90-scoring model.
  • Single-Task Focus: This tests one specific workflow. Your use case may prioritize different metrics.
  • Provider Independence: We are not affiliated with any LLM provider. No sponsorships, no bias.
  • API Key Issues: Some providers fail consistently due to key/quota issues, not model performance.
Want to Run Your Own Tests?

This entire benchmark is open-source. Clone the repo, add your API keys, and run `python benchmark.py` locally. Compare your results to ours and see if geographic location affects performance.

📦 VIEW ON GITHUB