BRAIAIN Speed Index - AI LLM Performance Benchmark

Real-world LLM benchmarks testing Speed, Quality, and Responsiveness (HARD MODE ACTIVE)

LAST RUN: -- | NEXT RUN: --:--:-- | NODES: -- | STATUS: LIVE | PROTOCOL: V6.0
BUILD YOUR OWN BENCHMARK (Auto-Balances to 100%)
SPONSOR THIS PROJECT (I need API credits)
CONTACT >
LEADERBOARD COLUMNS: CMP | RK | PROVIDER | MODEL | SCORE | SPEED | TPS | TTFT | COST | TREND | JSON | LOGIC | CODE | OUTPUT (PREVIEW) | LINK

CHARTS
  • Historical Trends
  • The Value Quadrant: Intelligence (Y) vs. Cost (X)
  • Speed Comparison: total response time (lower is better)
  • Cost Calculator: find your cheapest option
TESTING METHODOLOGY

Unlike standard benchmarks that use short, context-free prompts, we simulate production workloads with heavy context and strict constraints.

1. HEAVY CONTEXT INJECTION

Every request includes 2,500 characters of system logs mixed with narrative text—simulating real RAG scenarios where models must parse context before responding.
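
For reference, here is a minimal sketch of how such a context block might be assembled; the helper name, log format, and narrative filler are illustrative, not taken from the repo.

```python
# Sketch (not the actual harness code): build ~2,500 characters of synthetic
# server logs interleaved with narrative text, then append the real task.
def build_heavy_context(target_chars: int = 2500) -> str:
    levels = ["INFO", "WARN", "ERROR"]
    narrative = "Meanwhile, the on-call engineer skimmed the dashboard for anomalies. "
    pieces, i = [], 0
    while sum(len(p) for p in pieces) < target_chars:
        # Synthetic but realistic-looking log line.
        pieces.append(
            f"[2024-01-15 12:{i % 60:02d}:00] {levels[i % 3]} svc-{i % 7}: "
            f"request handled in {(i * 37) % 900}ms\n"
        )
        if i % 4 == 3:  # interleave a narrative sentence every few log lines
            pieces.append(narrative + "\n")
        i += 1
    return "".join(pieces)[:target_chars]  # trim to the 2,500-character budget

prompt = build_heavy_context() + "\n\nUsing the context above, complete the three tasks below."
```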

2. THREE-PART COGNITIVE TEST

TASK 1: Constrained JSON (40 pts)
Generate JSON with strict rules:
  • "origin" field = exactly 7 words
  • "purpose" field = no letter 'e'
Tests: Instruction adherence, self-verification

TASK 2: Date Math Logic (30 pts)
"If today is Wednesday, what day in 500 days?"
Tests: Calculation ability (500 % 7), not memorization

TASK 3: Data Processing Code (30 pts)
Write a parse_server_logs() function with filtering + sorting
Tests: Practical coding, algorithm implementation (see the sketch below)
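
To make Tasks 2 and 3 concrete, here is the arithmetic Task 2 expects and one plausible reference answer for Task 3. The exact parse_server_logs() signature and log format are assumptions for illustration; the benchmark only specifies filtering plus sorting.

```python
# Task 2: 500 % 7 == 3, so 500 days after a Wednesday is a Saturday.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
assert DAYS[(DAYS.index("Wednesday") + 500) % 7] == "Saturday"

# Task 3: a hypothetical reference solution, not the grader's canonical one.
def parse_server_logs(lines, min_latency_ms=100):
    """Keep ERROR entries at or above a latency threshold, sorted slowest-first."""
    entries = []
    for line in lines:
        level, latency = line.split()                 # e.g. "ERROR 420ms"
        latency_ms = int(latency.rstrip("ms"))
        if level == "ERROR" and latency_ms >= min_latency_ms:
            entries.append({"level": level, "latency_ms": latency_ms})
    return sorted(entries, key=lambda e: e["latency_ms"], reverse=True)

print(parse_server_logs(["ERROR 420ms", "INFO 12ms", "ERROR 90ms"]))
# -> [{'level': 'ERROR', 'latency_ms': 420}]
```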
3. COMPOSITE SCORING

Formula: Quality (50%) + Speed (30%) + Responsiveness (20%)

Quality Score (0-100):
  • JSON constraints: 40 pts
  • Logic puzzle: 30 pts
  • Code generation: 30 pts

Speed Score: normalized against a 30 s baseline
Responsiveness: TTFT + tokens/sec metrics

Penalty System:
  • Quality < 90: max score capped at 85
  • Quality < 60: -20% multiplier + cap at 65
  • Quality < 30: hard cap at 40
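
As a sketch, the weights and penalty rules above translate roughly into the code below. The linear speed normalization is an assumption, since only the 30-second baseline is published.

```python
def speed_score(elapsed_s: float, baseline_s: float = 30.0) -> float:
    """Assumed linear normalization against the 30 s baseline (0-100)."""
    return max(0.0, min(100.0, 100.0 * (1.0 - elapsed_s / baseline_s)))

def composite_score(quality: float, speed: float, responsiveness: float) -> float:
    """Weighted blend of the three 0-100 sub-scores, then quality-gated penalties."""
    score = 0.5 * quality + 0.3 * speed + 0.2 * responsiveness
    if quality < 30:
        return min(score, 40)         # hard cap for very low quality
    if quality < 60:
        return min(score * 0.8, 65)   # -20% multiplier plus cap at 65
    if quality < 90:
        return min(score, 85)         # speed alone cannot push past 85
    return score

# A fast but mediocre model: 0.5*55 + 0.3*95 + 0.2*90 = 74.0, penalized to 59.2.
print(round(composite_score(quality=55, speed=95, responsiveness=90), 1))  # 59.2
```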
LIMITATIONS & CONSIDERATIONS
🚨 IMPORTANT DISCLAIMERS:
  • Geographic Variance: Results reflect US-Wisconsin performance. EU/Asia regions may differ.
  • Time-of-Day Effects: API performance varies with load. We test every 6 hours to average this out.
  • Not a Quality-Only Benchmark: A composite score of 70 can belong to a model that answers better, but responds more slowly, than one scoring 90.
  • Single-Task Focus: This tests one specific workflow. Your use case may prioritize different metrics.
  • Provider Independence: We are not affiliated with any LLM provider. No sponsorships, no bias.
  • API Key Issues: Some providers fail consistently due to key/quota issues, not model performance.
Want to Run Your Own Tests?

This entire benchmark is open-source. Clone the repo, add your API keys, and run `python benchmark.py` locally.

📦 VIEW ON GITHUB