LLM Leaderboard Cheat Sheet

This cheat sheet is a handy guide that helps you quickly understand and compare the top large language models (LLMs) available. It offers an up-to-date snapshot of the leading models, their performance, cost, and best-use cases as of August 15, 2024. It’s a great resource to help you make the right choice quickly and easily.

⚠️ Current large language models (LLMs) are limited compared to biological intelligence. While LLMs are powerful in processing large amounts of text data, they are still far from being "intelligent" in the way animals or humans are. Artificial intelligence (AI) still lacks the common sense, reasoning ability, and sensory perception that even a simple organism like a cat possesses.

Top Highlights

Most Preferred Model

GPT-4 Turbo (0409)

Klu Index: 100

Largest Context Window

Claude 3 Opus

200k Tokens

Most Expensive Model

GPT-4 32k (0314)

Cost: $75 / Million Tokens

Least Expensive Model

Sonar Small

Cost: $0.02 / Million Tokens

Fastest Tokens Per Second (TPS)

Llama 3 8B

TPS: 1211 Tokens/s

Fastest Time to First Token (TTFT)

Gemma 2 9B

TTFT: 0.21s


Top Models by Use Case

Code & Reasoning

GPT-4 Turbo (0409)

AI Applications

GPT-4 Omni (0807)

Chat & Vision

Claude 3.5 Sonnet

Creative Content

Claude 3 Opus

Data Labeling

Gemini Flash 1.5


Benchmark Performance

BigCodeBench (Coding)

Top model - Claude 3.5 Sonnet, 47.69% Average

MATH (Mathematical Proficiency)

GPT-40 (0513) - 76.6%

GPQA (Expert Knowledge)

GPT-4 Turbo (0409) - 72.2%

MMMU (Vision)

Gemini Flash 1.5 - 34.85%

GAIA (AI Assistants)

Gemini Pro 1.5 - 62.2%

QUAKE (Productivity)

Claude 3.5 Sonnet - 31.82%


Cost & Speed Overview

Cheapest per Million Tokens

Sonar Small ($0.02)

Most Expensive per Million Tokens

GPT-4 32k (0314) ($75)

Fastest Speed (TPS)

Llama 3 8B (1211 Tokens/s)

Quickest TTFT

Gemma 2 9B (0.21s)


Provider-Specific Insights

OpenAI

Leading with GPT-4 Turbo, high performance across most benchmarks.

Anthropic

Notable models include Claude 3.5 Sonnet for vision and chat.

Google

Gemini Pro 1.5 excels in productivity and data labeling.

Azure AI

Strong in coding and general knowledge with Llama 3.1 series.

Mistral

Offers affordable models with competitive performance, like Mistral Large.