AI Benchmarks


AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.

MMLU: Tests a model's knowledge and problem solving with multiple-choice questions across 57 subjects such as math, history, and law.
HellaSwag: Challenges LLMs to demonstrate commonsense inference by choosing the most plausible continuation of a described situation.
PIQA: Evaluates physical commonsense reasoning. The model chooses the more plausible of two solutions to an everyday physical task.
SIQA: Assesses an LLM's commonsense reasoning and understanding of social situations.
BoolQ: Measures models on naturally occurring yes/no questions, each paired with a passage, that often require complex reasoning.
Winogrande: A challenging benchmark focusing on commonsense reasoning by resolving pronoun ambiguity.
CQA (CommonsenseQA): Tests commonsense knowledge with multiple-choice questions whose answers are not stated in any supplied context.
OBQA (OpenBookQA): Evaluates a model's ability to answer multiple-choice elementary science questions by combining a small set of provided science facts with broader common knowledge.
ARC-e/ARC-c: A set of grade-school science exam questions measuring reasoning and understanding. 'e' stands for the Easy set and 'c' for the Challenge set.
TriviaQA: Assesses LLMs on open-domain trivia questions written by trivia enthusiasts.
NQ (Natural Questions): Evaluates question answering on real Google search queries, with answers drawn from Wikipedia.
HumanEval: Measures code generation. The model completes Python functions from their signatures and docstrings, and each solution is checked for functional correctness with unit tests.
MBPP (Mostly Basic Python Problems): Examines model performance on entry-level Python programming tasks, each verified against test cases.
GSM8K: Evaluates LLMs on challenging multi-step grade-school math word problems.
MATH: Assesses mathematical reasoning on competition-level problems in areas such as algebra, geometry, and number theory.
AGIEval: Tests an LLM on questions drawn from human standardized exams, such as college entrance and law school admission tests.
BBH (BIG-Bench Hard): A subset of 23 challenging BIG-Bench tasks on which earlier language models fell short of average human-rater performance.
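
Several of the benchmarks above (MMLU and ARC, for example) are multiple-choice, so a model's result is often reported as plain accuracy: the fraction of questions where the chosen option matches the answer key. The following is a minimal sketch of that scoring step only. The `Item` class, the `accuracy` function, the toy questions, and the hard-coded predictions are all hypothetical illustrations, not taken from any real benchmark or evaluation harness.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical structure)."""
    question: str
    choices: list[str]
    answer: int  # index of the correct choice in `choices`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Return the fraction of items whose predicted index matches the key."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)


# Toy items made up for illustration; not from any real benchmark.
ITEMS = [
    Item("What is 2 + 3?", ["4", "5", "6", "7"], 1),
    Item("Which planet is closest to the Sun?", ["Venus", "Earth", "Mercury", "Mars"], 2),
    Item("At what temperature in Celsius does water freeze?", ["0", "10", "32", "100"], 0),
    Item("How many sides does a triangle have?", ["2", "3", "4", "5"], 1),
]

# Stand-in for model output: one chosen option index per item.
PREDICTIONS = [1, 2, 2, 1]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(ITEMS, PREDICTIONS):.2f}")
```

In a real evaluation the predictions would come from the model under test, and details such as prompt format and the number of few-shot examples can change the reported score, so results are only comparable when those settings match.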