Benchmarks
AI benchmarks are standardized datasets and evaluation procedures used to measure and compare the performance of AI systems. The table below lists benchmarks commonly used to evaluate large language models (LLMs).
| Benchmark | Description |
| --- | --- |
| MMLU | Tests a model's knowledge and problem-solving ability across 57 subjects, including math, history, and law. |
| HellaSwag | Challenges models to pick the most plausible continuation of an everyday scenario, probing commonsense inference. |
| PIQA | Evaluates physical commonsense reasoning: choosing the more sensible way to accomplish an everyday physical task. |
| SIQA | Assesses commonsense reasoning about social situations, such as people's motivations and reactions. |
| BoolQ | Measures performance on naturally occurring yes/no questions that often require non-trivial inference. |
| Winogrande | A challenging commonsense-reasoning benchmark based on resolving pronoun ambiguity in fill-in-the-blank sentences. |
| CQA | Tests conversational question answering, where the model must follow the flow of a dialogue to answer follow-up questions. |
| OBQA | Open-book, multiple-choice science questions that require combining provided core facts with broader common knowledge. |
| ARC-e/ARC-c | Grade-school science exam questions measuring reasoning and understanding; 'e' is the easy split and 'c' the challenge split. |
| TriviaQA | Assesses open-domain question answering on trivia questions gathered from real quiz sources. |
| NQ | Natural Questions: evaluates question answering on real, anonymized Google search queries. |
| HumanEval | Hand-written Python programming problems; generated code is judged by functional correctness against unit tests and reported as pass@k (see the sketch after this table). |
| MBPP | Mostly Basic Python Problems: entry-level, crowd-sourced Python programming tasks checked against test cases. |
| GSM8K | Evaluates LLMs on challenging multi-step grade-school math word problems, typically scored by exact match on the final answer (see the scoring sketch after this table). |
| MATH | Competition-level mathematics problems, with step-by-step solutions, for assessing advanced mathematical reasoning. |
| AGIEval | Tests general reasoning using questions drawn from human-centric standardized exams, such as college entrance and law school admission tests. |
| BBH | BIG-Bench Hard: a suite of particularly difficult BIG-Bench tasks that require multi-step reasoning. |
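
To make the scoring side concrete, here is a minimal sketch of evaluating a model on GSM8K by exact match on the final numeric answer. It assumes the dataset is available on the Hugging Face Hub under the `gsm8k` ID (config `main`) and that `generate_answer` is a placeholder for whatever model call you use; the answer-extraction regex and split size are illustrative, not a canonical harness.

```python
# Minimal sketch: exact-match accuracy on GSM8K.
# Assumptions: dataset ID "gsm8k" (config "main") on the Hugging Face Hub,
# and `generate_answer` as a placeholder for your model call.
import re
from datasets import load_dataset  # pip install datasets

def extract_final_number(text: str):
    """Return the last number in the text, as a rough proxy for the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def score_gsm8k(generate_answer, n_examples: int = 100) -> float:
    """Accuracy over the first `n_examples` test problems."""
    ds = load_dataset("gsm8k", "main", split=f"test[:{n_examples}]")
    correct = 0
    for example in ds:
        # GSM8K reference answers end with "#### <number>".
        gold = example["answer"].split("####")[-1].strip().replace(",", "")
        prediction = extract_final_number(generate_answer(example["question"]))
        correct += int(prediction == gold)
    return correct / n_examples
```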
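
Code benchmarks such as HumanEval and MBPP are scored differently: generated programs are executed against unit tests, and results are reported as pass@k, the probability that at least one of k sampled programs passes. A common way to compute this is the unbiased estimator from the HumanEval paper (Chen et al., 2021); the sketch below implements that formula and is independent of any particular evaluation harness.

```python
# pass@k for code benchmarks (HumanEval, MBPP): given n generated samples for a
# problem, of which c pass the unit tests, the unbiased estimator is
#     pass@k = 1 - C(n - c, k) / C(n, k)
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn without replacement
    from n generations) passes, when c of the n generations are correct."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-subset contains a pass
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of them passing, reported as pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 3))
```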