SCORE Leaderboard

We introduce SCORE, an open and holistic evaluation framework for LLMs centered on robustness, i.e. the ability to produce consistent responses when the input is rephrased or presented in a slightly different way. Prediction consistency is particularly crucial for factual questions where an objective answer exists; note that the predictions are only expected to be equivalent, not necessarily correct. Models are evaluated multiple times in equivalent setups, and we report the accuracy range along with the prediction consistency rate. In contrast to the single accuracy metric (often derived from an optimized setup) reported during model releases, this better simulates human interaction setups and gives a better estimate of real-world performance. Furthermore, all models are evaluated using the same setup, which makes direct comparison possible.

Tasks

Prompt Robustness - Models are evaluated on ten different prompts. For multiple choice question (MCQ) datasets, the prompts ask the model to choose the right option letter; for MATH, they ask the model to solve the problem. The prompt set is diverse enough to cover the various content and formatting styles a model may encounter in real life, and the prompts are not adversarial or tuned in any way. They are semantically close but vary in instruction wording and the requested level of response detail, and each ends with final answer formatting instructions. We include both CoT and non-CoT prompts and vary the placement of the question to the beginning, the middle, or the end of the prompt.
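
This kind of prompt variation can be made concrete with a few templates. The snippet below is a minimal sketch, not the actual SCORE prompt set: the template wording, placeholders, and helper function are illustrative assumptions that only mirror the variation axes described above (question placement, CoT vs. non-CoT, answer formatting instructions).

```python
# Hypothetical MCQ prompt templates -- NOT the ten prompts used by SCORE.
# They only illustrate the variation axes described above.
MCQ_PROMPT_TEMPLATES = [
    # Question first, non-CoT, terse instruction.
    "{question}\n{choices}\nChoose the correct option. Reply with the option letter only.",
    # Instruction first, CoT, question at the end.
    "Read the following multiple choice question and reason step by step before answering.\n"
    "{choices}\n{question}\nFinish your response with 'The answer is <letter>'.",
    # Question in the middle, more detailed instruction.
    "You will be given a multiple choice question.\n{question}\n{choices}\n"
    "Briefly explain your reasoning, then state the final answer as a single letter.",
]

def render_prompt(template: str, question: str, choices: list[str]) -> str:
    """Fill one template with a question and its lettered answer choices."""
    lettered = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
    return template.format(question=question, choices=lettered)
```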

Non-Greedy Inference - We study the effect of the random seed during non-greedy inference. For factual questions, the model's underlying distribution should be sharp enough to be independent of the random seed used for next-token sampling. There is inherent randomness in the answer generation process, which may affect the "path" the model takes to arrive at an answer.
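
A minimal sketch of this setup, assuming a Hugging Face `transformers` causal LM; the model name, temperature, and `top_p` values below are placeholders rather than SCORE's actual inference configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def sample_answers(prompt: str, seeds: list[int]) -> list[str]:
    """Generate one sampled completion per random seed; only the seed changes."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    completions = []
    for seed in seeds:
        torch.manual_seed(seed)  # the only difference between runs
        output = model.generate(**inputs, do_sample=True, temperature=1.0,
                                top_p=1.0, max_new_tokens=512)
        completions.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return completions
```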

Choice Order Robustness - We test models against changes in the order of choices in MCQ datasets. We swap the order of the choices while ensuring the correct answer is always the same option (e.g., all correct answers are A, or all are B). Changing the order of the choices does not change the input's semantics, so models are expected to be robust to such a minimal change.
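
The reordering can be illustrated with a small helper that pins the gold answer to a chosen position while keeping the other choices in their original relative order; the function and argument names below are illustrative, not part of SCORE's code.

```python
def place_gold_at(choices: list[str], gold_idx: int, target_idx: int = 0) -> tuple[list[str], str]:
    """Return the reordered choices with the gold answer at `target_idx`,
    plus the new gold option letter."""
    others = [c for i, c in enumerate(choices) if i != gold_idx]
    reordered = others[:target_idx] + [choices[gold_idx]] + others[target_idx:]
    return reordered, chr(ord("A") + target_idx)

# Example: move the correct answer ("Paris") so it is always option C.
choices = ["Paris", "London", "Rome", "Berlin"]
reordered, gold = place_gold_at(choices, gold_idx=0, target_idx=2)
# reordered == ["London", "Rome", "Paris", "Berlin"], gold == "C"
```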

Datasets

MMLU-Pro - A massive multi-task understanding dataset tailored to more rigorously benchmark the capabilities of large language models.
AGIEval - A dataset specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests.
MATH - Challenging competition mathematics problems.

Metrics

Accuracy - We report macro accuracy for MMLU-Pro and micro accuracy for AGIEval and MATH. For all datasets, the average (minimum, maximum) accuracy across all experiments is reported (see the sketch after these definitions).
Consistency Rate - We use the consistency rate (CR) to measure the stability of model predictions. For each data point, CR measures the proportion of prediction pairs that are consistent with each other.
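
A minimal sketch of both metrics, assuming predictions are grouped per question across runs and that every question has at least two predictions; the exact aggregation used on the leaderboard may differ.

```python
from collections import defaultdict
from itertools import combinations

# preds[q]   -> list of predictions for question q across all runs
# gold[q]    -> reference answer for question q
# subject[q] -> category of question q (used only for macro accuracy)

def micro_accuracy(preds: dict, gold: dict) -> float:
    """Fraction of correct predictions over all (question, run) pairs."""
    correct = sum(p == gold[q] for q, runs in preds.items() for p in runs)
    total = sum(len(runs) for runs in preds.values())
    return correct / total

def macro_accuracy(preds: dict, gold: dict, subject: dict) -> float:
    """Average of per-subject accuracies, weighting every subject equally."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for q, runs in preds.items():
        per_subject[subject[q]][0] += sum(p == gold[q] for p in runs)
        per_subject[subject[q]][1] += len(runs)
    return sum(c / t for c, t in per_subject.values()) / len(per_subject)

def consistency_rate(preds: dict) -> float:
    """Proportion of agreeing prediction pairs per question, averaged over questions."""
    rates = []
    for runs in preds.values():
        pairs = list(combinations(runs, 2))  # assumes len(runs) >= 2
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates)
```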

| Model | Average CR | AGIEval Mean (Min, Max) | AGIEval CR | MMLU-Pro Mean (Min, Max) | MMLU-Pro CR | MATH Mean (Min, Max) | MATH CR |
|---|---|---|---|---|---|---|---|
| [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | 72.39 | 72.43 (65.34, 74.66) | 81.79 | 66.63 (55.16, 70.68) | 73.19 | 65.88 (64.58, 67.86) | 62.18 |
| [mistralai/Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) | 71.93 | 68.78 (61.41, 74.49) | 75.77 | 65.10 (50.28, 69.23) | 72.31 | 71.04 (69.66, 72.72) | 67.71 |
| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 69.11 | 69.71 (60.77, 71.20) | 83.13 | 58.75 (49.30, 63.16) | 75.24 | 51.29 (49.66, 54.20) | 48.96 |
| [01-ai/Yi-1.5-34B-Chat](https://huggingface.co/01-ai/Yi-1.5-34B-Chat) | 58.43 | 63.89 (50.85, 70.98) | 69.95 | 49.91 (36.47, 55.76) | 57.31 | 53.46 (51.70, 54.42) | 48.04 |
| [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 52.74 | 54.59 (44.62, 59.66) | 62.54 | 45.30 (32.34, 51.94) | 52.79 | 49.21 (46.88, 51.18) | 42.90 |
| [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | 49.46 | 51.57 (38.46, 63.80) | 58.70 | 40.63 (31.49, 47.65) | 51.43 | 42.91 (40.72, 45.22) | 38.26 |