Other Tasks

The following tasks are available, depending on the selected test suite:

| Test suite | Task | Description |
| --- | --- | --- |
| Nejumi Leaderboard 3 | Jaster | Measure the model’s ability to understand and process the Japanese language. |
| | JBBQ | Measure social bias in Japanese question answering by LLMs. |
| | JtruthfulQA | Measure the truthfulness of model answers to Japanese questions. |
| LM Evaluation Harness | ARC | Measure scientific reasoning on grade-school questions. |
| | GSM8K | Measure multi-step reasoning in math word problems. |
| | HellaSwag | Measure contextual commonsense reasoning. |
| | HumanEval | Measure Python code generation ability. |
| | IFEval | Measure instruction-following and harmful input rejection. |
| | LAMBADA | Measure long-range context understanding. |
| | MMLU | Measure reasoning across 57 academic/professional subjects. |
| | OpenBookQA | Measure science QA using facts and commonsense. |
| | PIQA | Measure physical commonsense reasoning. |
| | SciQ | Measure science multiple-choice QA for elementary & middle school levels. |
| | TruthfulQA | Measure truthfulness in open-domain question answering. |
| | Winogrande | Measure semantic understanding in pronoun disambiguation tasks. |
| VLM Evaluation Kit | ChartQA | Measure chart-based data interpretation and question answering skills. |
| | DocVQA | Measure question answering performance on document images. |
| | InfoVQA | Measure question answering based on information embedded in images. |
| | MTVQA | Measure multilingual visual-text question answering performance. |
| | OCRBench | Measure optical character recognition accuracy across varied datasets. |
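
Because the available tasks depend on the test suite, it can help to check a task selection against the suite before starting a run. The following is a minimal sketch of such a check, not the product's API: the `TASKS_BY_SUITE` mapping and the `unsupported_tasks` helper are hypothetical names, and the task lists are simply transcribed from the table above.

```python
# Hypothetical mapping of test suites to task names, transcribed from the
# table above. This is an illustration, not an official configuration format.
TASKS_BY_SUITE = {
    "Nejumi Leaderboard 3": ["Jaster", "JBBQ", "JtruthfulQA"],
    "LM Evaluation Harness": [
        "ARC", "GSM8K", "HellaSwag", "HumanEval", "IFEval", "LAMBADA",
        "MMLU", "OpenBookQA", "PIQA", "SciQ", "TruthfulQA", "Winogrande",
    ],
    "VLM Evaluation Kit": ["ChartQA", "DocVQA", "InfoVQA", "MTVQA", "OCRBench"],
}


def unsupported_tasks(suite, requested):
    """Return the requested tasks that the given suite does not offer.

    Hypothetical helper: raises KeyError for an unknown suite name.
    """
    if suite not in TASKS_BY_SUITE:
        raise KeyError(f"Unknown test suite: {suite!r}")
    available = set(TASKS_BY_SUITE[suite])
    # Preserve the caller's ordering so the result is easy to report back.
    return [task for task in requested if task not in available]


if __name__ == "__main__":
    # GSM8K belongs to LM Evaluation Harness, so it is flagged here.
    print(unsupported_tasks("VLM Evaluation Kit", ["DocVQA", "GSM8K"]))
```

An empty list means every requested task is offered by the chosen suite.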