Select Test Suite

Select the appropriate test suite. The suite determines which tests are run against the model.

We offer the following test suites:

| Test suite | Purpose | Best for | Tasks |
| --- | --- | --- | --- |
| Standard | Evaluate models using your own dataset. | Internal benchmarks, domain-specific tasks (e.g., finance, medical, …) | Test similarity, BLEU, Fuzzy match, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-LSUM |
| Nejumi Leaderboard 3 | Benchmark LLMs, especially for Japanese language tasks. Reference: Nejumi Leaderboard 3 | Comparing LLMs on Japanese language tasks. | Jaster, JBBQ, JtruthfulQA |
| LM Evaluation Harness | General framework to benchmark language models across many standard NLP benchmarks. Reference: LM Evaluation Harness | Evaluating English-centric LLMs and ensuring comparability with research literature | ARC, GSM8K, HellaSwag, HumanEval, IFEval, LAMBADA, MMLU, OpenBookQA, PIQA, SciQ, TruthfulQA, WinoGrande |
| VLM Evaluation Kit | Evaluate VLMs (Vision-Language Models) on multimodal tasks. Reference: VLMEvalKit | Testing multimodal models | ChartQA