Select Test Suite

Select the test suite that will be used to evaluate the model.

We offer the following test suites:

| Test suite | Purpose | Best for | Tasks |
| --- | --- | --- | --- |
| Standard | Evaluate models using your own dataset. | Internal benchmarks, domain-specific tasks (e.g., finance, medical, …) | Text similarity, BLEU, Fuzzy match, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-LSUM |
| Nejumi Leaderboard 3 | Benchmark LLMs, especially for Japanese language tasks. Reference: Nejumi Leaderboard 3 | Comparing LLMs on Japanese language tasks. | Jaster, JBBQ, JtruthfulQA |
| LM Evaluation Harness | General framework to benchmark language models across many standard NLP benchmarks. Reference: LM Evaluation Harness | Evaluating English-centric LLMs and ensuring comparability with research literature. | ARC, GSM8K, HellaSwag, HumanEval, IFEval, LAMBADA, MMLU, OpenBookQA, PIQA, SciQ, TruthfulQA, WinoGrande |
| VLM Evaluation Kit | Evaluate VLMs (Vision-Language Models) on multimodal tasks. Reference: VLMEvalKit | Testing multimodal models. | ChartQA, DocVQA, InfoVQA, MTVQA, OCRBench |
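
To give a feel for what the Standard suite's similarity metrics measure, here is a minimal Python sketch of a fuzzy match score and a ROUGE-1 F1 score between a model output and a reference answer. It is illustrative only: the function names `fuzzy_match` and `rouge_1_f1` are hypothetical, the whitespace tokenization is a simplifying assumption, and the platform's actual implementation may normalize and tokenize text differently.

```python
from collections import Counter
from difflib import SequenceMatcher


def fuzzy_match(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1], using difflib (an assumed stand-in)."""
    return SequenceMatcher(None, prediction, reference).ratio()


def rouge_1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Assumes lowercase, whitespace-split tokens; real scorers often add
    stemming or language-specific tokenization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "The invoice is due on 30 June"
    prediction = "The invoice is due on June 30"
    print("fuzzy match:", round(fuzzy_match(prediction, reference), 3))
    print("ROUGE-1 F1:", round(rouge_1_f1(prediction, reference), 3))
```

Both scores range from 0 to 1, with higher values meaning the model output is closer to the reference. ROUGE-1 ignores word order, so the reordered date in the usage example still counts as a full unigram match, while the character-level fuzzy score drops below 1.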