Finding Blindspots in LLM Evaluations with Interpretable Checklists

Description

We present FBI, our novel meta-evaluation framework designed to assess the robustness of evaluator LLMs across diverse tasks and evaluation strategies.

Downloads
Resource name    Link
Dataset          FBI

Details

Tasks

We manually categorized each prompt into one of the 4 task categories:

How to use from the Datasets library

from datasets import load_dataset

# Each task category is published as its own dataset configuration.
factual = load_dataset("ai4bharat/FBI", "factual")
instruction_following = load_dataset("ai4bharat/FBI", "instruction-following")
long_form = load_dataset("ai4bharat/FBI", "long-form")
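
When several configurations are needed at once, the loading calls above can be wrapped in a small helper. The following is a minimal sketch, not part of the dataset's official tooling: the names `load_fbi_configs`, `REPO_ID`, `CONFIGS`, and the optional `loader` parameter are hypothetical and introduced only for illustration. It covers only the three configuration names shown above.

```python
from typing import Any, Callable, Dict, Optional

REPO_ID = "ai4bharat/FBI"
# Only the configuration names that appear in the loading calls above.
CONFIGS = ("factual", "instruction-following", "long-form")


def load_fbi_configs(
    loader: Optional[Callable[[str, str], Any]] = None,
) -> Dict[str, Any]:
    """Load each listed configuration and return them keyed by config name.

    `loader` is a hypothetical hook (an assumption of this sketch) that
    makes it possible to substitute a stub for `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without the library.
        from datasets import load_dataset

        loader = load_dataset
    return {name: loader(REPO_ID, name) for name in CONFIGS}
```

Calling `load_fbi_configs()` with no argument uses `datasets.load_dataset` and downloads each configuration from the Hugging Face Hub. Printing each returned object then shows the splits and columns that the configuration actually provides.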

Citation

If you use this repository or our models, please cite our work:

@article{doddapaneni2024finding,
  title   = {Finding Blind Spots in Evaluator LLMs with Interpretable Checklists},
  author  = {Sumanth Doddapaneni and Mohammed Safi Ur Rahman Khan and Sshubam Verma and Mitesh M. Khapra},
  year    = {2024},
  journal = {arXiv preprint arXiv:2406.13439}
}