We present FBI, our novel meta-evaluation framework designed to assess the robustness of evaluator LLMs across diverse tasks and evaluation strategies.
| Resource | Link |
|---|---|
| Dataset | [FBI](https://huggingface.co/datasets/ai4bharat/FBI) |
We manually categorized each prompt into one of four task categories. Each category is available as a separate dataset configuration:
```python
from datasets import load_dataset

# Factual correctness prompts
ds = load_dataset("ai4bharat/FBI", "factual")
```

```python
from datasets import load_dataset

# Instruction-following prompts
ds = load_dataset("ai4bharat/FBI", "instruction-following")
```

```python
from datasets import load_dataset

# Long-form writing prompts
ds = load_dataset("ai4bharat/FBI", "long-form")
```
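The paper's fourth ability, reasoning, should load the same way. A minimal sketch follows, assuming the configuration is named `reasoning`; check the dataset card for the exact name.

```python
from datasets import load_dataset

# Reasoning prompts -- the config name "reasoning" is an assumption;
# verify the exact configuration name on the dataset card.
ds = load_dataset("ai4bharat/FBI", "reasoning")
```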
If you use this repository or our models, please cite our work:
```bibtex
@article{doddapaneni2024finding,
  title   = {Finding Blind Spots in Evaluator LLMs with Interpretable Checklists},
  author  = {Sumanth Doddapaneni and Mohammed Safi Ur Rahman Khan and Sshubam Verma and Mitesh M. Khapra},
  year    = {2024},
  journal = {arXiv preprint arXiv:2406.13439}
}
```