Description
How to integrate MagicPIG with other benchmark frameworks mentioned in the paper?
🙏 Acknowledgment
First, thank you @dreaming-panda for this excellent work! MagicPIG presents a compelling approach to efficient LLM generation using LSH sampling, and the results shown in the paper are very impressive.
🤔 Issue Description
I'm trying to reproduce the experimental results mentioned in your paper, but I'm encountering difficulties integrating MagicPIG with the benchmark frameworks used in your evaluation.
According to the paper, you evaluated MagicPIG on three major categories of tasks:
1. Mid-context comprehensive tasks from lm-eval-harness
- GSM8K-CoT (Cobbe et al., 2021)
- MMLU-Flan-CoT-Fewshot (Hendrycks et al., 2020)
- COQA (Reddy et al., 2019)
2. Long context tasks from LongBench (Bai et al., 2023)
- QASPER (Dasigi et al., 2021)
- LCC
- RepoBench-P (Liu et al., 2023)
- TriviaQA (Joshi et al., 2017)
- PRE
- TREC (Li and Roth, 2002; Hovy et al., 2001)
3. Synthetic tasks from RULER (Hsieh et al., 2024)
- 13 synthetic tasks with 50 examples per task
🔍 Current Status
Looking at the current repository, I can see that:
✅ RULER integration is provided - There are clear instructions and scripts in evaluations/RULER/ for running RULER benchmarks with MagicPIG
❌ Missing integrations for other benchmarks - There's no guidance on how to use MagicPIG with:
- lm-eval-harness for the mid-context tasks
- The evaluation framework used for the 6 long context tasks from LongBench (Bai et al., 2023)
🎯 Specific Questions
1. lm-eval-harness Integration:
- How can I integrate MagicPIG with lm-eval-harness to evaluate on GSM8K-CoT, MMLU-Flan-CoT-Fewshot, and COQA?
- Do I need to modify the lm-eval-harness codebase or create a custom adapter?
- Are there specific configuration files or scripts that should be used?
2. Long Context Tasks Integration:
- What evaluation framework did you use for the 6 long context tasks from LongBench (Bai et al., 2023)? Did you use the official LongBench scripts or a custom pipeline?
- How can I reproduce these evaluations with MagicPIG?
- Are there specific preprocessing steps or evaluation scripts needed?
3. General Integration Pattern:
- Is there a general pattern or API that can be used to integrate MagicPIG with arbitrary evaluation frameworks?
- Could you provide a minimal example showing how to wrap MagicPIG for use with other benchmarking tools? To make my questions concrete, I've included two rough sketches below of what I imagine the integration might look like.
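For the lm-eval-harness side, this is the kind of adapter I imagine would be needed. The harness side follows its documented custom-model interface (subclassing lm_eval.api.model.LM), but the `engine` object and its `generate`/`score` methods are pure placeholders for whatever generation entry point MagicPIG actually exposes; that missing entry point is exactly what I'm asking about.

```python
from typing import List, Tuple

import lm_eval
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class MagicPIGLM(LM):
    """Wraps a (hypothetical) MagicPIG engine behind the lm-eval-harness LM interface."""

    def __init__(self, engine):
        super().__init__()
        self.engine = engine  # placeholder: whatever object MagicPIG uses for generation

    def generate_until(self, requests: List[Instance]) -> List[str]:
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args
            # Placeholder call: generate a continuation with MagicPIG's LSH-sampled attention.
            outputs.append(self.engine.generate(context, **gen_kwargs))
        return outputs

    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        results = []
        for request in requests:
            context, continuation = request.args
            # Placeholder call: score `continuation` given `context`.
            logprob, is_greedy = self.engine.score(context, continuation)
            results.append((logprob, is_greedy))
        return results

    def loglikelihood_rolling(self, requests: List[Instance]) -> List[float]:
        # Not needed for the three tasks above, so left unimplemented in this sketch.
        raise NotImplementedError


# Intended usage (task names are my guess at the matching harness tasks):
# results = lm_eval.simple_evaluate(
#     model=MagicPIGLM(engine),
#     tasks=["gsm8k_cot", "mmlu_flan_cot_fewshot", "coqa"],
# )
```

For the long context tasks, my current guess is a LongBench-style prediction loop over the Hugging Face THUDM/LongBench datasets, again with the same placeholder `engine`. The config names are my best guess (I'm not sure which config "PRE" maps to), and the naive prompt construction stands in for LongBench's per-task templates:

```python
from datasets import load_dataset

# LongBench subsets corresponding to the tasks in the paper (my best guess at config names).
tasks = ["qasper", "lcc", "repobench-p", "triviaqa", "trec"]

predictions = {}
for task in tasks:
    data = load_dataset("THUDM/LongBench", task, split="test")
    preds = []
    for sample in data:
        # LongBench defines per-task prompt templates; this concatenation is only a stand-in.
        prompt = sample["context"] + "\n\n" + sample["input"]
        # Placeholder MagicPIG call again -- is this the intended entry point?
        preds.append(engine.generate(prompt, max_new_tokens=128))
    predictions[task] = preds
# Scoring would then use LongBench's official metric scripts.
```

If this is roughly the intended pattern, even a pointer to the correct MagicPIG entry points and any required preprocessing would be enough for me to finish the integration.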
🎯 Impact
This missing documentation/code is preventing researchers from:
- Fully reproducing the paper's results
- Comparing MagicPIG with other methods on the same benchmarks
- Adopting MagicPIG for their own evaluation pipelines
Thank you for your time and consideration!