Description
How to integrate MagicPIG with other benchmark frameworks mentioned in the paper?
🙏 Acknowledgment
First, thank you @dreaming-panda for this excellent work! MagicPIG presents a compelling approach to efficient LLM generation using LSH sampling, and the results shown in the paper are very impressive.
🤔 Issue Description
I'm trying to reproduce the experimental results mentioned in your paper, but I'm encountering difficulties integrating MagicPIG with the benchmark frameworks used in your evaluation.
According to the paper, you evaluated MagicPIG on three major categories of tasks:
1. Mid-context comprehensive tasks from lm-eval-harness
- GSM8K-CoT (Cobbe et al., 2021)
- MMLU-Flan-CoT-Fewshot (Hendrycks et al., 2020)
- COQA (Reddy et al., 2019)
2. Long context tasks from LongBench (Bai et al., 2023)
- QASPER (Dasigi et al., 2021)
- LCC
- RepoBench-P (Liu et al., 2023)
- TriviaQA (Joshi et al., 2017)
- PRE
- TREC (Li and Roth, 2002; Hovy et al., 2001)
3. Synthetic tasks from RULER (Hsieh et al., 2024)
- 13 synthetic tasks with 50 examples per task
🔍 Current Status
Looking at the current repository, I can see that:
✅ RULER integration is provided - There are clear instructions and scripts in evaluations/RULER/ for running RULER benchmarks with MagicPIG
❌ Missing integrations for other benchmarks - There's no guidance on how to use MagicPIG with:
- lm-eval-harness for the mid-context tasks
- The evaluation framework used for the 6 long context tasks from LongBench (Bai et al., 2023)
🎯 Specific Questions
1. lm-eval-harness Integration:
- How can I integrate MagicPIG with lm-eval-harness to evaluate on GSM8K-CoT, MMLU-Flan-CoT-Fewshot, and COQA?
- Do I need to modify the lm-eval-harness codebase or create a custom adapter?
- Are there specific configuration files or scripts that should be used?
2. Long Context Tasks Integration:
- What evaluation framework did you use for the 6 long context tasks from LongBench (Bai et al., 2023)? Did you use the official LongBench scripts or a custom pipeline?
- How can I reproduce these evaluations with MagicPIG?
- Are there specific preprocessing steps or evaluation scripts needed?
3. General Integration Pattern:
- Is there a general pattern or API that can be used to integrate MagicPIG with arbitrary evaluation frameworks?
- Could you provide a minimal example showing how to wrap MagicPIG for use with other benchmarking tools? To make my questions concrete, I've included two rough sketches below of what I imagine the integration might look like.
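For the lm-eval-harness side, this is the kind of adapter I imagine would be needed. The harness side follows its documented custom-model interface (subclassing lm_eval.api.model.LM), but the `engine` object and its `generate`/`score` methods are pure placeholders for whatever generation entry point MagicPIG actually exposes; that missing entry point is exactly what I'm asking about.

```python
from typing import List, Tuple

import lm_eval
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class MagicPIGLM(LM):
    """Wraps a (hypothetical) MagicPIG engine behind the lm-eval-harness LM interface."""

    def __init__(self, engine):
        super().__init__()
        self.engine = engine  # placeholder: whatever object MagicPIG uses for generation

    def generate_until(self, requests: List[Instance]) -> List[str]:
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args
            # Placeholder call: generate a continuation with MagicPIG's LSH-sampled attention.
            outputs.append(self.engine.generate(context, **gen_kwargs))
        return outputs

    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        results = []
        for request in requests:
            context, continuation = request.args
            # Placeholder call: score `continuation` given `context`.
            logprob, is_greedy = self.engine.score(context, continuation)
            results.append((logprob, is_greedy))
        return results

    def loglikelihood_rolling(self, requests: List[Instance]) -> List[float]:
        # Not needed for the three tasks above, so left unimplemented in this sketch.
        raise NotImplementedError


# Intended usage (task names are my guess at the matching harness tasks):
# results = lm_eval.simple_evaluate(
#     model=MagicPIGLM(engine),
#     tasks=["gsm8k_cot", "mmlu_flan_cot_fewshot", "coqa"],
# )
```

For the long context tasks, my current guess is a LongBench-style prediction loop over the Hugging Face THUDM/LongBench datasets, again with the same placeholder `engine`. The config names are my best guess (I'm not sure which config "PRE" maps to), and the naive prompt construction stands in for LongBench's per-task templates:

```python
from datasets import load_dataset

# LongBench subsets corresponding to the tasks in the paper (my best guess at config names).
tasks = ["qasper", "lcc", "repobench-p", "triviaqa", "trec"]

predictions = {}
for task in tasks:
    data = load_dataset("THUDM/LongBench", task, split="test")
    preds = []
    for sample in data:
        # LongBench defines per-task prompt templates; this concatenation is only a stand-in.
        prompt = sample["context"] + "\n\n" + sample["input"]
        # Placeholder MagicPIG call again -- is this the intended entry point?
        preds.append(engine.generate(prompt, max_new_tokens=128))
    predictions[task] = preds
# Scoring would then use LongBench's official metric scripts.
```

If this is roughly the intended pattern, even a pointer to the correct MagicPIG entry points and any required preprocessing would be enough for me to finish the integration.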
🎯 Impact
This missing documentation/code is preventing researchers from:
- Fully reproducing the paper's results
- Comparing MagicPIG with other methods on the same benchmarks
- Adopting MagicPIG for their own evaluation pipelines
Thank you for your time and consideration!