Thank you for your great work on InternAgent. The project demonstrates strong performance on reasoning benchmarks such as HLE, which is very impressive.
However, I could not find the evaluation code or scripts for running InternAgent on HLE (or similar reasoning benchmarks) in the repository. It would be very helpful if the team could share the corresponding evaluation pipeline, including the data preprocessing steps, prompt templates, and inference settings, to support reproducibility and further research.