CodeScope
An Execution-based Multilingual Multitask Multidimensional Benchmark
for Evaluating LLMs on Code Understanding and Generation
(2023)
CodeScope is an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. It covers 43 programming languages and 8 coding tasks, and evaluates the coding performance of LLMs along three dimensions (perspectives): difficulty, efficiency, and length.
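Because the benchmark is execution-based, generated code is judged by running it against tests rather than by textual similarity to a reference solution. The sketch below illustrates that general idea in Python; it is only a minimal illustration, and the function names (`passes_tests`, `pass_rate`) and the stdin/stdout test-case format are assumptions, not CodeScope's actual evaluation harness or API.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(generated_code: str,
                 test_cases: list[tuple[str, str]],
                 timeout_s: float = 5.0) -> bool:
    """Run a generated Python program against (stdin, expected_stdout) pairs.

    Hypothetical helper for illustration only: a sample is counted as correct
    only if every test case produces the expected output within the timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True
    finally:
        os.remove(path)


def pass_rate(samples: list[tuple[str, list[tuple[str, str]]]]) -> float:
    """Fraction of generated solutions that pass all of their test cases."""
    passed = sum(passes_tests(code, tests) for code, tests in samples)
    return passed / len(samples) if samples else 0.0
```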
Category | Task | Detailed Result | #Languages | #Test Samples | Avg. #Tokens/Sample
---|---|---|---|---|---
Ranking | Model | Organization | CodeScope (Understanding) | CodeScope (Generation) | CodeScope (Overall)
---|---|---|---|---|---
@misc{yan2023codescope,
  title={CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation},
  author={Weixiang Yan and Haitian Liu and Yunkun Wang and Yunzhe Li and Qian Chen and Wen Wang and Tingyu Lin and Weishan Zhao and Li Zhu and Shuiguang Deng and Hari Sundaram},
  year={2023},
  eprint={2311.08588},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Once you have built an LLM that meets your expectations, you can submit its test results for official reporting. To submit your model's evaluation results on CodeScope, please provide the following information:
Submit the above information by email to yanweixiang.ywx@gmail.com, and we will respond within 72 hours.
Have any questions about CodeScope? Please contact us at yanweixiang.ywx@gmail.com or open an issue on GitHub.