CodeScope
An Execution-based Multilingual Multitask Multidimensional Benchmark
for Evaluating LLMs on Code Understanding and Generation
(2023)
CodeScope is an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. It covers 43 programming languages and 8 coding tasks, and evaluates the coding performance of LLMs along three dimensions (perspectives): difficulty, efficiency, and length.
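Because the benchmark is execution-based, generated code is judged by running it against tests rather than by textual similarity to a reference solution. The sketch below illustrates that general idea in Python; it is only a minimal illustration, and the function names (`passes_tests`, `pass_rate`) and the stdin/stdout test-case format are assumptions, not CodeScope's actual evaluation harness or API.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(generated_code: str,
                 test_cases: list[tuple[str, str]],
                 timeout_s: float = 5.0) -> bool:
    """Run a generated Python program against (stdin, expected_stdout) pairs.

    Hypothetical helper for illustration only: a sample is counted as correct
    only if every test case produces the expected output within the timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True
    finally:
        os.remove(path)


def pass_rate(samples: list[tuple[str, list[tuple[str, str]]]]) -> float:
    """Fraction of generated solutions that pass all of their test cases."""
    passed = sum(passes_tests(code, tests) for code, tests in samples)
    return passed / len(samples) if samples else 0.0
```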
Category | Task | Detailed Result | #Languages | #Test Samples | Avg. #Tokens/Sample
---|---|---|---|---|---
Ranking | Model | Organization | CodeScope (Understanding) | CodeScope (Generation) | CodeScope (Overall)
---|---|---|---|---|---
@misc{yan2023codescope,
  title={CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation},
  author={Weixiang Yan and Haitian Liu and Yunkun Wang and Yunzhe Li and Qian Chen and Wen Wang and Tingyu Lin and Weishan Zhao and Li Zhu and Shuiguang Deng and Hari Sundaram},
  year={2023},
  eprint={2311.08588},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Once you have built an LLM that meets your expectations, you can submit its test results for official reporting. To submit your model's evaluation results on CodeScope, please provide the following information:
Submit the above information by email to yanweixiang.ywx@gmail.com, and we will respond within 72 hours.
Have any questions about CodeScope? Please contact us at yanweixiang.ywx@gmail.com or open an issue on GitHub.