Agri-Eval: Multi-level Large Language Model Valuation Benchmark for Agriculture

Fund Project:

National Key Research and Development Program of China (2024YFD2000805)

    Abstract:

    Evaluating models on benchmark datasets is an important way to measure the capability of large language models (LLMs) in specific domains, chiefly their knowledge and reasoning abilities. To better assess LLMs in the agricultural domain, this paper proposes Agri-Eval, a benchmark for evaluating the knowledge and reasoning abilities of LLMs in agriculture. The Agri-Eval dataset covers seven major agricultural disciplines (crop science, horticulture, plant protection, animal husbandry, forestry, aquaculture science, and grassland science) and contains 2283 questions in total. Among domestic (Chinese) general-purpose LLMs, DeepSeek-R1 performed best, with an accuracy of 75.49%. Among international general-purpose LLMs, Gemini-2.0-pro-exp-02-05 ranked first, with an accuracy of 74.28%. Shennong V2.0 (神农V2.0), an LLM built for the agricultural vertical, outperformed all domestic general-purpose LLMs overall, and its accuracy on agricultural knowledge question answering exceeded that of all existing general-purpose LLMs. The release of Agri-Eval helps developers comprehensively evaluate a model's capability in agriculture through diverse tasks and tests, thereby promoting the development of LLMs for agriculture.

Cite this article

WANG Yaojun, GE Mingliang, XU Guowei, ZHANG Qiyu, BIE Yuhui. Agri-Eval: Multi-level Large Language Model Valuation Benchmark for Agriculture[J]. Transactions of the Chinese Society for Agricultural Machinery (农业机械学报), 2026, 57(1): 290-299.

History
  • Received: 2024-09-21
  • Published online: 2026-01-01