Abstract: Model evaluation using benchmark datasets is an important method for measuring the capability of large language models (LLMs) in specific domains, primarily assessing their knowledge and reasoning abilities. To better assess the capability of LLMs in the agricultural domain, we propose Agri-Eval, a benchmark for evaluating the agricultural knowledge and reasoning abilities of LLMs. The Agri-Eval dataset covers seven major disciplines of agriculture: crop science, horticulture, plant protection, animal husbandry, forest science, aquaculture science, and grass science, and contains a total of 2,283 questions. Among Chinese general-purpose LLMs, DeepSeek-R1 performed best, with an accuracy of 75.49%. Among international general-purpose LLMs, Gemini-2.0-pro-exp-02-05 stood out as the top performer, achieving an accuracy of 74.28%. As a vertical-domain LLM for agriculture, Shennong V2.0 outperformed all Chinese LLMs, and its accuracy on agricultural knowledge questions exceeded that of all existing general-purpose LLMs. The release of Agri-Eval helps LLM developers comprehensively evaluate model capability in the agricultural domain through a variety of tasks and tests, promoting the development of LLMs for agriculture.