Pomelo Fractal Tree Image Generative Data Augmentation Method Using Vision-language Models

doi:10.6041/j.issn.1000-1298.2026.01.029

Transactions of the Chinese Society for Agricultural Machinery, 2026, 57(1): 311-318, 338.

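Funding: Guangdong Province "14th Five-Year Plan" Top Ten Main Directions of Agricultural Science and Technology Innovation "Open Competition" Project (2024KJ27), Guangzhou Key Research and Development Program Project (2024B03J1355), and 2025 Jiaying University Research Project (325E0317)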

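Abstract (translated from the Chinese):

To reduce the dependence of fruit object detection, for pomelo and similar crops, on large amounts of annotated data, this paper proposes a pomelo fractal tree image generative augmentation method that incorporates vision-language models. The method needs only 3 to 5 unlabeled real images to generate a large-scale annotated training dataset without any training. First, a text-prompted zero-shot segmentation model (Grounded segment anything model, Grounded SAM) extracts pomelo tree components. Next, the Stable Diffusion model generates random backgrounds from text prompts. Finally, an improved fractal tree algorithm generates pomelo trees to increase diversity and realism. Experiments used the lightweight version of YOLO v10 for validation. On a self-built pomelo object detection dataset collected in unstructured environments, with 0, 8, 16, 32, and 64 real images in the training set, the method raised the model's mean average precision over intersection-over-union thresholds from 0.50 to 0.95 (mAP50-95) by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8%, respectively. With 221 real images and 512 generated images in the training set, the model reached its best performance: precision 76.9%, recall 62.7%, mAP50 70.3%, and mAP50-95 38.4%. When the method was transferred to an orange object detection task, the gains at the same data scales were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%, respectively. With 1302 real images and 512 generated images in the training set, the model likewise reached its best performance: precision 90.3%, recall 87.8%, mAP50 94.0%, and mAP50-95 54.0%. The results show that the method effectively expands training data in zero-shot and few-shot learning scenarios, improves the object detection performance of the lightweight YOLO v10 model, and generalizes well.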
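
For reference, the mAP50-95 metric named above can be written out as follows. The abstract only states the threshold range from 0.50 to 0.95; the step of 0.05 (ten thresholds) and the per-class averaging over C classes are assumptions based on the common COCO-style convention, not details given by the authors.

```latex
\mathrm{mAP}_{50\text{-}95}
  = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{mAP}_{t},
\qquad
\mathrm{mAP}_{t} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_{c}(t)
```

Here AP_c(t) is the average precision of class c when a prediction counts as correct only if its intersection over union with a ground-truth box is at least t, so mAP50 is the single-threshold case t = 0.50.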

Abstract:

To address the heavy reliance of fruit object detection tasks, such as pomelo detection, on large amounts of annotated data, a pomelo tree image generative data augmentation method based on vision-language models was proposed. The approach required only 3 to 5 unlabeled real images to generate a large-scale labeled dataset, which can be used to train object detection models and enhance their performance in zero-shot and few-shot scenarios. The method consisted of three main stages. First, real pomelo tree components (including fruits and leaves) were extracted from unlabeled images by using the grounded segment anything model (Grounded SAM). Second, Stable Diffusion was used to create diverse background images from textual descriptions, increasing the complexity and variability of the training data. Third, a modified fractal tree algorithm was employed to construct structurally diverse pomelo trees, integrating real components with synthetic backgrounds to produce a variety of tree images with automatically generated annotations. Experimental results on pomelo object detection with the YOLO v10 model (Nano version) showed that the proposed method improved mAP50-95 by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8% when the number of real training images was 0, 8, 16, 32, and 64, respectively. With 221 real and 512 generated images, the model achieved its best performance: precision was 76.9%, recall was 62.7%, mAP50 was 70.3%, and mAP50-95 was 38.4%. When the method was transferred to orange detection under the same data conditions, the performance gains were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%. With 1302 real and 512 generated images, the model achieved its best overall performance: precision was 90.3%, recall was 87.8%, mAP50 was 94.0%, and mAP50-95 was 54.0%, demonstrating strong generalization ability.
Compared with tree images generated on blank backgrounds, the proposed method performed consistently well across all training set sizes, whereas the blank-background approach excelled only in the zero-shot setting. Against traditional data augmentation techniques such as mosaic, the method performed better under low-shot conditions in pomelo detection. In orange detection it was not the best in every individual case, but it achieved the best overall results under the default configuration of Ultralytics YOLO. In summary, the proposed method effectively mitigated the limitations caused by insufficient labeled data when training fruit object detection models and offered promising practical value and scalability.
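
The third stage, fractal tree construction with automatic annotation, can be illustrated with a minimal sketch. This is not the authors' modified algorithm, which the abstract does not describe; it is a plain randomized binary fractal tree under stated assumptions. The names `Fruit`, `grow`, `yolo_labels`, and `generate_tree`, the angle and shrink ranges, the fruit probability, and the square fruit patch are all hypothetical. Pasting Grounded SAM cut-outs onto Stable Diffusion backgrounds is omitted; the sketch only shows how fruit positions chosen during tree growth yield bounding-box labels for free.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Fruit:
    cx: float    # centre x in pixels
    cy: float    # centre y in pixels
    size: float  # side length of the (assumed square) fruit patch in pixels


def grow(x, y, angle, length, depth, rng, segments, fruits,
         fruit_prob=0.5, fruit_size=40.0):
    """Recursively grow one branch; hang a fruit on some terminal twigs."""
    # End point of this branch (image y axis points down, so subtract).
    x2 = x + length * math.cos(angle)
    y2 = y - length * math.sin(angle)
    segments.append((x, y, x2, y2))
    if depth == 0:
        if rng.random() < fruit_prob:
            fruits.append(Fruit(x2, y2, fruit_size))
        return
    # Two child branches with a randomized spread angle and shrink factor.
    for sign in (-1, 1):
        child_angle = angle + sign * math.radians(rng.uniform(15, 40))
        child_length = length * rng.uniform(0.6, 0.8)
        grow(x2, y2, child_angle, child_length, depth - 1, rng,
             segments, fruits, fruit_prob, fruit_size)


def yolo_labels(fruits, width, height, class_id=0):
    """Convert fruit patches to YOLO rows (class, cx, cy, w, h) in [0, 1]."""
    rows = []
    for f in fruits:
        # Clip the box to the canvas before normalizing.
        x1 = max(0.0, f.cx - f.size / 2)
        y1 = max(0.0, f.cy - f.size / 2)
        x2 = min(float(width), f.cx + f.size / 2)
        y2 = min(float(height), f.cy + f.size / 2)
        if x2 <= x1 or y2 <= y1:
            continue  # fruit fell entirely outside the canvas
        rows.append((class_id,
                     (x1 + x2) / 2 / width, (y1 + y2) / 2 / height,
                     (x2 - x1) / width, (y2 - y1) / height))
    return rows


def generate_tree(width=640, height=640, depth=5, seed=0):
    """Return branch segments and YOLO labels for one synthetic tree."""
    rng = random.Random(seed)
    segments, fruits = [], []
    # Trunk starts at the bottom centre and points straight up.
    grow(width / 2, height, math.pi / 2, height * 0.3, depth, rng,
         segments, fruits)
    return segments, yolo_labels(fruits, width, height)
```

Calling `generate_tree(seed=k)` for different values of `k` gives structurally different trees, and each call returns the labels alongside the geometry, so no manual annotation step is needed. In a full pipeline the segments would be rendered with extracted branch and leaf textures and each `Fruit` replaced by a segmented fruit cut-out.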

Cite this article

LAI Liqian, DUAN Jieli, YANG Zhou, YUAN Haotian. Pomelo Fractal Tree Image Generative Data Augmentation Method Using Vision-language Models[J]. Transactions of the Chinese Society for Agricultural Machinery (农业机械学报), 2026, 57(1): 311-318, 338.

History

Received: 2025-05-16
Published online: 2026-01-01
