Abstract: To reduce the heavy reliance on large amounts of annotated data in fruit object detection tasks such as pomelo detection, a generative data augmentation method for pomelo tree images was proposed based on vision-language models. The approach required only 3 to 5 unlabeled real images to generate a large-scale labeled dataset, which could be used to train object detection models and enhance their performance in zero-shot and few-shot scenarios. The method consisted of three main stages. Firstly, real pomelo tree components (including fruits and leaves) were extracted from unlabeled images by using the grounded segment anything model (Grounded SAM). Secondly, Stable Diffusion was used to create diverse background images from textual descriptions, increasing the complexity and variability of the training data. Thirdly, a modified fractal tree algorithm was employed to construct structurally diverse pomelo trees, integrating the real components with the synthetic backgrounds to produce a variety of tree images together with automatically generated annotations. Experimental results on pomelo object detection with the YOLO v10 model (Nano version) showed that the proposed method improved mAP50-95 by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8% when the number of real training images was 0, 8, 16, 32, and 64, respectively. With 221 real and 512 generated images, the model achieved its best performance: precision was 76.9%, recall was 62.7%, mAP50 was 70.3%, and mAP50-95 was 38.4%. When the method was transferred to orange detection under the same data conditions, the corresponding performance gains were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%. With 1302 real and 512 generated images, the orange detection model achieved its best overall performance: precision was 90.3%, recall was 87.8%, mAP50 was 94.0%, and mAP50-95 was 54.0%, demonstrating strong generalization ability.
Compared with tree images generated on blank backgrounds, the proposed method delivered consistent gains across all training set sizes, whereas the blank-background variant performed well only in the zero-shot setting. Compared with traditional data augmentation techniques such as mosaic, the proposed method performed better under low-shot conditions in pomelo detection; in orange detection it was not the best in every individual case, but it achieved the best overall results under the default configuration of Ultralytics YOLO. In summary, the proposed method effectively mitigated the limitations caused by insufficient labeled data when training fruit object detection models, and it offered promising practical value and scalability.
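The third stage (building a tree structure and deriving annotations from it automatically) can be illustrated with a minimal sketch. The abstract does not specify the modified fractal tree algorithm, so the code below is a generic randomized recursive fractal tree and not the authors' implementation. All function names, the branching parameters, the fixed square fruit size, and the YOLO-style normalized label format are assumptions made for illustration. The sketch only computes label geometry; pasting the real fruit and leaf crops onto generated backgrounds is omitted.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixels: centre (cx, cy), width w, height h."""
    cx: float
    cy: float
    w: float
    h: float


def grow(x, y, angle, length, depth, rng, tips):
    """Recursively grow branches and collect branch tips as attachment points."""
    # Image coordinates: y increases downward, so subtract to grow upward.
    x2 = x + length * math.cos(angle)
    y2 = y - length * math.sin(angle)
    if depth == 0:
        tips.append((x2, y2))
        return
    # Assumed source of structural diversity: random child count, angle, length.
    for _ in range(rng.randint(2, 3)):
        child_angle = angle + rng.uniform(-0.6, 0.6)
        child_length = length * rng.uniform(0.6, 0.8)
        grow(x2, y2, child_angle, child_length, depth - 1, rng, tips)


def place_fruits(tips, img_w, img_h, fruit_px, fruit_prob, rng):
    """Attach a fruit to a random subset of tips; keep boxes fully inside the image."""
    boxes = []
    half = fruit_px / 2
    for x, y in tips:
        if rng.random() >= fruit_prob:
            continue
        if half <= x <= img_w - half and half <= y <= img_h - half:
            boxes.append(Box(x, y, fruit_px, fruit_px))
    return boxes


def to_yolo_lines(boxes, img_w, img_h, class_id=0):
    """Format boxes as 'class cx cy w h' with coordinates normalised to [0, 1]."""
    return [
        f"{class_id} {b.cx / img_w:.6f} {b.cy / img_h:.6f} "
        f"{b.w / img_w:.6f} {b.h / img_h:.6f}"
        for b in boxes
    ]


def synth_labels(seed, img_w=640, img_h=640):
    """Generate one synthetic tree layout and return its label lines."""
    rng = random.Random(seed)  # seeded so that each layout is reproducible
    tips = []
    # The trunk starts at the bottom centre and points straight up.
    grow(img_w / 2, img_h - 1, math.pi / 2, img_h * 0.25, 4, rng, tips)
    boxes = place_fruits(tips, img_w, img_h, fruit_px=40, fruit_prob=0.5, rng=rng)
    return to_yolo_lines(boxes, img_w, img_h)


if __name__ == "__main__":
    for line in synth_labels(seed=0):
        print(line)
```

Because each fruit position is chosen by the generator, its bounding box is known without manual labeling, which is the property that makes the annotations automatic. Varying the seed yields different tree structures and fruit layouts.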