Abstract: In facility (greenhouse) tomato cultivation environments, fruit overlap and occlusion can reduce the accuracy of automated fruit picking. To address this issue, an instance segmentation model based on YOLACT was proposed. First, the categories of fruit overlap and occlusion were subdivided, and the corresponding portion of the dataset was enlarged to simulate real picking scenes and improve recognition accuracy in picking decisions. Second, the Simple Copy-Paste data augmentation method was employed to strengthen the model's generalization ability and reduce the interference of environmental factors with instance segmentation. Next, multiscale feature extraction was introduced into YOLACT to overcome the limitation of single-scale feature extraction and reduce model complexity. Finally, the Swin-S attention mechanism from the Swin Transformer was incorporated to improve the extraction of fine-grained features for tomato instance segmentation. Experimental results demonstrated that the model alleviates missed and false detections in segmentation results to a certain extent. It achieved an average object detection accuracy of 93.9%, an improvement of 10.4, 4.5, 16.3, and 3.9 percentage points over YOLACT, YOLOv8-x, Mask R-CNN, and InstaBoost, respectively. Its average segmentation accuracy was 80.6%, which was 4.8, 1.5, 7.3, and 4.3 percentage points higher than those of the same models, respectively. The inference speed of the model was 25.6 frames/s. Overall, the model exhibited stronger robustness and real-time performance in terms of comprehensive performance, effectively meeting both accuracy and speed requirements, and can serve as a valuable reference for tomato picking robots performing visual tasks.