Abstract: To further improve the accuracy and speed of recognizing truss-harvested cherry tomatoes for automated harvesting in facility environments, a lightweight cherry tomato truss recognition model based on an improved transformer was proposed. Firstly, a cherry tomato dataset covering various lighting conditions and harvesting postures was constructed, and the postures of cherry tomato trusses were categorized. Then, the lightweight recognition model was built on an improved RT-DETR. The model replaced the original RT-DETR backbone with the lightweight EfficientViT network, which significantly reduced the number of parameters and the computational complexity. Additionally, an adaptive detail fusion module was designed to efficiently process and merge feature maps of different scales while further lowering computational complexity. Finally, a weighted function sliding mechanism and the exponential moving average concept were introduced to optimize the loss function, which addressed uncertainty in sample classification. Experimental results demonstrated that the lightweight model achieved high recognition accuracy (90.00%) while enabling fast detection (41.2 frames/s) at low computational cost (8.7×10^9 FLOPs). Compared with the original network model, Faster R-CNN, and Swin Transformer, the average recognition accuracy improved by 1.24%~15.38%, the frames processed per second (FPS) increased by 25.61%~255.17%, and the floating-point operations were reduced by 69.37%~92.37%. The model exhibited strong robustness in overall performance, balanced accuracy and speed, and can serve as a reference for the visual tasks of tomato harvesting robots.
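The abstract does not give the formula for the loss optimization, so the following is only a minimal sketch of how a sliding weight function and an exponential moving average could be combined. It assumes a piecewise weighting of the kind used in earlier slide-loss work, with a threshold `mu` tracked as a running average of per-batch mean IoU. The function names, the 0.1 margin, the decay of 0.9, the initial threshold, and the IoU values are all illustrative assumptions, not the authors' settings.

```python
import math

def ema_update(prev, value, decay=0.9):
    """Exponential moving average: blend the running estimate with a new value.
    The decay of 0.9 is an assumed, illustrative setting."""
    return decay * prev + (1.0 - decay) * value

def slide_weight(iou, mu, margin=0.1):
    """Assumed piecewise weight that emphasises samples near the threshold mu.
    Samples well below mu keep weight 1; samples at or above mu-margin
    receive a weight greater than or equal to 1 that decays as IoU grows."""
    if iou <= mu - margin:
        return 1.0
    if iou < mu:
        return math.exp(1.0 - mu)
    return math.exp(1.0 - iou)

# Hypothetical per-batch IoU values between predictions and ground truth.
batches = [[0.2, 0.5, 0.8], [0.3, 0.6, 0.9], [0.4, 0.7, 0.95]]

mu = 0.5  # assumed initial threshold
for ious in batches:
    batch_mean = sum(ious) / len(ious)
    mu = ema_update(mu, batch_mean)          # smooth the threshold over batches
    weights = [slide_weight(x, mu) for x in ious]
    print(f"mu={mu:.3f} weights={[round(w, 3) for w in weights]}")
```

In a sketch like this, the weights would multiply the per-sample classification loss, so that ambiguous samples near the threshold contribute more, while the moving average keeps the threshold from jumping between batches.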