Abstract: Image semantic segmentation is one of the key technologies for obtaining phenotypic information of maize plants. Traditional fully supervised semantic segmentation methods typically rely on large numbers of pixel-level labels. However, maize exhibits significant morphological variability across growth stages, which makes image annotation costly and limits the practical application of such models in real-world production. To eliminate the need for manual annotation during model training, a self-supervised few-shot semantic segmentation network for maize plant images (MSDANet) was proposed, aiming to improve semantic segmentation accuracy and model generalization for maize plant images across different growth stages. MSDANet utilized a superpixel-based self-supervised learning method to generate pseudo labels, enabling the construction of preliminary supervision signals for the support-set images without manual annotation. A mixed masking mechanism (MM) was designed that applied pseudo-label-based semantic masking to construct diverse masked samples in the feature space, encouraging the model to learn more robust feature representations and thereby improving segmentation accuracy against complex backgrounds. To address the complex morphologies of maize plants in images, such as bending, overlapping, and occlusion, a multi-scale deformable large kernel attention mechanism (MS-DLKA) was designed for the model; by integrating multi-scale receptive fields and deformable convolutions, it flexibly perceives important structural information of maize plants at different scales, effectively improving semantic segmentation accuracy.
When validated on a small-sample dataset, MSDANet achieved mIoU and FB-IoU of 75.63% and 87.12%, respectively, in the 1-shot setting; in the 5-shot setting, mIoU and FB-IoU reached 76.04% and 87.21%, respectively, outperforming the other models of the same type compared in this study. Moreover, compared with current mainstream fully supervised few-shot semantic segmentation models, mIoU improved by 2.9 and 2.93 percentage points under the 1-shot and 5-shot settings, respectively. The results demonstrated that MSDANet can achieve high-precision semantic segmentation of maize plant images without manual labels and with few samples, providing technical support for maize image analysis and plant phenotyping across different growth stages.
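The abstract does not specify how the superpixel-based pseudo labels are produced, so the following is only a minimal, hypothetical sketch of the general idea: partition the image into superpixel-like regions, then mark a region as "plant" when a vegetation index suggests it is green-dominant. The fixed grid used here is a stand-in for real superpixels (e.g., SLIC), and the excess-green thresholding is an assumption, not the paper's method.

```python
import numpy as np

def superpixel_pseudo_labels(image, cell=4, green_thresh=0.0):
    """Toy pseudo-label generator for plant segmentation.

    Assumptions (not from the paper): a fixed grid of `cell` x `cell`
    blocks plays the role of superpixels, and a block is labeled
    'plant' (1) when its mean excess-green index 2G - R - B exceeds
    `green_thresh`; otherwise it is labeled background (0).
    """
    h, w, _ = image.shape
    img = image.astype(np.float32)
    # Excess-green vegetation index, computed per pixel.
    exg = 2.0 * img[..., 1] - img[..., 0] - img[..., 2]
    labels = np.zeros((h, w), dtype=np.uint8)
    for i in range(0, h, cell):
        for j in range(0, w, cell):
            # Label the whole block by its mean vegetation response.
            if exg[i:i + cell, j:j + cell].mean() > green_thresh:
                labels[i:i + cell, j:j + cell] = 1
    return labels
```

In a full pipeline, such pseudo labels would then serve as the preliminary supervision signal for the support-set images in place of manual annotation.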