Abstract: Information on rice pests and diseases mostly originates from unstructured text. These texts contain densely nested entities, lengthy sentences, and complex grammatical structures, so current named entity recognition (NER) methods struggle to identify all of the relevant entities. To address this problem, the AgriRoBERTa-BiLSTM-Man-CRF model was proposed. Firstly, a pre-training corpus for the agricultural domain and a labeled dataset for rice pest and disease named entity recognition were constructed, providing high-quality data for model training. Secondly, RoBERTa was further pre-trained on agricultural texts with whole-word masking. This enabled the model to attend to the complete meaning of Chinese words and to learn the language patterns specific to texts about rice diseases and pests. Finally, a Manhattan attention mechanism was introduced to capture sparse features in high-dimensional space by using the L1 distance. It quantified feature differences while focusing on critical contextual information, thereby improving the accuracy of entity boundary recognition. Experimental results showed that the proposed model achieved an F1 score of 90.69% for entity recognition, with a precision of 87.87% and a recall of 93.71%. The F1 score was 7.8, 9.99, 1.8, and 15.9 percentage points higher than those of four conventional models, BiLSTM-CRF, BiLSTM-Attention-CRF, BERT-BiLSTM-CRF, and IDCNN-CRF, respectively. This improvement indicated that the model can recognize the various entities in rice pest and disease texts more effectively.
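The idea behind the Manhattan attention mechanism mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function name `manhattan_attention`, the use of the negated L1 distance as the attention score, and the scaling by the square root of the feature dimension (borrowed from standard dot-product attention). It is not the authors' implementation.

```python
import math

def manhattan_attention(queries, keys, values):
    """Attention whose scores are negated L1 (Manhattan) distances.

    queries: list of n vectors of length d
    keys:    list of m vectors of length d
    values:  list of m vectors
    Returns (outputs, weights). Hypothetical helper, for illustration only.
    """
    d = len(queries[0])
    scale = math.sqrt(d)  # assumed scaling, as in dot-product attention
    outputs, weights = [], []
    for q in queries:
        # A smaller L1 distance means a closer key, so the distance is negated.
        scores = [-sum(abs(qi - ki) for qi, ki in zip(q, k)) / scale
                  for k in keys]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy example: the query coincides with the first key.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_out, toy_w = manhattan_attention([[1.0, 0.0]], toy_keys, toy_values)
```

In this toy case the first key receives the largest weight because its L1 distance to the query is zero, and the weight falls off as the distance grows. In a model of the kind described, the queries, keys, and values would presumably be derived from the BiLSTM outputs, but that wiring is not specified in the abstract.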