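Abstract:
To address the low accuracy of semantics-based short-text similarity methods in short-text classification, a short-text similarity algorithm that incorporates part of speech (GCSSA) is proposed. Building on the HowNet ("知网") semantics-based short-text similarity method, the approach incorporates category feature words and adds part-of-speech analysis of keywords. It assigns different weight coefficients to keywords according to whether they are category feature words and according to their part-of-speech information, so as to distinguish how much terms with different contributions matter in short-text similarity calculation. Experiments show that, when the similarity computed by this algorithm is applied to short-text classification, the macro-averaged and micro-averaged accuracy improve by about 4% over the HowNet-based short-text classification algorithm, effectively improving short-text classification accuracy.
Keywords: short-text classification; short-text similarity; part of speech; HowNet semantics; classification accuracy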
DOI:

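Funding: National Natural Science Foundation of China (11547148); Science and Technology Program of the Chongqing Municipal Education Commission (16SKGH133); Chongqing Social Science Planning Doctoral Project (2015BS059)
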
A grammatical category-combined short-text similarity algorithm and its application in text categorization |
HUANG Xianying,LI Qindong,LIU Yingtao |
Abstract:
To address the low accuracy of HowNet-based short-text similarity calculation in short-text categorization, a grammatical category-combined short-text similarity algorithm (GCSSA) is proposed. Building on the HowNet-based semantic similarity calculation for short texts, the method incorporates category feature words and adds grammatical category (part-of-speech) analysis of keywords. It assigns different weights to keywords according to whether they are category feature words and according to their grammatical category, so as to distinguish how much each term contributes to short-text similarity. Experiments show that, compared with the HowNet-based short-text categorization algorithm, the proposed method improves macro-averaged and micro-averaged accuracy by about 4% in short-text categorization, effectively improving categorization accuracy.
Key words: short-text categorization; short-text similarity; grammatical category; HowNet semantics; categorization accuracy
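
The weighting idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' GCSSA implementation: the weight values (`POS_WEIGHT`, `FEATURE_WORD_BOOST`), the best-match aggregation, and the word-level similarity function `toy_word_sim` (a stand-in for a HowNet-based measure) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Dict, List, Set, Tuple

# A keyword is (word, part-of-speech tag); the tag set here is illustrative.
Keyword = Tuple[str, str]

# Hypothetical weights: the abstract states that weights differ by part of
# speech and by category-feature-word status, but gives no concrete values.
POS_WEIGHT: Dict[str, float] = {"n": 1.0, "v": 0.8, "a": 0.5}
DEFAULT_POS_WEIGHT = 0.3
FEATURE_WORD_BOOST = 1.5


def keyword_weight(kw: Keyword, feature_words: Set[str]) -> float:
    """Weight of a keyword from its POS tag, boosted if it is a feature word."""
    word, pos = kw
    weight = POS_WEIGHT.get(pos, DEFAULT_POS_WEIGHT)
    return weight * FEATURE_WORD_BOOST if word in feature_words else weight


def directed_similarity(a: List[Keyword], b: List[Keyword],
                        word_sim: Callable[[str, str], float],
                        feature_words: Set[str]) -> float:
    """Weighted average of each keyword in `a` matched to its best word in `b`."""
    if not a or not b:
        return 0.0
    total = 0.0
    weight_sum = 0.0
    for kw in a:
        w = keyword_weight(kw, feature_words)
        best = max(word_sim(kw[0], other[0]) for other in b)
        total += w * best
        weight_sum += w
    return total / weight_sum


def text_similarity(a: List[Keyword], b: List[Keyword],
                    word_sim: Callable[[str, str], float],
                    feature_words: Set[str]) -> float:
    """Symmetric similarity: mean of the two directed scores."""
    return 0.5 * (directed_similarity(a, b, word_sim, feature_words)
                  + directed_similarity(b, a, word_sim, feature_words))


def toy_word_sim(x: str, y: str) -> float:
    """Stand-in for a HowNet-based word similarity (values are made up)."""
    if x == y:
        return 1.0
    related = {frozenset(("football", "soccer")): 0.9}
    return related.get(frozenset((x, y)), 0.1)


text_a = [("football", "n"), ("watch", "v")]
text_b = [("soccer", "n"), ("play", "v")]
with_features = text_similarity(text_a, text_b, toy_word_sim, {"football", "soccer"})
without_features = text_similarity(text_a, text_b, toy_word_sim, set())
```

In this toy example, marking `football` and `soccer` as category feature words raises the similarity score relative to the unboosted case, because the strongly related noun pair then carries more of the total weight than the weakly related verbs.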