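Abstract:
To address problems such as academic misconduct, copyright infringement, privacy protection, and public opinion monitoring caused by the misuse of artificial intelligence (AI)-generated text, a natural language processing-based algorithm for recognizing and detecting AI-generated text is proposed. The algorithm first uses the continuous bag-of-words model of Word2vec to convert words into word vectors, and sums the word vectors to obtain a text vector. It then applies the softmax function to obtain the probability distribution of the text vector, analyzes the basic patterns of AI-generated text through statistical visualization, and uses cosine similarity to determine the text type. Next, a support vector machine recursive feature elimination algorithm determines whether the text was generated by AI, and the K-nearest neighbor algorithm determines how many times the text has been regenerated, further refining the granularity of text detection. Simulation experiments verify the effectiveness of the algorithm, with results showing a recognition accuracy of 80% or above.
Key words: AI-generated text detection; text vector; cosine similarity; support vector machine (SVM); K-nearest neighbor (KNN) algorithm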
DOI:10.20079/j.issn.1001-893x.240727001

Funding: General Program of the National Natural Science Foundation of China (61971473)
|
An Artificial Intelligence-generated Text Recognition and Detection Method |
WANG Yuxin,LIU Kefei,LI Xuelian,WANG Hongjun |
(1.College of Information Science and Engineering,Hohai University,Changzhou 213200,China;2.Guangling College of Yangzhou University,Yangzhou 225000,China;3.College of Electronic Engineering,National University of Defense Technology,Hefei 230031,China) |
Abstract: |
To address issues such as academic dishonesty, copyright infringement, privacy protection, and public opinion monitoring stemming from the misuse of artificial intelligence (AI)-generated texts, a recognition and detection algorithm based on natural language processing (NLP) is proposed. The algorithm first converts words into vectors using the continuous bag-of-words (CBOW) model within Word2vec and sums them into text vectors. It then applies softmax to obtain their probability distribution, analyzes the fundamental patterns of AI-generated texts with statistical visualization, and determines the type of text using cosine similarity. Next, a support vector machine recursive feature elimination (SVM-RFE) algorithm is used to determine whether the text is generated by AI. For AI-generated texts, the K-nearest neighbor (KNN) algorithm estimates the extent of AI involvement, further refining the granularity of text detection. Finally, simulation experiments demonstrate the algorithm's effectiveness, with a recognition accuracy of 80% or above.
Key words: AI-generated text detection; text vector; cosine similarity; support vector machine (SVM); K-nearest neighbor (KNN) algorithm
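
The vector steps named in the abstract (summing word vectors into a text vector, applying softmax, comparing by cosine similarity, and voting with K-nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional `TOY_EMBEDDINGS` table stands in for trained CBOW vectors, the function names are hypothetical, the use of Euclidean distance in the KNN vote is an assumption, and the SVM-RFE stage is omitted.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors; a real system would use
# embeddings trained with the Word2vec CBOW model.
TOY_EMBEDDINGS = {
    "model": [0.9, 0.1, 0.0],
    "text": [0.7, 0.2, 0.1],
    "river": [0.0, 0.8, 0.6],
    "walk": [0.1, 0.9, 0.3],
}


def text_vector(tokens, embeddings):
    """Sum the word vectors of all known tokens into one text vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:  # unknown words are skipped
            total = [a + b for a, b in zip(total, vec)]
    return total


def softmax(vec):
    """Map a vector to a probability distribution (numerically stable)."""
    peak = max(vec)
    exps = [math.exp(v - peak) for v in vec]
    norm = sum(exps)
    return [e / norm for e in exps]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, labeled, k=3):
    """Majority label among the k samples closest to query.

    Euclidean distance is an assumption made for this sketch.
    """
    ranked = sorted(labeled, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    doc_a = softmax(text_vector(["model", "text"], TOY_EMBEDDINGS))
    doc_b = softmax(text_vector(["river", "walk"], TOY_EMBEDDINGS))
    print("cosine similarity:", cosine_similarity(doc_a, doc_b))
```

In this sketch a label passed to `knn_predict` could be a regeneration count or any other class; the choice of `k` and of the distance measure would need to be tuned on real data.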