A Data Stream Clustering Method Based on Scalable Subspace Learning
YIN Hongwei, NI Yuzhou, HU Wenjun
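(1. School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang, China; 2. Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, Huzhou 313000, Zhejiang, China; 3. Huzhou Key Laboratory of Aquatic Robot Technology, Huzhou 313000, Zhejiang, China)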

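Abstract:

Traditional data stream clustering methods lack the ability to reduce the dimensionality of high-dimensional data online, which limits their clustering performance. To address this problem, a data stream clustering method based on scalable subspace learning (Scalable Subspace Learning for Clustering Data Streams, S2LCStream) is proposed. First, scalable subspace learning establishes a projection relationship between historical data and newly arrived data, and the new data are projected into the subspace spanned by the historical data to obtain their cluster assignments in real time. Second, to keep the cluster assignments accurate at different times, the continuously arriving data stream is checked for consistency of data distribution to capture any concept drift, and a backtracking mechanism adjusts the cluster assignments to adapt to the dynamically changing data distribution. Finally, tests on multiple real-world datasets verify the effectiveness of the proposed method in processing high-dimensional data streams. Specifically, S2LCStream maintains high clustering accuracy, and when handling concept drift its processing time is markedly better than that of EmCStream.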

Keywords: data stream clustering; subspace learning; scalable subspace learning; concept drift detection

DOI：10.20079/j.issn.1001-893x.240618002

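Funding: National Natural Science Foundation of China (62206094); Huzhou Public Welfare Applied Research Project (2021GZ05); Open Project of the Jiangsu Province Engineering Laboratory of Cyberspace Security (SDGC2237); Postgraduate Research and Innovation Project of Huzhou University (2024KYCX62)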

Scalable Subspace Learning for Clustering Data Streams

YIN Hongwei, NI Yuzhou, HU Wenjun

(1. School of Information Engineering, Huzhou University, Huzhou 313000, China; 2. Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, Huzhou 313000, China; 3. Huzhou Key Laboratory of Aquatic Robot Technology, Huzhou 313000, China)

Abstract:

Traditional data stream clustering methods lack online dimensionality reduction capabilities for high-dimensional data, which limits their clustering performance. To address this issue, a method named Scalable Subspace Learning for Clustering Data Streams (S2LCStream) is proposed. First, the method establishes a projection relationship between historical data and new data through scalable subspace learning, projecting the new data into the subspace spanned by the historical data to obtain their cluster assignments in real time. Second, to maintain the accuracy of cluster assignments over time, the method checks the continuously arriving data stream for consistency of data distribution, captures concept drift, and adjusts the cluster assignments through a backtracking mechanism to adapt to dynamically changing data distributions. Finally, the proposed method is validated on multiple real-world datasets, demonstrating its effectiveness in handling high-dimensional data streams. Specifically, S2LCStream maintains high clustering accuracy while significantly outperforming EmCStream in processing time when handling concept drift.

Key words: data stream clustering; subspace learning; scalable subspace learning; concept drift detection
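
The two mechanisms named in the abstract, projecting new data into a subspace spanned by historical data and checking arriving data for distribution consistency, can be illustrated with a minimal sketch. This is not the S2LCStream algorithm: the abstract does not give its formulation, so plain PCA (via SVD) stands in for scalable subspace learning, nearest-centroid assignment stands in for the clustering step, and a reconstruction-residual threshold stands in for the consistency check. The backtracking step is not sketched. All function names, the synthetic data, and the `factor=3.0` threshold are assumptions made for illustration.

```python
import numpy as np

def fit_subspace(history, dim):
    """Learn an orthonormal basis from historical data (PCA via SVD, a stand-in)."""
    mean = history.mean(axis=0)
    _, _, vt = np.linalg.svd(history - mean, full_matrices=False)
    return mean, vt[:dim].T  # basis has shape (n_features, dim)

def project(x, mean, basis):
    """Project points into the subspace spanned by the historical data."""
    return (x - mean) @ basis

def assign(z, centroids):
    """Assign each projected point to its nearest centroid."""
    dists = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def residual(x, mean, basis):
    """Distance from each point to its reconstruction from the subspace."""
    centered = x - mean
    return np.linalg.norm(centered - centered @ basis @ basis.T, axis=1)

def drift_detected(batch, mean, basis, baseline, factor=3.0):
    """Hypothetical consistency check: flag a batch whose mean residual
    is much larger than the historical mean residual."""
    return bool(residual(batch, mean, basis).mean() > factor * baseline)

# --- synthetic demo (all data below is made up for illustration) ---
rng = np.random.default_rng(0)
n_features, dim = 50, 2
true_basis, _ = np.linalg.qr(rng.normal(size=(n_features, dim)))
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])  # two clusters in the latent space

def sample(basis, n):
    """Draw n points from two clusters embedded in a low-dimensional subspace."""
    labels = rng.integers(0, 2, size=n)
    latent = centers[labels] + 0.5 * rng.normal(size=(n, dim))
    x = latent @ basis.T + 0.05 * rng.normal(size=(n, n_features))
    return x, labels

# Historical data: learn the subspace, centroids, and a residual baseline.
history, hist_labels = sample(true_basis, 200)
mean, basis = fit_subspace(history, dim)
z_hist = project(history, mean, basis)
# The known labels stand in for the output of any clustering algorithm.
centroids = np.stack([z_hist[hist_labels == k].mean(axis=0) for k in range(2)])
baseline = residual(history, mean, basis).mean()

# New batch from the same distribution: project and assign without refitting.
new_x, new_labels = sample(true_basis, 20)
pred = assign(project(new_x, mean, basis), centroids)

# Batch generated in a different subspace: a crude model of concept drift.
other_basis, _ = np.linalg.qr(rng.normal(size=(n_features, dim)))
drift_x, _ = sample(other_basis, 20)

print("assignments match:", bool((pred == new_labels).all()))
print("drift on same distribution:", drift_detected(new_x, mean, basis, baseline))
print("drift on shifted subspace:", drift_detected(drift_x, mean, basis, baseline))
```

The point of the sketch is that new arrivals are handled with a single matrix product against a basis learned once from history, so the subspace need not be relearned per batch; a drift signal is what would trigger any readjustment.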