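| Abstract: |
| Traditional data stream clustering methods lack the ability to reduce the dimensionality of high-dimensional data online, which limits their clustering performance. To address this problem, a data stream clustering method based on scalable subspace learning (Scalable Subspace Learning for Clustering Data Streams, S2LCStream) is proposed. First, scalable subspace learning establishes a projection relationship between historical data and newly arrived data, and the new data are projected into the subspace spanned by the historical data to obtain their cluster assignments in real time. Second, to keep the cluster assignments accurate at different times, the continuously arriving data stream is checked for consistency of data distribution to capture concept drift, and a backtracking mechanism adjusts the cluster assignments to adapt to the changing data distribution. Finally, tests on multiple real-world datasets verify the effectiveness of the proposed method on high-dimensional data streams. Specifically, S2LCStream maintains high clustering accuracy while requiring markedly less processing time than EmCStream when handling concept drift. |
| Key words: data stream clustering; subspace learning; scalable subspace learning; concept drift detection |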
| DOI:10.20079/j.issn.1001-893x.240618002 |
|
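| Funding: National Natural Science Foundation of China (62206094); Huzhou Public Welfare Applied Research Project (2021GZ05); Open Project of the Jiangsu Engineering Laboratory of Cyberspace Security (SDGC2237); Huzhou University Postgraduate Research and Innovation Project (2024KYCX62) |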
|
| Scalable Subspace Learning for Clustering Data Streams |
| YIN Hongwei, NI Yuzhou, HU Wenjun |
| (1. School of Information Engineering, Huzhou University, Huzhou 313000, China; 2. Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, Huzhou 313000, China; 3. Huzhou Key Laboratory of Aquatic Robot Technology, Huzhou 313000, China) |
| Abstract: |
| Traditional data stream clustering methods lack online dimensionality reduction capabilities for high-dimensional data, which limits their clustering performance. To address this issue, a Scalable Subspace Learning for Clustering Data Streams (S2LCStream) method is proposed. First, the method establishes a projection relationship between historical data and new data through scalable subspace learning, projecting the new data into the subspace spanned by the historical data to obtain their clustering assignments in real time. Second, to maintain the accuracy of clustering assignments over time, the method performs consistency detection of the data distribution on the continuously arriving data stream, capturing concept drift and adjusting clustering assignments through a backtracking mechanism to adapt to dynamically changing data distributions. Finally, the proposed method is validated on multiple real-world datasets, demonstrating its effectiveness in handling high-dimensional data streams. Specifically, S2LCStream maintains high clustering accuracy while significantly outperforming EmCStream in processing time when handling concept drift. |
| Key words: data stream clustering; subspace learning; scalable subspace learning; concept drift detection |
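
The pipeline summarized in the abstract (project new data into a subspace learned from historical data, assign clusters there, and watch for a change in distribution) can be illustrated with a small sketch. This is not the S2LCStream algorithm: it assumes a plain PCA subspace in place of scalable subspace learning, nearest-centroid assignment, and a hypothetical drift rule that compares the reconstruction residual of a new batch with that of the historical data. All function names, the `ratio` threshold, and the synthetic data are illustrative assumptions.

```python
import numpy as np


def fit_subspace(X_hist, k):
    """Mean and top-k principal directions of the historical data (PCA stand-in)."""
    mean = X_hist.mean(axis=0)
    # Rows of Vt are orthonormal directions sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_hist - mean, full_matrices=False)
    return mean, Vt[:k].T  # basis has shape (d, k)


def project(X, mean, basis):
    """Coordinates of X in the subspace spanned by the historical data."""
    return (X - mean) @ basis


def residual(X, mean, basis):
    """Mean squared distance from the rows of X to the historical subspace."""
    recon = project(X, mean, basis) @ basis.T + mean
    return float(np.mean(np.sum((X - recon) ** 2, axis=1)))


def assign(Z, centroids):
    """Nearest-centroid cluster assignment in subspace coordinates."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def process_batch(X_new, mean, basis, centroids, base_residual, ratio=3.0):
    """Assign a new batch and flag possible drift (hypothetical residual rule)."""
    labels = assign(project(X_new, mean, basis), centroids)
    drift = residual(X_new, mean, basis) > ratio * base_residual
    return labels, drift


def sample(rng, mixing, n_per, noise=0.1):
    """Synthetic two-cluster data lying near a 2-D subspace of a d-dim space."""
    centers = np.array([[5.0, 0.0], [-5.0, 0.0]])
    latent = np.vstack([c + rng.normal(scale=0.5, size=(n_per, 2)) for c in centers])
    X = latent @ mixing + rng.normal(scale=noise, size=(2 * n_per, mixing.shape[1]))
    return X, np.repeat([0, 1], n_per)


rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 50))  # maps 2-D latent points into 50 dimensions

# "Historical" data: learn the subspace, the centroids, and a baseline residual.
X_hist, y_hist = sample(rng, mixing, n_per=100)
mean, basis = fit_subspace(X_hist, k=2)
Z_hist = project(X_hist, mean, basis)
centroids = np.vstack([Z_hist[y_hist == c].mean(axis=0) for c in (0, 1)])
base_residual = residual(X_hist, mean, basis)

# A new batch from the same distribution, then the same batch shifted off-subspace.
X_new, y_new = sample(rng, mixing, n_per=20)
labels, drift = process_batch(X_new, mean, basis, centroids, base_residual)
shift = 2.0 * rng.normal(size=50)
_, drift_shifted = process_batch(X_new + shift, mean, basis, centroids, base_residual)
```

In a real stream, a drift flag would trigger some form of re-estimation. The abstract describes a backtracking mechanism for this step, which the sketch does not attempt to reproduce.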