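| Abstract: |
| To further improve the accuracy of human action recognition and fully exploit the spatiotemporal features of action sequences, a graph convolution action recognition method based on spatiotemporal feature fusion and an attention mechanism is proposed. A spatial attention graph convolution refines the topology graph at the channel level to capture joint correlation features under different motion types, and a temporal multi-scale graph convolution module extends the temporal convolution structure to capture multi-scale temporal features. A multi-level feature fusion module is constructed that takes the initial features and the output features of the temporal multi-scale graph convolution as its input, uses a two-branch structure to obtain global and local channel features separately, and fuses spatiotemporal features along the channel dimension to strengthen the model's feature extraction capability. On this basis, a limb attention mechanism is proposed that partitions the human topological structure and computes attention weights along the channel dimension for each part, strengthening the model's focus on local action features. Experimental results show recognition accuracies of 93.0% and 96.9% under the CS and CV evaluation modes of the NTU RGB+D dataset, and 89.8% and 91.1% under the X-Sub and X-Set evaluation modes of the NTU RGB+D 120 dataset, all higher than those of ST-GCN, CTR-GCN, and other models. |
| Keywords: action recognition; human skeleton; graph convolution; spatiotemporal feature fusion; attention mechanism |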
| DOI:10.20079/j.issn.1001-893x.240722001 |
|
| Funding: Xi'an Science and Technology Plan Project (2020KJRC0070
|
| Graph Convolution Action Recognition Based on Spatiotemporal Feature Fusion and Attention Mechanism |
| WANG Xiaolu, TAN Yonghui, LI Xiaoting |
| (School of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, China) |
| Abstract: |
| In order to further improve the accuracy of human action recognition and fully explore the spatiotemporal features of action sequences, a graph convolution action recognition method based on spatiotemporal feature fusion and attention mechanism is proposed. Spatial attention graph convolution is used to refine the topology at the channel level and capture the correlation features of the joints under different motion types, and the temporal convolution structure is extended by a temporal multi-scale graph convolution module to capture multi-scale temporal features. A multi-level feature fusion module is constructed, which takes the initial features and the output features of the temporal multi-scale graph convolution as the module input, uses a two-branch structure to obtain the global and local channel features respectively, and fuses spatiotemporal features in the channel dimension to enhance the feature extraction ability of the model. On this basis, a limb attention mechanism is proposed to divide the human topological structure into parts and calculate the attention weight of each part in the channel dimension, enhancing the model's ability to attend to local action features. The experimental results show that the recognition accuracy is 93.0% and 96.9% in the CS and CV evaluation modes of the NTU RGB+D dataset, and 89.8% and 91.1% in the X-Sub and X-Set evaluation modes of the NTU RGB+D 120 dataset, respectively, which is higher than that of ST-GCN, CTR-GCN and other models. |
| Key words: human skeleton; action recognition; graph convolution; spatiotemporal feature fusion; attention mechanism |
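
The limb attention idea in the abstract (partition the skeleton, then compute channel-wise attention weights per part) can be illustrated with a minimal sketch. It is not the authors' implementation. The five-part split of a 25-joint skeleton, the names `LIMBS` and `limb_attention`, and the use of a plain sigmoid over average-pooled features in place of any learned layers are all assumptions made for illustration; the abstract does not specify the partition or the layers used.

```python
import math

# Hypothetical five-part split of a 25-joint skeleton (0-indexed joints).
# The partition actually used in the paper is not given in the abstract.
LIMBS = {
    "torso": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def limb_attention(features):
    """Reweight features[c][v] (channel c, joint v) part by part.

    For each body part, the features of its joints are average-pooled
    per channel, squashed with a sigmoid to get one weight per channel,
    and used to scale that part's joints. Learned layers are omitted
    (assumption), so this only shows the data flow.
    """
    num_channels = len(features)
    out = [row[:] for row in features]  # copy so the input is not modified
    weights = {}
    for name, joints in LIMBS.items():
        part_weights = []
        for c in range(num_channels):
            pooled = sum(features[c][v] for v in joints) / len(joints)
            part_weights.append(sigmoid(pooled))
        weights[name] = part_weights
        for c in range(num_channels):
            for v in joints:
                out[c][v] = features[c][v] * part_weights[c]
    return out, weights
```

In this sketch each part gets its own weight vector over channels, so two parts with different motion statistics scale the same channel differently, which is the sense in which the attention is local to a limb.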