Cite this article:
HE Jiaxing, LI Yanwen, LI Rou, et al. Design and Implementation of FFT Operator for Neural Network Processor—Taking Ascend910 as an Example[J]. Telecommunication Engineering, 2025, (12): 2160-2172.
DOI: 10.20079/j.issn.1001-893x.240607004
Foundation Item: National Natural Science Foundation of China (62275222, 61901397); Fundamental Research Funds for the Central Universities (2682020CX87); Sichuan Science and Technology Program (2020YJ0014); Southwest Jiaotong University Seed Fund (2682021GF027)
Design and Implementation of FFT Operator for Neural Network Processor—Taking Ascend910 as an Example
HE Jiaxing, LI Yanwen, LI Rou, ZHAI Pinghua, LI Yang, ZOU Xihua, PAN Wei, YAN Lianshan
(School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China)
Abstract:
To realize a fast Fourier transform (FFT) operator for neural network processors, this paper investigates high-performance, high-precision parallel computation of the FFT algorithm on such processors. Taking the Huawei Ascend910 neural network processor as an example and building on Huawei's Compute Architecture for Neural Networks (CANN), a high-performance FFT computing solution featuring cache slicing, efficient transposition, and vectorized butterfly calculations is designed, which realizes FFT computation of complex sequences of arbitrary length in both half precision and single precision. Experimental results show that, once the sequences reach a certain length, both accuracy and performance outperform those of a central processing unit (CPU). Furthermore, compared with NVIDIA's Compute Unified Device Architecture Fast Fourier Transform (cuFFT) library, the proposed solution achieves improvements of up to 16.5 times in performance and up to 48% in accuracy at typical lengths of half-precision data.
Key words: neural network processor; fast Fourier transform; Ascend Da Vinci architecture; domain-specific architecture; compute architecture for neural networks
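
To make the vectorized-butterfly idea in the abstract concrete, the following is a minimal NumPy sketch, not the paper's CANN/Ascend C operator: each FFT stage evaluates all of its twiddle multiplications and butterfly updates as whole-array vector operations instead of an inner scalar loop. The function name fft_radix2_vectorized, the power-of-two length restriction, and the complex64 data type are assumptions made for illustration; the paper's operator additionally covers cache slicing, transposition, arbitrary lengths, and half precision on the Ascend 910, none of which is reflected here.

import numpy as np

def fft_radix2_vectorized(x):
    """Iterative radix-2 DIT FFT whose butterflies are evaluated stage by
    stage as whole-array vector operations (illustrative sketch only)."""
    a = np.asarray(x, dtype=np.complex64)
    n = a.size
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two length"

    # Bit-reversal permutation so that the in-order butterflies below are valid.
    bits = n.bit_length() - 1
    rev = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    a = a[rev]

    half = 1
    while half < n:
        # Twiddle factors W_{2*half}^k for k = 0..half-1, computed once as a vector.
        w = np.exp(-2j * np.pi * np.arange(half) / (2 * half)).astype(np.complex64)
        a = a.reshape(n // (2 * half), 2 * half)     # one row per butterfly block
        top, bottom = a[:, :half], a[:, half:]
        t = bottom * w                               # vectorized twiddle multiply
        a[:, :half], a[:, half:] = top + t, top - t  # vectorized butterfly update
        a = a.reshape(n)
        half *= 2
    return a

On power-of-two inputs of moderate length, the sketch can be checked against NumPy's reference transform, e.g. np.allclose(fft_radix2_vectorized(x), np.fft.fft(x), atol=1e-3) for normalized complex64 data.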