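| Abstract: |
| To address the fixed output size of conventional denoising diffusion models and their inability to control generation on demand, a multi-scale conditional image generation model based on denoising diffusion models is proposed. The core of the model is an adaptive multi-scale fully convolutional network that accepts and generates images at arbitrary scales. First, the existing denoising diffusion framework is improved so that multi-scale diffusion provides hierarchical feature extraction. Then, upsampling/downsampling and a multi-scale adaptive fully convolutional module enable coarse-to-fine multi-scale learning on a single image. Finally, a conditional control loss function built on the pre-trained CLIP (Contrastive Language-Image Pre-training) model is proposed to guide the model so that text content controls conditional image generation. Experimental results show that the proposed method not only adapts to scale changes but also allows text to guide the generation process. In both unconditional and conditional generation, it shows significant improvements over current representative algorithms on metrics such as diversity, generation quality, and blocky artifacts; in particular, the Clipscore and Neural Image Assessment (NIMA) scores increase by 0.43 and 0.2, respectively. |
| Keywords: multi-scale image; conditional generation; denoising diffusion model; CLIP model |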
| DOI:10.20079/j.issn.1001-893x.241016003 |
|
| Funding: National Natural Science Foundation of China (62303433) |
|
| Multi-scale Image Conditional Generation Algorithm Based on Denoising Diffusion Models |
| XIANG Tao, ZHU Peipei, CHANG Xianghui, LIU Jie, LI Peng |
| (1. Southwest China Institute of Electronic Technology, Chengdu 610036, China; 2. National Key Laboratory of Complex Aviation System Simulation, Chengdu 610036, China; 3. School of Electronic and Information Engineering, Sichuan University, Chengdu 610065, China) |
| Abstract: |
| To address the fixed image size of traditional denoising diffusion models and their inability to control the generation process on demand, a multi-scale conditional image generation model based on denoising diffusion models is proposed. The core of the model is an adaptive multi-scale fully convolutional network that enables the input and generation of images at arbitrary scales. First, the existing denoising diffusion model framework is improved, and hierarchical feature extraction is achieved through multi-scale diffusion. Then, coarse-to-fine multi-scale learning on a single image is achieved through upsampling/downsampling and multi-scale adaptive full convolution. Finally, based on the pre-trained Contrastive Language-Image Pre-training (CLIP) model, a conditional control loss function is proposed to guide the model in controlling image generation through text content. The experimental results show that the proposed method can not only adapt to scale changes but also control the generation process through text. Compared with typical current algorithms, it achieves significant improvements on metrics such as diversity, generation quality, and blocky artifacts in both unconditional and conditional generation scenarios. In particular, the Clipscore and Neural Image Assessment (NIMA) scores are improved by 0.43 and 0.2, respectively. |
| Key words: multi-scale image; conditional generation; denoising diffusion model; CLIP model |
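
The two mechanisms named in the abstract, coarse-to-fine multi-scale processing and a CLIP-based conditional control loss, can be illustrated with a minimal sketch. The abstract gives neither the scale schedule nor the exact form of the loss, so everything below is an assumption made for illustration: the geometric downscaling ratio, the minimum side length, the function names `pyramid_sizes` and `cosine_guidance_loss`, and the choice of one minus cosine similarity as the loss. The embedding vectors stand in for the outputs of a pre-trained CLIP image encoder and text encoder, which are not loaded here.

```python
import math


def pyramid_sizes(height, width, num_scales, ratio=0.75, min_side=8):
    """Return (h, w) sizes ordered from coarsest to finest.

    Assumption: each coarser level shrinks both sides by `ratio`;
    the paper's actual schedule is not stated in the abstract.
    """
    sizes = []
    for level in range(num_scales):
        # level 0 is the full-resolution image; higher levels are coarser
        factor = ratio ** level
        h = max(min_side, round(height * factor))
        w = max(min_side, round(width * factor))
        sizes.append((h, w))
    # Reverse so that training/generation can proceed coarse to fine
    return sizes[::-1]


def cosine_guidance_loss(image_emb, text_emb):
    """Hypothetical conditional loss: 1 - cosine similarity.

    `image_emb` and `text_emb` are plain lists of floats standing in
    for CLIP image and text embeddings of equal length.
    """
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    if norm_i == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_i * norm_t)


if __name__ == "__main__":
    # Three scales for a 64x128 image, halving each side per level
    print(pyramid_sizes(64, 128, 3, ratio=0.5))
    # Aligned embeddings give a loss near 0, orthogonal ones a loss near 1
    print(cosine_guidance_loss([1.0, 0.0], [1.0, 0.0]))
    print(cosine_guidance_loss([1.0, 0.0], [0.0, 1.0]))
```

In a setup like this, the loss would be evaluated on the image produced at each scale and its gradient used to steer sampling toward the text prompt; whether the proposed model applies the loss at every scale or only at some of them is not stated in the abstract.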