A Survey of Data-Driven 2D Diffusion Models for Generating Images from Text

Authors

S. Fang

DOI:

https://doi.org/10.4108/airo.5453

Keywords:

2D Diffusion Model, DDPM, HighLDM, Imagen

Abstract

This paper surveys recent advances in generative modeling, focusing on denoising diffusion probabilistic models (DDPMs), high-resolution latent diffusion models (HighLDM), and Imagen. DDPMs use denoising score matching and iterative refinement to reverse a fixed noising process, improving likelihood estimation and enabling lossless compression. HighLDM advances high-resolution image synthesis by running diffusion in the latent space of an efficient pretrained autoencoder, where cross-attention during latent-space denoising adapts the model to diverse conditioning inputs. Imagen combines large transformer-based language models with cascaded high-resolution diffusion for state-of-the-art text-to-image generation: frozen pretrained text encoders guide the synthesis of highly realistic, semantically coherent images that surpass competing systems on FID scores and on human evaluations such as DrawBench. The review critically examines each model's methods, contributions, performance, and limitations, and offers a comprehensive comparison of their theoretical underpinnings and practical implications, with the aim of informing future generative-modeling research across applications.
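
To ground the mechanisms summarized above, the minimal sketch below (Python with NumPy, written for this page rather than taken from any of the cited papers) illustrates the two steps shared by all three models: the closed-form forward noising q(x_t | x_0) and the iterative reverse-time refinement of Ho et al. (2020), whose training objective reduces to noise prediction, L_simple = E[||eps - eps_theta(x_t, t)||^2]. The function denoise_net is a hypothetical stand-in for a trained U-Net noise predictor.

import numpy as np

T = 1000                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (Ho et al., 2020)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative products alpha_bar_t

def q_sample(x0, t, rng):
    # Forward process: sample x_t ~ q(x_t | x_0) in closed form.
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps                       # eps is the regression target for eps_theta

def denoise_net(x_t, t):
    # Hypothetical placeholder for the trained noise predictor eps_theta(x_t, t);
    # the surveyed models implement this as a (possibly text-conditioned) U-Net.
    return np.zeros_like(x_t)

def p_sample_loop(shape, rng):
    # Reverse process: iterative refinement from pure Gaussian noise x_T to x_0.
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoise_net(x, t)
        # Posterior mean of p(x_{t-1} | x_t) under the sigma_t^2 = beta_t choice.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                         # no noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
sample = p_sample_loop((3, 32, 32), rng)  # e.g. one 32x32 RGB image

HighLDM runs the same loop in an autoencoder's latent space, and Imagen conditions the noise predictor on frozen text-encoder embeddings; only the network and the space being denoised change.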

References

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S., 2015, June. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (pp. 2256-2265). PMLR.

Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, pp.6840-6851.

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. and Poole, B., 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M. and Chan, W., 2020. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713.

Kingma, D., Salimans, T., Poole, B. and Ho, J., 2021. Variational diffusion models. Advances in neural information processing systems, 34, pp.21696-21707.

Kong, Z., Ping, W., Huang, J., Zhao, K. and Catanzaro, B., 2020. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761.

Mittal, G., Engel, J., Hawthorne, C. and Simon, I., 2021. Symbolic music generation with diffusion models. arXiv preprint arXiv:2103.16091.

Dhariwal, P. and Nichol, A., 2021. Diffusion models beat GANs on image synthesis. Advances in neural information processing systems, 34, pp.8780-8794.

Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M. and Salimans, T., 2022. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47), pp.1-33.

Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J. and Norouzi, M., 2022. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4), pp.4713-4726.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T. and Ho, J., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35, pp.36479-36494.

Karras, T., Laine, S. and Aila, T., 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401-4410).

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J. and Aila, T., 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8110-8119).

Karras, T., Aila, T., Laine, S. and Lehtinen, J., 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Jahn, M., Rombach, R. and Ommer, B., 2021. High-resolution complex scene synthesis with transformers. arXiv preprint arXiv:2105.06458.

Park, T., Liu, M.Y., Wang, T.C. and Zhu, J.Y., 2019. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2337-2346).

Sylvain, T., Zhang, P., Bengio, Y., Hjelm, R.D. and Sharma, S., 2021, May. Object-centric image generation from layouts. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 3, pp. 2647-2655).

Sun, W. and Wu, T., 2021. Learning layout and style reconfigurable GANs for controllable image synthesis. IEEE transactions on pattern analysis and machine intelligence, 44(9), pp.5070-5087.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M. and Sutskever, I., 2021, July. Zero-shot text-to-image generation. In International conference on machine learning (pp. 8821-8831). PMLR.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. and Chen, M., 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J. and Sun, T., 2021. LAFITE: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. and Chen, M., 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.

Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022, October. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (pp. 89-106). Cham: Springer Nature Switzerland.

Zhang, H., Koh, J.Y., Baldridge, J., Lee, H. and Yang, Y., 2021. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 833-842).

Ye, H., Yang, X., Takac, M., Sunderraman, R. and Ji, S., 2021. Improving text-to-image synthesis using contrastive learning. arXiv preprint arXiv:2107.02423.

Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.Y., Wu, F. and Bao, B., 2020. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865.

Zhu, M., Pan, P., Chen, W. and Yang, Y., 2019. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5802-5810).

Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X. and He, X., 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1316-1324).

Published

22-04-2024

How to Cite

[1] S. Fang, "A Survey of Data-Driven 2D Diffusion Models for Generating Images from Text", EAI Endorsed Trans AI Robotics, vol. 3, Apr. 2024.