A Comprehensive Survey of Text Encoders for Text-to-Image Diffusion Models

Authors

S. Fang

DOI:

https://doi.org/10.4108/airo.5566

Keywords:

NLP, CLIP, T5-XXL, BERT, Text Encoder

Abstract

In this survey, we examine text encoders for text-to-image diffusion models, focusing on the principles, challenges, and opportunities associated with these encoders. We review state-of-the-art models, including BERT, T5-XXL, and CLIP, that have reshaped language understanding and cross-modal interaction. Their architectures and training techniques enable remarkable capabilities in generating images from textual descriptions, but they also face limitations such as computational complexity and data scarcity. We discuss these issues and highlight opportunities for further research. By providing a comprehensive overview, this survey aims to support the ongoing development of text-to-image diffusion models, enabling more accurate and efficient image generation from textual inputs.
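To make the conditioning pipeline concrete, the sketch below shows how a text prompt is turned into the token-level embeddings that a diffusion model attends to through cross-attention. It is a minimal illustration, assuming the Hugging Face transformers library and the publicly available openai/clip-vit-large-patch14 checkpoint; these are illustrative choices, not the specific setup of any model surveyed here.

```python
# Minimal sketch: encoding a prompt with a CLIP text encoder, as commonly done
# to condition text-to-image diffusion models. Assumes the Hugging Face
# `transformers` library and the `openai/clip-vit-large-patch14` checkpoint
# (illustrative assumptions, not the paper's own configuration).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a fox in a snowy forest"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = text_encoder(input_ids=tokens.input_ids)

# Per-token hidden states of shape (1, 77, 768). A diffusion U-Net typically
# attends to this sequence via cross-attention layers at each denoising step.
text_embeddings = output.last_hidden_state
print(text_embeddings.shape)
```

Other encoders discussed in the survey, such as BERT or T5-XXL, can in principle be substituted by swapping the tokenizer and encoder classes; what changes is the length, dimensionality, and semantics of the resulting embedding sequence that the diffusion model conditions on.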

References

Dai, A.M. and Le, Q.V., 2015. Semi-supervised sequence learning. Advances in neural information processing systems, 28.

Howard, J. and Ruder, S., 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.

Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, Vol.21, No.140, pp.1-67.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. Advances in neural information processing systems, 33, pp.1877-1901.

Lyu, Q., Apidianaki, M. and Callison-Burch, C., 2024. Towards faithful model explanation in nlp: A survey. Computational Linguistics, pp.1-70.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L., 2018. Deep contextualized word representations. In Proceedings of NAACL.

Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding with unsupervised learning.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger, G., 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748-8763.

McCann, B., Keskar, N.S., Xiong, C. and Socher, R., 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Yu, A.W., Dohan, D., Luong, M.T., Zhao, R., Chen, K., Norouzi, M. and Le, Q.V., 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S., 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19-27.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R., 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Williams, A., Nangia, N., & Bowman, S. R., 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P., 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Collobert, R., & Weston, J., 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pp. 160-167.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W. & Dean, J., 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V., 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y., 2019. Defending against neural fake news. Advances in neural information processing systems, 32.

Shazeer, N., & Stern, M., 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596-4604.

Edunov, S., Ott, M., Auli, M., & Grangier, D., 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Warstadt, A., Singh, A., & Bowman, S. R., 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625-641.

De Marneffe, M. C., Simons, M., & Tonhauser, J., 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, Vol. 23, No. 2, pp. 107-124.

Roemmele, M., Bejan, C. A., & Gordon, A. S., 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Koonce, B., 2021. ResNet 50. In Convolutional neural networks with Swift for TensorFlow: image recognition and dataset categorization, pp. 63-72.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L., 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115, 211-252.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., & Li, L. J., 2016. Yfcc100m: The new data in multimedia research. Communications of the ACM, Vol.59, No.2, 64-73.

Karkkainen, K., & Joo, J., 2021. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1548-1558.

Zhang, Y., Yin, Z., Li, Y., Yin, G., Yan, J., Shao, J., & Liu, Z., 2020. Celeba-spoof: Large-scale face anti-spoofing dataset with rich annotations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pp. 70-85.

Xia, Y., Sedova, A., de Araujo, P.H.L., Kougia, V., Nußbaumer, L. and Roth, B., 2024. Exploring prompts to elicit memorization in masked language model-based named entity recognition. arXiv preprint arXiv:2405.03004.

Hu, J. and Frank, M.C., 2024. Auxiliary task demands mask the capabilities of smaller language models. arXiv preprint arXiv:2404.02418.

Wiland, J., Ploner, M. and Akbik, A., 2024. BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models. arXiv preprint arXiv:2404.04113.

Aguiar, M., Zweigenbaum, P. and Naderi, N., 2024. SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials. arXiv preprint arXiv:2404.03977.

Yu, J., Kim, S.U., Choi, J. and Choi, J.D., 2024. What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models. arXiv preprint arXiv:2404.06621.

Thennal, D.K., Nathan, G. and Suchithra, M.S., 2024, May. Fisher Mask Nodes for Language Model Merging. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pp. 7349-7355.

Labrak, Y., Bazoge, A., Daille, B., Rouvier, M. and Dufour, R., 2024. How Important Is Tokenization in French Medical Masked Language Models?. arXiv preprint arXiv:2402.15010.

Naguib, M., Tannier, X. and Névéol, A., 2024. Few shot clinical entity recognition in three languages: Masked language models outperform LLM prompting. arXiv preprint arXiv:2402.12801.

Zalkikar, R. and Chandra, K., 2024. Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality. arXiv preprint arXiv:2402.13954.

Amirahmadi, A., Ohlsson, M., Etminani, K., Melander, O. and Björk, J., 2024. A Masked language model for multi-source EHR trajectories contextual representation learning. arXiv preprint arXiv:2402.06675.

Czinczoll, T., Hönes, C., Schall, M. and de Melo, G., 2024. NextLevelBERT: Investigating Masked Language Modeling with Higher-Level Representations for Long Documents. arXiv preprint arXiv:2402.17682.

Parra, I., 2024. UnMASKed: Quantifying Gender Biases in Masked Language Models through Linguistically Informed Job Market Prompts. arXiv preprint arXiv:2401.15798.

Toprak Kesgin, H. and Amasyali, M.F., 2024. Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling. arXiv e-prints, arXiv-2401.

Liang, W. and Liang, Y., 2024. DrBERT: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining. arXiv preprint arXiv:2401.15861.

Liu, Y., 2024. Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models. arXiv preprint arXiv:2401.11601.

Velasco, A., Palacio, D.N., Rodriguez-Cardenas, D. and Poshyvanyk, D., 2024. Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?. arXiv preprint arXiv:2401.01512.

Jeong, M., Kim, M., Lee, J.Y. and Kim, N.S., 2024. Efficient parallel audio generation using group masked language modeling. arXiv preprint arXiv:2401.01099.

Bellamy, D.R., Kumar, B., Wang, C. and Beam, A., 2023. Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data. arXiv preprint arXiv:2312.11502.

Shi, B., Zhang, X., Kong, D., Wu, Y., Liu, Z., Lyu, H. and Huang, L., 2024, April. General phrase debiaser: Debiasing masked language models at a multi-token level. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6345-6349.

Chen, T., Pertsemlidis, S., Watson, R., Kavirayuni, V.S., Hsu, A., Vure, P., Pulugurta, R., Vincoff, S., Hong, L., Wang, T. and Yudistyra, V., 2023. PepMLM: Target Sequence-Conditioned Generation of Peptide Binders via Masked Language Modeling. arXiv preprint arXiv:2403.04187.

Zhou, Y., Camacho-Collados, J. and Bollegala, D., 2023. A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models. arXiv preprint arXiv:2310.12936.

Luo, X., Yuan, Y., Chen, S., Zeng, N. and Wang, Z., 2020. Position-transitional particle swarm optimization-incorporated latent factor analysis. IEEE Transactions on Knowledge and Data Engineering, Vol.34, No.8, pp.3958-3970.

Chen, T., Li, S., Qiao, Y. and Luo, X., 2024. A Robust and Efficient Ensemble of Diversified Evolutionary Computing

Published

18-07-2024

How to Cite

[1]
S. Fang, “A Comprehensive Survey of Text Encoders for Text-to-Image Diffusion Models”, EAI Endorsed Trans AI Robotics, vol. 3, Jul. 2024.