Cross-Modal Contrastive Representation Learning for Multimedia Retrieval with Noisy Supervision
DOI: https://doi.org/10.4108/eetsis.10757

Abstract
Cross-modal contrastive representation learning has shown great potential for multimedia retrieval tasks by aligning heterogeneous modalities into a shared embedding space. However, its performance often degrades severely in real-world scenarios where supervision signals are noisy, such as mislabeled cross-modal pairs or ambiguous annotations. To address this challenge, we propose Adaptive Noise-Robust Contrastive Learning (ANRCL), a novel framework designed to enhance cross-modal representation robustness under noisy supervision. Specifically, ANRCL introduces an adaptive noise-robust contrastive loss that jointly exploits cross-modal consistency and intra-modal coherence to dynamically reweight training samples according to their estimated reliability. This mechanism effectively suppresses the influence of noisy pairs while reinforcing the contribution of high-confidence pairs. Experimental results on multiple benchmark datasets demonstrate that ANRCL consistently outperforms state-of-the-art methods in noisy supervision settings, achieving significant improvements in retrieval accuracy and robustness without sacrificing computational efficiency.
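The reweighting idea described above can be illustrated with a minimal sketch. The code below is not the paper's actual ANRCL loss; it assumes a simple hypothetical instantiation in which each pair's reliability is its clipped cross-modal cosine agreement, and those reliabilities weight a standard InfoNCE-style contrastive loss. The function name and the reliability proxy are illustrative choices, not taken from the paper.

```python
import numpy as np

def reliability_weighted_contrastive_loss(img_emb, txt_emb, tau=0.1):
    """Illustrative reliability-weighted contrastive loss (not the paper's exact loss).

    img_emb, txt_emb: (N, D) arrays of paired image/text embeddings.
    Returns (scalar loss, per-pair reliability weights summing to ~1).
    """
    # L2-normalize each embedding so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    cos = img @ txt.T              # (N, N) pairwise cosine similarities
    sim = cos / tau                # temperature-scaled logits

    # Hypothetical reliability proxy: cross-modal agreement of each positive
    # pair, clipped to [0, 1]. Noisy (mismatched) pairs tend to get low weight.
    w = np.clip(np.diag(cos), 0.0, 1.0)
    w = w / (w.sum() + 1e-8)       # normalize so the weights sum to ~1

    # Per-anchor InfoNCE term (image-to-text direction): -log softmax of the
    # positive logit against all candidates in the batch.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    per_pair = logsumexp - np.diag(sim)   # each term is >= 0

    return float((w * per_pair).sum()), w
```

In this sketch, a mislabeled pair whose modalities disagree receives a small weight and thus contributes little to the loss, which is the suppression behavior the abstract attributes to ANRCL; the actual method additionally exploits intra-modal coherence, which this toy proxy omits.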
License
Copyright (c) 2026 Hui Zhi

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.