Cross-Modal Contrastive Representation Learning for Multimedia Retrieval with Noisy Supervision
DOI: https://doi.org/10.4108/eetsis.10757

Abstract
Cross-modal contrastive representation learning has shown great potential for multimedia retrieval tasks by aligning heterogeneous modalities into a shared embedding space. However, its performance often degrades severely in real-world scenarios where supervision signals are noisy, such as mislabeled cross-modal pairs or ambiguous annotations. To address this challenge, we propose Adaptive Noise-Robust Contrastive Learning (ANRCL), a novel framework designed to enhance cross-modal representation robustness under noisy supervision. Specifically, ANRCL introduces an adaptive noise-robust contrastive loss that jointly exploits cross-modal consistency and intra-modal coherence to dynamically reweight training samples according to their estimated reliability. This mechanism effectively suppresses the influence of noisy pairs while reinforcing the contribution of high-confidence pairs. Experimental results on multiple benchmark datasets demonstrate that ANRCL consistently outperforms state-of-the-art methods in noisy supervision settings, achieving significant improvements in retrieval accuracy and robustness without sacrificing computational efficiency.
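The reweighting idea described above can be illustrated with a minimal sketch. The code below is not the paper's actual ANRCL loss; it assumes a simple hypothetical instantiation in which each pair's reliability is its clipped cross-modal cosine agreement, and those reliabilities weight a standard InfoNCE-style contrastive loss. The function name and the reliability proxy are illustrative choices, not taken from the paper.

```python
import numpy as np

def reliability_weighted_contrastive_loss(img_emb, txt_emb, tau=0.1):
    """Illustrative reliability-weighted contrastive loss (not the paper's exact loss).

    img_emb, txt_emb: (N, D) arrays of paired image/text embeddings.
    Returns (scalar loss, per-pair reliability weights summing to ~1).
    """
    # L2-normalize each embedding so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    cos = img @ txt.T              # (N, N) pairwise cosine similarities
    sim = cos / tau                # temperature-scaled logits

    # Hypothetical reliability proxy: cross-modal agreement of each positive
    # pair, clipped to [0, 1]. Noisy (mismatched) pairs tend to get low weight.
    w = np.clip(np.diag(cos), 0.0, 1.0)
    w = w / (w.sum() + 1e-8)       # normalize so the weights sum to ~1

    # Per-anchor InfoNCE term (image-to-text direction): -log softmax of the
    # positive logit against all candidates in the batch.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    per_pair = logsumexp - np.diag(sim)   # each term is >= 0

    return float((w * per_pair).sum()), w
```

In this sketch, a mislabeled pair whose modalities disagree receives a small weight and thus contributes little to the loss, which is the suppression behavior the abstract attributes to ANRCL; the actual method additionally exploits intra-modal coherence, which this toy proxy omits.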
License
Copyright (c) 2026 Hui Zhi

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.