A Review of Real-Time Semantic Segmentation Methods for 2D Data in the Context of Deep Learning
DOI: https://doi.org/10.4108/eetel.8433

Keywords: Image Semantic Segmentation, Deep Learning, Fully Supervised Learning, 2D Data

Abstract
Semantic segmentation is a key research topic in computer vision that aims to assign each pixel of an image to its corresponding category based on the semantic information in the scene. The technology has significant application value in fields such as virtual reality and autonomous driving. With the rapid development of deep learning, and in particular the advent of the fully convolutional network (FCN), image semantic segmentation has made substantial progress. Fully supervised learning, which trains deep models on labeled data, has demonstrated excellent performance on semantic segmentation tasks. This paper provides a comprehensive discussion and analysis of fully supervised semantic segmentation algorithms for 2D data in deep learning. First, it introduces the concept of semantic segmentation, its development, and its application scenarios. Next, it systematically reviews and categorizes current real-time semantic segmentation algorithms, analyzing the characteristics and limitations of each. It then presents a complete evaluation framework for real-time semantic segmentation, including the relevant datasets and evaluation metrics. On this foundation, it identifies several challenges currently facing the field and suggests potential directions for future research. Through this summary and analysis, the paper aims to provide valuable insights for researchers studying image semantic segmentation.
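For concreteness, the accuracy half of the evaluation frameworks surveyed here is almost universally reported as mean Intersection over Union (mIoU), with inference speed reported in frames per second (FPS). The sketch below is a minimal illustration of how a label map is obtained from per-pixel class scores and then scored with mIoU; it assumes integer label maps with no ignore index and is not drawn from any of the surveyed papers.

import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over two integer label maps.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

# A segmentation network emits one score per class per pixel; the
# predicted label map is the per-pixel argmax over the class axis.
num_classes, h, w = 19, 64, 128              # e.g. Cityscapes uses 19 classes
logits = np.random.randn(num_classes, h, w)  # stand-in for network output
pred = logits.argmax(axis=0)                 # (h, w) integer label map
target = np.random.randint(0, num_classes, size=(h, w))
print(f"mIoU: {mean_iou(pred, target, num_classes):.3f}")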
References
[1] Otsu, N. (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1): 62–66.
[2] Meyer, F. and Beucher, S. (1990) Morphological segmentation. Journal of Visual Communication and Image Representation 1(1): 21–46.
[3] Adams, R. and Bischof, L. (1994) Seeded region growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6): 641–647.
[4] Kass, M., Witkin, A. and Terzopoulos, D. (1988) Snakes: Active contour models. International Journal of Computer Vision 1(4): 321–331.
[5] Boykov, Y. and Jolly, M.P. (2001) Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. Proceedings of the IEEE International Conference on Computer Vision (ICCV): 105–112.
[6] Lafferty, J., McCallum, A. and Pereira, F. (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (ICML): 282–289.
[7] Li, S.Z. (2009) Markov random field modeling in image analysis (Springer Science & Business Media).
[8] Long, J., Shelhamer, E. and Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 3431–3440.
[9] Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25.
[10] Ronneberger, O., Fischer, P. and Brox, T. (2015) U-net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., Wells, W.M. and Frangi, A.F. [eds.] Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (Cham: Springer International Publishing): 234–241.
[11] Badrinarayanan, V., Kendall, A. and Cipolla, R. (2017) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12): 2481–2495.
[12] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K.P. and Yuille, A.L. (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR abs/1412.7062. URL https://api.semanticscholar.org/CorpusID:1996665.
[13] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A.L. (2018) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4): 834–848.
[14] Chen, L., Papandreou, G., Schroff, F. and Adam, H. (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. URL http://arxiv.org/abs/1706.05587.
[15] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F. and Adam, H. (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision. URL https://api.semanticscholar.org/CorpusID:3638670.
[16] Zhao, H., Shi, J., Qi, X., Wang, X. and Jia, J. (2017) Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 6230–6239. doi:10.1109/CVPR.2017.660.
[17] Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B. and Belongie, S.J. (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144. URL http://arxiv.org/abs/1612.03144.
[18] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
[19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M. et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929. URL https://arxiv.org/abs/2010.11929.
[20] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y. et al. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 6877–6886. doi:10.1109/CVPR46437.2021.00681.
[21] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M. and Luo, P. (2021) Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34: 12077–12090.
[22] Xu, L., Bennamoun, M., Boussaid, F., Laga, H., Ouyang, W. and Xu, D. (2024) Mctformer+: Multi-class token transformer for weakly supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12): 8380–8395. doi:10.1109/TPAMI.2024.3404422.
[23] Paszke, A., Chaurasia, A., Kim, S. and Culurciello, E. (2016) Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147.
[24] Romera, E., Álvarez, J.M., Bergasa, L.M. and Arroyo, R. (2018) Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19(1): 263–272. doi:10.1109/TITS.2017.2750080.
[25] Xu, Z., Wu, D., Yu, C., Chu, X., Sang, N. and Gao, C. (2024) Sctnet: Single-branch CNN with transformer semantic information for real-time segmentation. URL https://arxiv.org/abs/2312.17071.
[26] Yu, C., Wang, J., Peng, C., Gao, C., Yu, G. and Sang, N. (2018) Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV): 325–341.
[27] Yu, C., Gao, C., Wang, J., Yu, G., Shen, C. and Sang, N. (2020) Bisenet V2: Bilateral network with guided aggregation for real-time semantic segmentation. CoRR abs/2004.02147. URL https://arxiv.org/abs/2004.02147.
[28] Fan, M., Lai, S., Huang, J., Wei, X., Chai, Z., Luo, J. and Wei, X. (2021) Rethinking bisenet for real-time semantic segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 9711–9720. doi:10.1109/CVPR46437.2021.00959.
[29] Poudel, R.P.K., Liwicki, S. and Cipolla, R. (2019) Fast-scnn: Fast semantic segmentation network. CoRR abs/1902.04502. URL http://arxiv.org/abs/1902.04502.
[30] Pan, H., Hong, Y., Sun, W. and Jia, Y. (2023) Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Transactions on Intelligent Transportation Systems 24(3): 3448–3460. doi:10.1109/TITS.2022.3228042.
[31] Wang, J., Gou, C., Wu, Q., Feng, H., Han, J., Ding, E. and Wang, J. (2022) Rtformer: Efficient design for real-time semantic segmentation with transformer. URL https://arxiv.org/abs/2210.07124.
[32] Wan, Q., Huang, Z., Lu, J., Yu, G. and Zhang, L. (2024) Seaformer++: Squeeze-enhanced axial transformer for mobile visual recognition. URL https://arxiv.org/abs/2301.13156.
[33] Zhao, H., Qi, X., Shen, X., Shi, J. and Jia, J. (2018) Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV): 405–420.
[34] Mehta, S., Rastegari, M., Caspi, A., Shapiro, L. and Hajishirzi, H. (2018) Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV): 552–568.
[35] Li, H., Xiong, P., Fan, H. and Sun, J. (2019) Dfanet: Deep feature aggregation for real-time semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 9514–9523. doi:10.1109/CVPR.2019.00975.
[36] Xu, J., Xiong, Z. and Bhattacharyya, S.P. (2023) Pidnet: A real-time semantic segmentation network inspired by pid controllers. URL https://arxiv.org/abs/2206.02066.
[37] Dong, B., Wang, P. and Wang, F. (2023) Head-free lightweight semantic segmentation with linear transformer. URL https://arxiv.org/abs/2301.04648.
[38] Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G. et al. (2022) Topformer: Token pyramid transformer for mobile semantic segmentation. URL https://arxiv.org/abs/2204.05525.
[39] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U. et al. (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 3213–3223.
[40] Brostow, G.J., Shotton, J., Fauqueur, J. and Cipolla, R. (2008) Segmentation and recognition using structure from motion point clouds. In ECCV (1): 44–57.
[41] Brostow, G.J., Fauqueur, J. and Cipolla, R. (2009) Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2): 88–97.
[42] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A. and Torralba, A. (2017) Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 5122–5130. doi:10.1109/CVPR.2017.544.
[43] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A. and Torralba, A. (2018) Semantic understanding of scenes through the ade20k dataset. URL https://arxiv.org/abs/1608.05442.
[44] Caesar, H., Uijlings, J. and Ferrari, V. (2018) Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. and Zisserman, A. (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2): 303–338.
Copyright (c) 2024 Meng Gao, Haifeng Sima

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.