Transformer-Guided Video Inpainting Algorithm Based on Local Spatial-Temporal Joint

Authors

J. Wang, Z. Yang

DOI:

https://doi.org/10.4108/eetel.3156

Keywords:

video inpainting algorithm, flow-guided, attention mechanism, spatial-temporal transformer, Deep Flow Network, video target removal

Abstract

INTRODUCTION: Video inpainting is an important task in computer vision and a key component of many practical applications, including video occlusion removal, traffic monitoring, and old-film restoration. The goal of video inpainting is to synthesize plausible content from the video sequence to fill missing regions while maintaining temporal continuity and spatial consistency.
OBJECTIVES: In previous studies, fast object motion or background motion in complex scenes often causes optical flow estimation to fail, so current video inpainting algorithms do not yet meet the requirements of practical applications. To avoid the problem of optical flow failure, this paper proposes a transformer-guided video inpainting model based on a local spatial-temporal joint.
METHODS: First, exploiting the rich spatial-temporal relationship between local flows, a Local Spatial-Temporal Joint Network (LSTN) consisting of an encoder, a decoder, and a transformer module is designed to roughly inpaint the local corrupted frames, while a Deep Flow Network estimates the local bidirectional corrupted flows. Then, the local corrupted optical flow maps are fed into a Local Flow Completion Network (LFCN) with pseudo-3D convolution and an attention mechanism to obtain a complete set of bidirectional local optical flow maps. Finally, the roughly inpainted local frames and the completed bidirectional local optical flow maps are sent to the spatial-temporal transformer, which outputs the inpainted video frames.
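The three-stage pipeline described above can be sketched at a structural level. This is an illustrative outline only: every function name here (`lstn_coarse_inpaint`, `estimate_local_flows`, `lfcn_complete_flows`, `spatial_temporal_transformer`) is a hypothetical placeholder standing in for one of the paper's trained networks, not the authors' implementation, and frames and flow maps are stand-in dictionaries rather than tensors.

```python
# Structural sketch of the three-stage pipeline; all stage
# functions are hypothetical placeholders for trained networks.

def lstn_coarse_inpaint(frames, masks):
    # Stage 1a: the Local Spatial-Temporal Joint Network
    # (encoder, decoder, transformer module) roughly fills
    # the corrupted regions of the local frames.
    return [dict(f, coarse=True) for f in frames]

def estimate_local_flows(frames):
    # Stage 1b: a Deep Flow Network estimates bidirectional
    # (forward and backward) flow between neighbouring frames.
    fwd = [{"pair": (i, i + 1)} for i in range(len(frames) - 1)]
    bwd = [{"pair": (i + 1, i)} for i in range(len(frames) - 1)]
    return fwd, bwd

def lfcn_complete_flows(fwd, bwd, masks):
    # Stage 2: the Local Flow Completion Network (pseudo-3D
    # convolution + attention) completes both corrupted flow maps.
    complete = lambda flows: [dict(f, complete=True) for f in flows]
    return complete(fwd), complete(bwd)

def spatial_temporal_transformer(coarse_frames, fwd, bwd):
    # Stage 3: the spatial-temporal transformer fuses the coarse
    # frames with the completed bidirectional flows and outputs
    # the final inpainted frames.
    return [dict(f, inpainted=True) for f in coarse_frames]

def inpaint_video(frames, masks):
    coarse = lstn_coarse_inpaint(frames, masks)
    fwd, bwd = estimate_local_flows(frames)
    fwd, bwd = lfcn_complete_flows(fwd, bwd, masks)
    return spatial_temporal_transformer(coarse, fwd, bwd)

frames = [{"frame": i} for i in range(5)]
masks = [None] * 5
result = inpaint_video(frames, masks)
```

The ordering matters: coarse frame content and completed flows are produced independently, and only the final transformer stage combines the two streams.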
RESULTS: Experiments show that the algorithm achieves high-quality results on the video target removal task and improves on quantitative metrics compared with state-of-the-art methods.
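Video inpainting quality is commonly quantified with full-reference metrics such as PSNR and SSIM; the abstract refers to these only as "indicators". As a minimal illustration, assuming 8-bit frames represented as flat pixel lists (a generic reference computation, not the paper's evaluation code), PSNR can be computed as:

```python
import math

def psnr(frame_a, frame_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-size 8-bit
    frames given as flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)

# Example: two 4-pixel "frames" differing by 5 at every pixel.
a = [100, 120, 140, 160]
b = [105, 125, 145, 165]
print(round(psnr(a, b), 2))  # → 34.15
```

Higher PSNR indicates the inpainted frame is closer to the ground truth; in practice it is averaged over all frames of a test video.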
CONCLUSION: The transformer-guided video inpainting algorithm based on a local spatial-temporal joint obtains high-quality optical flow information and high-quality inpainted videos.


Published

15-08-2023

How to Cite

[1]
J. Wang and Z. Yang, "Transformer-Guided Video Inpainting Algorithm Based on Local Spatial-Temporal Joint", EAI Endorsed Trans e-Learn, vol. 8, no. 4, p. e2, Aug. 2023.