CAD-guided 6D pose estimation with deep learning in digital twin for industrial collaborative robot manipulation

Authors

Q. H. Dong

DOI:

https://doi.org/10.4108/airo.9676

Keywords:

pose estimation, computer vision, digital twin, industrial robot

Abstract

6D pose estimation for bin-picking tasks has attracted increasing attention from researchers. CAD model-based methods have been proposed and have demonstrated their effectiveness. However, most existing research relies on point cloud registration from RGB-D cameras, which is often not robust to noise and low-light conditions, leading to degraded point cloud quality and significantly reduced accuracy. Moreover, correct object detection plays a vital role in scenes with multiple objects. Supervised deep learning can address this task, but it typically requires a large amount of labeled data, and in industrial environments sample collection and model retraining are limited. To address these challenges, we introduce an approach that integrates the zero-shot YOLOE detector with the DEFOM-Stereo model. YOLOE detects and localizes objects without requiring object-specific training, while DEFOM-Stereo generates point clouds for CAD model-based pose estimation. Extensive experiments demonstrate that the proposed approach achieves high pose estimation accuracy, which is essential for grasp planning and manipulation tasks in robotics. Furthermore, the proposed approach is applied in a Unity3D-based digital twin, enabling an enhanced virtual representation of a physical pickup target with its estimated pose. The results therefore support more accurate and responsive digital twins for robotics, advancing the development of smart manufacturing systems.
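
As an illustration of the pipeline outlined above, the sketch below assumes a YOLOE detection mask and a DEFOM-Stereo depth map are already available, and uses a generic Open3D ICP registration to stand in for the paper's CAD model-based pose estimation step. The intrinsics, file names, and thresholds are placeholder assumptions for illustration, not values taken from the paper.

# Minimal sketch: detection mask + stereo depth -> scene point cloud ->
# CAD model registration -> 6D pose. All inputs below are illustrative.
import numpy as np
import open3d as o3d

def backproject(depth, K, mask):
    """Back-project masked depth pixels (metres) into a 3D point cloud."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(mask & (depth > 0))      # pixels inside the detection
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.stack([x, y, z], axis=1))
    return pcd

def estimate_pose(cad_mesh_path, depth, K, mask, T_init=np.eye(4)):
    """Register the CAD model to the masked scene cloud; returns a 4x4 pose."""
    scene = backproject(depth, K, mask).voxel_down_sample(voxel_size=0.003)
    model = o3d.io.read_triangle_mesh(cad_mesh_path).sample_points_uniformly(5000)
    # Point-to-point ICP refinement; in practice a coarse global alignment
    # (e.g. RANSAC over FPFH features) would supply T_init.
    reg = o3d.pipelines.registration.registration_icp(
        model, scene, 0.01, T_init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return reg.transformation  # object pose in the camera frame

# Hypothetical usage: `depth` from DEFOM-Stereo (metres), `mask` from a YOLOE
# detection, `K` the left-camera intrinsics, "part.ply" the CAD model.
# T_cam_obj = estimate_pose("part.ply", depth, K, mask)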

References

[1] Zhou, X., Xu, X., Liang, W., Zeng, Z., Shimizu, S., Yang, L.T. and Jin, Q. (2022) Intelligent small object detection for digital twin in smart manufacturing with industrial cyber-physical systems. IEEE Transactions on Industrial Informatics 18(2): 1377–1386. doi:10.1109/TII.2021.3061419.

[2] Zhao, Z.Q., Zheng, P., Xu, S.T. and Wu, X. (2019) Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30(11): 3212–3232. doi:10.1109/TNNLS.2018.2876865.

[3] Liu, J., Sun, W., Yang, H., Zeng, Z., Liu, C., Zheng, J., Liu, X. et al. (2024) Deep learning-based object pose estimation: A comprehensive survey. arXiv:2405.07801. URL https://arxiv.org/abs/2405.07801.

[4] Bochkovskiy, A., Wang, C.Y. and Liao, H.Y.M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934. URL https://arxiv.org/abs/2004.10934.

[5] Wang, A., Liu, L., Chen, H., Lin, Z., Han, J. and Ding, G. (2025) Yoloe: Real-time seeing anything. arXiv:2503.07465. URL https://arxiv.org/abs/2503.07465.

[6] Barricelli, B.R., Casiraghi, E. and Fogli, D. (2019) A survey on digital twin: Definitions, characteristics, applications, and design implications. IEEE Access 7: 167653–167671. doi:10.1109/ACCESS.2019.2953499.

[7] Vidal, J., Lin, C.Y., Llado, X. and Marti, R. (2018) A method for 6d pose estimation of free-form rigid objects using point pair features on range data. Sensors 18(8): 2678. doi:10.3390/s18082678.

[8] Bishop, C.M. and Nasrabadi, N.M. (2006) Pattern recognition and machine learning, 4 (Springer).

[9] Kotsiantis, S.B. (2007) Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering 160(1): 3–24.

[10] Ren, S., He, K., Girshick, R. and Sun, J. (2016) Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv:1506.01497. URL https://arxiv.org/abs/1506.01497.

[11] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C. (2016) Ssd: Single shot multibox detector. Computer Vision – ECCV 2016 9905: 21–37. doi:10.1007/978-3-319-46448-0_2.

[12] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016) You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 779–788. doi:10.1109/CVPR.2016.91.

[13] Jocher, G. and Qiu, J. (2024), Ultralytics yolo11. URL https://github.com/ultralytics/ultralytics.

[14] Chang, J.R. and Chen, Y.S. (2018) Pyramid stereo matching network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 5410–5418. arXiv:1803.08669. URL https://arxiv.org/abs/1803.08669.

[15] Zhang, F., Prisacariu, V., Yang, R. and Torr, P.H. (2019) Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition: 185–194.

[16] Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O. and Birchfield, S. (2025) Foundationstereo: Zero-shot stereo matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2501.09466. URL https://github.com/NVlabs/FoundationStereo.

[17] Jiang, H., Lou, Z., Ding, L., Xu, R., Tan, M., Jiang, W. and Huang, R. (2025) Defom-stereo: Depth foundation model based stereo matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Vogel-Heuser, B., Bi, F., Wittemer, M., Zhao, J., Mayr, A., Fleischer, M., Prinz, T. et al. (2023), Literature collection of digital twin definitions from various domains, https://mediatum.ub.tum.de/doc/1716587.

[19] Moshayedi, A.J. et al. (2023) Integrating virtual reality and robotic operation system (ros) for agv navigation. EAI Endorsed Transactions on AI and Robotics 2. doi:10.4108/airo.v2i1.3181.

[20] Durojaye, A. et al. (2023) Immersive horizons: exploring the transformative power of virtual reality across economic sectors. EAI Endorsed Transactions on AI and Robotics 2. doi:10.4108/airo.v2i1.3392.

[21] Durojaye, A., Kolahdooz, A. and Hajfathalian, A. (2025) Enhancing virtual reality experiences in architectural visualization of an academic environment. EAI Endorsed Transactions on AI and Robotics 4. doi:10.4108/airo.8051.

[22] Dimosthenopoulos, D., Basamakis, F.P., Glykos, C., Bavelos, A.C., Mountzouridis, G. and Makris, S. (2024) A digital twin-based paradigm for programming and control of cooperating robots in reconfigurable production systems. International Journal of Computer Integrated Manufacturing: 1–21. doi:10.1080/0951192X.2024.2428683.

[23] Sujatha, A., Kolahdooz, A., Jafari, M. and Hajfathalian, A. (2025) Simulation and control of the kuka kr6 900ex robot in unity 3d: Advancing industrial automation through virtual environments. EAI Endorsed Transactions on AI and Robotics 4. doi:10.4108/airo.8026.

[24] Moshayedi, A.J. et al. (2022) Deep learning application pros and cons over algorithm. EAI Endorsed Transactions on AI and Robotics 1(1). doi:10.4108/airo.v1i.19.

[25] Dong, Q.H., Nguyen, T.K., Tran, C.C., Pham, T.T., Do, D.T., Nguyen, H.V.K. and Nguyen, Q.C. (2025) Effectiveness of digital twin framework for collaborative robotic manipulation. Journal of Technical Education and Science. In press.

[26] K. Noh, S. Ki Hong, S.M. and Lee, Y. (2024) Enhancing object detection in dense images: Adjustable non-maximum suppression for single-class detection. IEEE Access 12: 130253–130263. doi:10.1109/ACCESS.2024.3459629.

[27] Ha, J. (2023) Probabilistic framework for hand–eye and robot–world calibration ax=yb. IEEE Transactions on Robotics 39(2): 1196–1211. doi:10.1109/TRO.2022.3214350.

[28] Wu, J., Liu, M., Zhu, Y., Zou, Z., Dai, M.Z., Zhang, C., Jiang, Y. et al. (2021) Globally optimal symbolic hand-eye calibration. IEEE/ASME Transactions on Mechatronics 26(3): 1369–1379. doi:10.1109/TMECH.2020.3019306.

[29] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. and Zisserman, A. (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2): 303–338. doi:10.1007/s11263-009-0275-4.

[30] ZJU-IVI (2023), Rt-less_10parts: Reflective textureless dataset, https://github.com/ZJU-IVI/RT-Less_10parts.

Published

02-10-2025

How to Cite

[1]
Q. H. Dong, “CAD-guided 6D pose estimation with deep learning in digital twin for industrial collaborative robot manipulation”, EAI Endorsed Trans AI Robotics, vol. 4, Oct. 2025.