Methods and Strategies for 3D Content Creation Based on 3D Native Methods

Authors

  • Shun Fang, Peking University
  • Xing Feng, Lumverse Inc.
  • Yanna Lv, Lumverse Inc.

DOI:

https://doi.org/10.4108/airo.5320

Keywords:

3D Content Creation, Point-E, 3DGen, Shap-E, 3D Generation

Abstract

The present paper provides a comprehensive overview of three neural network models for 3D content creation, namely Point-E, 3DGen, and Shap-E, covering their overall pipelines, network architectures, and loss functions, as well as their strengths, weaknesses, and opportunities for future research. Point-E is an efficient framework that generates 3D point clouds from complex text prompts by first synthesizing an image with a text-to-image diffusion model and then producing a point cloud conditioned on that image. 3DGen is a novel architecture that couples a variational autoencoder with a diffusion model to produce triplane features for conditional and unconditional 3D object generation. Shap-E is a conditional generative model that directly outputs the parameters of implicit functions, enabling the creation of textured meshes and neural radiance fields. While these models mark significant advances in 3D generation, areas for improvement remain, including sample quality, computational efficiency, and the handling of more complex scenes. Future research could explore tighter integration of these models with other techniques and extend their capabilities to address these challenges.
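
The sketch below is a minimal, hypothetical illustration of the three pipelines summarized in the abstract, written in Python with random-output stub models. Every class name, function name, tensor shape, and parameter count here is an assumption made for illustration only; it is not the actual Point-E, 3DGen, or Shap-E code, and only the data flow follows the description above.

```python
# Hypothetical sketch of the three generation pipelines described in the abstract.
# All classes, functions, and shapes are illustrative assumptions (random-output
# stubs), not the real Point-E, 3DGen, or Shap-E implementations.
import numpy as np


class StubModel:
    """Placeholder 'model' that ignores its inputs and returns a random array of a fixed shape."""

    def __init__(self, out_shape):
        self.out_shape = out_shape

    def __call__(self, *_inputs):
        return np.random.rand(*self.out_shape)


def point_e_style(prompt: str) -> np.ndarray:
    """Point-E style: text -> synthetic image (2D diffusion) -> point cloud (image-conditioned diffusion)."""
    text_to_image = StubModel((64, 64, 3))       # stand-in for the text-to-image diffusion model
    image_to_points = StubModel((4096, 6))       # stand-in for the point-cloud diffusion model (xyz + rgb)
    image = text_to_image(prompt)
    return image_to_points(image)


def threedgen_style(condition) -> np.ndarray:
    """3DGen style: a diffusion model samples triplane latents (conditionally or unconditionally);
    a VAE decoder turns the triplane features into a textured mesh."""
    triplane_diffusion = StubModel((3, 32, 128, 128))   # stand-in: three axis-aligned feature planes
    vae_decoder = StubModel((2048, 3))                  # stand-in: decodes triplanes to mesh vertices
    triplanes = triplane_diffusion(condition)
    return vae_decoder(triplanes)


def shap_e_style(prompt: str) -> np.ndarray:
    """Shap-E style: a conditional generator directly outputs the parameters of an
    implicit function (an MLP) that can be rendered as a NeRF or converted to a mesh."""
    param_generator = StubModel((262_144,))      # stand-in: flattened weights of the implicit MLP
    return param_generator(prompt)


if __name__ == "__main__":
    print("Point-E-style point cloud:", point_e_style("a red chair").shape)
    print("3DGen-style mesh vertices:", threedgen_style(None).shape)
    print("Shap-E-style implicit parameters:", shap_e_style("a red chair").shape)
```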

References

Nichol, A., Jun, H., Dhariwal, P., Mishkin, P. and Chen, M., 2022. Point-E: A system for generating 3d point clouds from complex prompts. arXiv:2212.08751, DOI: 10.48550/arXiv.2212.08751.

Gupta, A., Xiong, W., Nie, Y., Jones, I. and Oğuz, B., 2023. 3DGen: Triplane latent diffusion for textured mesh generation. arXiv:2303.05371, DOI: 10.48550/arXiv.2303.05371.

Jun, H. and Nichol, A., 2023. Shap-E: Generating conditional 3d implicit functions. arXiv:2305.02463, DOI: 10.48550/arXiv.2305.02463.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M. and Sutskever, I., 2021, July. Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821-8831). PMLR, DOI: 10.48550/arXiv.2102.12092.

Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B. and Karras, T., 2022. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv:2211.01324, DOI:10.48550/arXiv.2211.01324.

Feng, Z., Zhang, Z., Yu, X., Fang, Y., Li, L., Chen, X., Lu, Y., Liu, J., Yin, W., Feng, S. and Sun, Y., 2023. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10135-10145), DOI:10.1109/CVPR52729.2023.00977.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T. and Ho, J., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, pp.36479-36494, DOI:10.48550/arXiv.2205.11487.

Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K. and Hutchinson, B., 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, DOI:10.48550/arXiv.2206.10789.

Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022, October. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (pp. 89-106). Cham: Springer Nature Switzerland, DOI:10.48550/arXiv.2203.13131.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. and Chen, M., 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, DOI:10.48550/arXiv.2204.06125.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. and Chen, M., 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, DOI: 10.48550/arXiv.2112.10741.

Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H. and Tang, J., 2021. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, pp.19822-19835, DOI: 10.48550/arXiv.2105.13290.

Sanghi, A., Fu, R., Liu, V., Willis, K.D., Shayani, H., Khasahmadi, A.H., Sridhar, S. and Ritchie, D., 2023. CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18339-18348), DOI:10.1109/CVPR52729.2023.01759.

Sanghi, A., Chu, H., Lambourne, J.G., Wang, Y., Cheng, C.Y., Fumero, M. and Malekshan, K.R., 2022. Clip-forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18603-18613), DOI:10.1109/CVPR52688.2022.01805.

Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y. and Lin, T.Y., 2023. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 300-309), DOI:10.1109/CVPR52729.2023.00037.

Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988, DOI:10.48550/arXiv.2209.14988.

Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P. and Poole, B., 2022. Zero-shot text-guided object generation with dream fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 857-866), DOI:10.1109/CVPR52688.2022.00094.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M. and Fleet, D.J., 2022. Video diffusion models. arXiv:2204.03458, DOI:10.48550/arXiv.2204.03458.

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O. and Parikh, D., 2022. Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792, DOI: 10.48550/arXiv.2209.14792.

Hong, W., Ding, M., Zheng, W., Liu, X. and Tang, J., 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868, DOI:10.48550/arXiv.2205.15868.

Li, Y., Yang, G., Zhu, Y., Ding, X., & Gong, R. (2018). Probability model-based early merge mode decision for dependent views coding in 3D-HEVC. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4), 1-15.

Yang, F., Chen, K., Yu, B., & Fang, D. (2014). A relaxed fixed point method for a mean curvature-based denoising model. Optimization Methods and Software, 29(2), 274-285.

Wang, F., Li, Z. S., & Liao, G. P. (2014). Multifractal detrended fluctuation analysis for image texture feature representation. International Journal of Pattern Recognition and Artificial Intelligence, 28(03), 1455005.

Yang, X., Lei, K., Peng, S., Cao, X., & Gao, X. (2018). Analytical expressions for the probability of false-alarm and decision threshold of Hadamard ratio detector in non-asymptotic scenarios. IEEE Communications Letters, 22(5), 1018-1021.

Peng, C., & Liao, B. (2023). Heavy-head sampling for fast imitation learning of machine learning based combinatorial auction solver. Neural Processing Letters, 55(1), 631-644.

Jin, J. (2016). Multi-function current differencing cascaded transconductance amplifier (MCDCTA) and its application to current-mode multiphase sinusoidal oscillator. Wireless Personal Communications, 86, 367-383.

Lu, H., Jin, L., Luo, X., Liao, B., Guo, D., & Xiao, L. (2019). RNN for solving perturbed time-varying underdetermined linear system with double bound limits on residual errors and state variables. IEEE Transactions on Industrial Informatics, 15(11), 5931-5942.

Li, Z., Li, S., & Luo, X. (2021). An overview of calibration technology of industrial robots. IEEE/CAA Journal of Automatica Sinica, 8(1), 23-36.

Li, Z., Li, S., Bamasag, O. O., Alhothali, A., & Luo, X. (2022). Diversified regularization enhanced training for effective manipulator calibration. IEEE Transactions on Neural Networks and Learning Systems.

Khalid, N., Xie, T., Belilovsky, E., & Popa, T. (2023). Clip-mesh: Generating textured meshes from text using pretrained image-text models (Doctoral dissertation, Concordia University Montréal, Québec, Canada).

Park, D. H., Azadi, S., Liu, X., Darrell, T., & Rohrbach, A. (2021, June). Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).

Zhang, B., Nießner, M., & Wonka, P. (2022). 3dilg: Irregular latent grids for 3d generative modeling. Advances in Neural Information Processing Systems, 35, 21871-21885.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99-106.

Published

27-05-2024

How to Cite

[1] S. Fang, X. Feng, and Y. Lv, “Methods and Strategies for 3D Content Creation Based on 3D Native Methods”, EAI Endorsed Trans AI Robotics, vol. 3, May 2024.