
Comparison of convolutional networks and transformers for generating 3D models via binary space partitioning from a single object image
D.N. Gribanov 1, I.A. Kilbas 1, A.V. Mukhin 1, R.A. Paringer 1, A.V. Kupriyanov 1,2

Samara National Research University,
Moskovskoye Shosse 34, Samara, 443086, Russia;
Image Processing Systems Institute, NRC "Kurchatov Institute",
Molodogvardeyskaya 151, Samara, 443001, Russia


DOI: 10.18287/COJ1863

Pages: 1247-1252.

Full text of article: English language.

Abstract:
This study explores the use of a transformer architecture as the image encoder for the task of 3D mesh generation from a single image. Such tasks are traditionally performed by models based on an autoencoder architecture, in which an encoder produces a latent representation that a decoder subsequently converts into a 3D model. For image input, the ResNet18 convolutional network is a commonly used encoder. In this paper, we investigate replacing the convolutional network with a transformer-based encoder while using binary space partitioning (BSP) for 3D object generation. Our experiments demonstrate that a transformer-based architecture, specifically the Compact Convolutional Transformer (CCT), can achieve performance comparable to its convolutional counterpart and exceed it both in quantitative metrics and visual quality. The best CCT-based model achieves a Chamfer Distance (CD) of 1.59 and a Light Field Distance (LFD) of 3907, whereas the convolutional variant attains a CD of 1.64 and an LFD of 3981. The CCT-based model also demonstrates superior 3D reconstruction quality on test samples. Additionally, the transformer model requires four times fewer parameters to achieve these results, though its computational cost is roughly twice as high in terms of multiply-accumulate operations (MACs). These findings indicate that the transformer-based model is more parameter-efficient and can achieve superior results compared to traditional convolutional networks in single-view reconstruction tasks.
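The Chamfer Distance quoted in the abstract measures how closely a reconstructed point set matches the reference shape. A minimal NumPy sketch of the common symmetric, squared-distance form is shown below; this is a generic definition for illustration only, and the paper's exact sampling density and scaling of the reported CD values may differ:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric squared Chamfer Distance between point sets a (N, 3) and b (M, 3).

    For every point in one set, take the squared distance to its nearest
    neighbour in the other set; average each direction and sum the two terms.
    """
    # Pairwise squared distances between all points, shape (N, M)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Identical point clouds have zero Chamfer Distance
pts = np.random.default_rng(0).normal(size=(128, 3))
print(chamfer_distance(pts, pts))  # → 0.0
```

In practice the metric is evaluated on points sampled from the surfaces of the predicted and ground-truth meshes; efficient implementations replace the dense pairwise matrix with a k-d tree nearest-neighbour query.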

Keywords:
computer vision, 3D model, neural network, transformer, convolutional network, vector representation, latent vector.
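At the core of the transformer encoder compared in this work is scaled dot-product attention [12]. The NumPy sketch below illustrates the operation in its generic single-head form; it is an illustration of the mechanism, not the authors' CCT implementation, and the sequence length and dimensions are arbitrary:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head attention: q, k of shape (T, d); v of shape (T, dv)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, T) similarity matrix
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ v                       # weighted mixture of value vectors

rng = np.random.default_rng(0)
T, d = 4, 8  # 4 tokens, 8-dimensional embeddings (arbitrary example sizes)
out = scaled_dot_product_attention(rng.normal(size=(T, d)),
                                   rng.normal(size=(T, d)),
                                   rng.normal(size=(T, d)))
print(out.shape)  # → (4, 8)
```

In a vision transformer such as CCT, the tokens are image patches embedded by a small convolutional stem, and several such attention layers followed by pooling produce the latent vector passed to the BSP decoder.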

Citation:
Gribanov DN, Kilbas IA, Mukhin AV, Paringer RA, Kupriyanov AV. Comparison of convolutional networks and transformers for generating 3D models via binary space partitioning from a single object image. Computer Optics 2025; 49(6): 1247-1252. DOI: 10.18287/COJ1863.

Acknowledgements:
The research was carried out within the state assignment theme FSSS-2023-0006.

References:

  1. Li A, Zhu Z, Wei M. GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors. arXiv Preprint. 2025. Source: <https://arxiv.org/abs/2502.19896>. DOI: 10.48550/arXiv.2502.19896.
  2. Mo S, Xie E, Chu R, Hong L, Niessner M, Li Z. DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation. Neural Information Processing Systems 2023; 36: 67960-67971.
  3. Li Y, Dou Y, Chen X, Ni B, Sun Y, Liu Y, Wang F. 3DQD: Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023: 16784-16794. DOI: 10.1109/CVPR52729.2023.01610.
  4. Gribanov D, Kilbas I, Mukhin A, Paringer R. Effect of Encoder Architectures on the Generation of Vector Representations for Modeling 3D Objects via the Space of Convex Sets. X International Conference on Information Technology and Nanotechnology (ITNT) 2024: 1-7. DOI: 10.1109/itnt60778.2024.10582346.
  5. Feng Y, Tagliasacchi A, Zhang H. BSP-Net: Generating Compact Meshes via Binary Space Partitioning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020: 42-51. DOI: 10.1109/CVPR42600.2020.00012.
  6. Choy CB, Xu D, Gwak JY, Chen K, Savarese S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. European Conference on Computer Vision (ECCV) 2016: 628-644. DOI: 10.1007/978-3-319-46484-8_38.
  7. Chen R, Yin X, Yang Y, Tong C. Multi-view Pixel2Mesh++: 3D reconstruction via Pixel2Mesh with more images. The Visual Computer. 2022; 39: 5153-5166. DOI: 10.1007/s00371-022-02651-7.
  8. He K, Zhang X, Ren S, Sun J. Identity Mappings in Deep Residual Networks. European Conference on Computer Vision (ECCV) 2016: 630-645. DOI: 10.1007/978-3-319-46493-0_38.
  9. Mittal P, Cheng Y, Singh M, Tulsiani S. AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022: 306-315. DOI: 10.1109/cvpr52688.2022.00040.
  10. Hui KC, Li R, Hu J, Fu C. Neural Template: Topology-aware Reconstruction and Disentangled Generation of 3D Meshes. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022: 18551-18561. DOI: 10.1109/CVPR52688.2022.01802.
  11. Gupta K, Chandraker M. Neural Mesh Flow: 3D Manifold Mesh Generation via Diffeomorphic Flows. Neural Information Processing Systems. 2020; 33: 1747-1758.
  12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser Ł, Polosukhin I. Attention Is All You Need. Neural Information Processing Systems. 2017; 30: 5998-6008.
  13. Koner R, Jain G, Jain P, Tresp V, Paul S. LookupViT: Compressing visual information to a limited number of tokens. European Conference on Computer Vision 2024: 322-337. DOI: 10.1007/978-3-031-73016-0_19.
  14. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations 2021.
  15. Hassani A, Walton S, Shah N, Abuduweili A, Li J, Shi H. Escaping the Big Data Paradigm with Compact Transformers. arXiv Preprint. 2021. Source: <https://arxiv.org/abs/2104.05704>. DOI: 10.48550/arXiv.2104.05704.
  16. Chandra AA, Tünnermann L, Löfstedt T, Grätz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife. 2023; 12. DOI: 10.7554/elife.82819.
  17. Zhou D, Kang B, Jin X, Yang L, Lian X, Jiang Z, Hou Q, Feng J. DeepViT: Towards Deeper Vision Transformer. arXiv Preprint. 2021. Source: <https://arxiv.org/abs/2103.11886>. DOI: 10.48550/arXiv.2103.11886.
  18. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang Z, Tay F, Feng J, Yan S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. IEEE/CVF International Conference on Computer Vision (ICCV) 2021: 538-547. DOI: 10.1109/ICCV48922.2021.00060.
  19. Krizhevsky A. Learning Multiple Layers of Features from Tiny Images. 2009.
  20. Chang AX, Funkhouser T, Guibas L, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv Preprint. 2015. Source: <https://arxiv.org/abs/1512.03012>. DOI: 10.48550/arXiv.1512.03012.
  21. Chen D, Tian X, Shen Y, Ouhyoung M. On Visual Similarity Based 3D Model Retrieval. Computer Graphics Forum. 2003; 22(3): 223-232. DOI: 10.1111/1467-8659.00669.
  22. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning 2021. 139: 10347-10357.
  23. Lee SH, Lee S, Song BC. Vision Transformer for Small-Size Datasets. arXiv Preprint. 2021. Source: <https://arxiv.org/abs/2112.13492>. DOI: 10.48550/arXiv.2112.13492.
  24. Babukhin DV, Reutov AA, Sych DV. Study of image reconstruction efficiency in single-pixel imaging method using generative adversarial networks. Computer Optics 2025; 49(5): 818-825. DOI: 10.18287/2412-6179-CO-1526.
  25. Reutov AA, Babukhin DV, Sych DV. Object classification using a single-pixel camera and neural networks. Computer Optics 2025; 49(3): 517-524. DOI: 10.18287/2412-6179-CO-1538.
  26. Ndukwe IK, Yunovidov D, Bahrami MR, Mazzara M, Olugbade TO. Quality inspection of fertilizer granules using computer vision - a review. Computer Optics 2025; 49(1): 84-94. DOI: 10.18287/2412-6179-CO-1458.
  27. Shadrin D, Illarionova S, Kasatov R, Akimenkova M, Rudensky G, Erhan E. Weed detection on embedded systems using computer vision algorithms. Computer Optics 2025; 49(1): 103-111. DOI: 10.18287/2412-6179-CO-1454.
  28. Suetin MN, Dementiev VE, Tashlinskii AG, Magdeev RG. Methodology for detecting and assessing the dynamics of defects in engineering structures by processing images from an unmanned aerial vehicle. Computer Optics 2024; 48(5): 762-771. DOI: 10.18287/2412-6179-CO-1438.
  29. Belkin IV, Abrameko AA, Bezuglyi VD, Yudin DA. Localization of mobile robot in prior 3D LiDAR maps using stereo image sequence. Computer Optics 2024; 48(3): 406-417. DOI: 10.18287/2412-6179-CO-1369.
  30. Zagitov A, Chebotareva E, Toschev A, Magid E. Comparative analysis of neural network models performance on low-power devices for a real-time object detection task. Computer Optics 2024; 48(2): 242-252. DOI: 10.18287/2412-6179-CO-1343.

© 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru ; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20