Effective extraction of textual data from document images using transformer architecture of deep neural networks
V.A. Vykhodtseva¹, G.V. Popova², Y.A. Vais²

¹Kazakh-American Free University, 070000, Kazakhstan, Ust-Kamenogorsk, 76 M. Gorky Street;
²D. Serikbaev East Kazakhstan State Technical University, 070004, Kazakhstan, Ust-Kamenogorsk, 19 Serikbayev Street

DOI: 10.18287/COJ1744

Article ID: 1744

Abstract:
In modern digital document management, automating document processing, particularly in accounting, is a key factor in making business processes more efficient. However, automated document processing faces a range of specific challenges arising from both the linguistic and the structural characteristics of the data. Traditional text processing methods that rely on classical optical character recognition (OCR) algorithms do not extract data from document images accurately enough, which limits their use in automated accounting systems. These challenges are particularly evident when processing documents with complex structures, specific element placement, and specific text content. This paper proposes a solution based on a transformer neural network architecture specifically adapted to document images. In this study, the transformer model is trained on a dataset of accounting document images with varying element placement and Cyrillic text. The focus on Cyrillic text is particularly relevant because research in this area has concentrated predominantly on documents in English or other Latin-script languages. The article reports the training results, evaluated with specialized performance metrics. At the final stage of training, the confidence loss was 0.156, indicating that the model effectively minimizes prediction error. The precision of 0.868 shows that a relatively high share of the model's positive predictions are correct. The recall of 0.905 indicates that the model identifies most of the positive examples. The F1 score of 0.886 reflects a good balance between precision and recall. The accuracy of 0.96798 indicates that the model's predictions are highly accurate overall.
The use of the transformer model significantly improves the accuracy of extracting key information, such as date, number, and organization name, from accounting documents containing Cyrillic text. The findings of this study affirm the potential of this method for implementation in automated accounting systems, contributing to enhanced efficiency and precision in processing accounting documents.
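
The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below assumes that the reported value of 0.868 is the precision and that F1 is the standard harmonic mean of precision and recall; the function name `f1_score` is a hypothetical helper introduced here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract; treating 0.868 as precision is an assumption.
precision = 0.868
recall = 0.905

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")
```

Under these assumptions the computed value rounds to the F1 of 0.886 given in the abstract, which suggests the three figures are mutually consistent.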

Keywords:
attention mechanism, deep learning, document intelligence, neural network, optical character recognition, transformer.

Citation:
Vykhodtseva VA, Popova GV, Vais YA. Effective extraction of textual data from document images using transformer architecture of deep neural networks. Computer Optics 2026; 50(2): 1744. DOI: 10.18287/COJ1744.
