  
Many heads but one brain: FusionBrain – a single multimodal multitask architecture and a competition
 D.D. Bakshandaeva 1,4, D.V. Dimitrov 1,2,6, V.S. Arkhipkin 1, A.V. Shonenkov 2, M.S. Potanin 2, D.K. Karachev 2, A.V. Kuznetsov 1,2,3, A.D. Voronov 2, A.A. Petiushko 2, V.F. Davydova 1, E.V. Tutubalina 1,2,5
1 Sber AI, 121170, Moscow, Russia, Kutuzovsky prospekt, 32, building 2;
2 Artificial Intelligence Research Institute, 105064, Moscow, Russia, Nizhniy Susalnyy pereulok, 5;
3 Samara National Research University, 443086, Samara, Russia, Moskovskoye Shosse, 34;
4 University of Helsinki, 00014, Helsinki, Finland, Yliopistonkatu, 3;
5 National Research University Higher School of Economics, 109028, Moscow, Russia, Pokrovsky Bulvar, 11;
6 Moscow State University, 119991, Moscow, Russia, Kolmogorova, 1
 
PDF, 1489 kB
DOI: 10.18287/2412-6179-CO-1220
Pages: 185-195.
Full text of article: Russian language.
 
Abstract:
Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called FusionBrain, the first competition aimed at creating a universal architecture that can process different modalities (in this case, images, texts, and code) and solve multiple vision-and-language tasks. The FusionBrain Challenge combines the following tasks: Code2code Translation, Handwritten Text Recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal multitask architecture as a baseline solution, with a frozen foundation model at its core, trained in Fusion mode as well as in Single-task mode. The proposed Fusion approach proves to be competitive with, and more energy-efficient than, the task-specific approach.
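To make the baseline design concrete, below is a minimal PyTorch sketch of the "many heads but one brain" pattern described in the abstract: a single frozen transformer core shared across tasks, with only small task-specific input projections and output heads being trained. All module sizes, feature dimensions, and task names here are illustrative assumptions, not the authors' exact baseline (which builds on a pretrained foundation model; a randomly initialised encoder stands in for it here).

import torch
import torch.nn as nn

class FusionStyleModel(nn.Module):
    """Illustrative sketch: one frozen shared core, several trainable heads."""

    def __init__(self, d_model=256, vocab_size=1000, num_vqa_answers=100):
        super().__init__()
        # Shared core ("one brain"): frozen, as in the Fusion-mode baseline.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=4)
        for p in self.core.parameters():
            p.requires_grad = False  # only the heads below are trained

        # Trainable task-specific heads ("many heads").
        self.text_embed = nn.Embedding(vocab_size, d_model)  # code/text input
        self.image_proj = nn.Linear(2048, d_model)           # image features
        self.lm_head = nn.Linear(d_model, vocab_size)        # token generation
        self.vqa_head = nn.Linear(d_model, num_vqa_answers)  # answer classes
        self.detection_head = nn.Linear(d_model, 4)          # box regression

    def forward(self, task, tokens=None, image_feats=None):
        if task in ("code2code", "htr"):
            # Text-only tasks: embed tokens, run the core, predict tokens.
            h = self.core(self.text_embed(tokens))
            return self.lm_head(h)
        if task == "vqa":
            # Concatenate projected image features with question tokens.
            x = torch.cat([self.image_proj(image_feats),
                           self.text_embed(tokens)], dim=1)
            return self.vqa_head(self.core(x).mean(dim=1))
        if task == "zsod":
            # Zero-shot detection: predict normalized boxes per position.
            x = torch.cat([self.image_proj(image_feats),
                           self.text_embed(tokens)], dim=1)
            return self.detection_head(self.core(x)).sigmoid()
        raise ValueError(f"unknown task: {task}")

model = FusionStyleModel()
logits = model("vqa", tokens=torch.randint(0, 1000, (2, 16)),
               image_feats=torch.randn(2, 10, 2048))
print(logits.shape)  # torch.Size([2, 100])

In this setup, Fusion mode corresponds to training all heads jointly against the shared frozen core, while Single-task mode would train one head in isolation; the shared-core design is what makes the Fusion approach more parameter- and energy-efficient than maintaining a separate full model per task.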
Keywords:
multimodality, multitask, bilinguality, foundation models, FusionBrain challenge.
Citation:
Bakshandaeva D, Dimitrov D, Arkhipkin V, Shonenkov A, Potanin M, Karachev D, Kuznetsov A, Voronov A, Petiushko A, Davydova V, Tutubalina E. Many heads but one brain: FusionBrain – a single multimodal multitask architecture and a competition. Computer Optics 2023; 47(1): 185-195. DOI: 10.18287/2412-6179-CO-1220.
Acknowledgements:
We would like to thank Sber and SberCloud for granting GPU resources to us for experimenting with different architectures and to the participants for training their models, and for supporting the FusionBrain Challenge in general.
References:
- Sludnova AA, Shutko VV, Gaidel AV, Zelter PM, Kapishnikov AV, Nikonorov AV. Identification of pathological changes in the lungs using an analysis of radiological reports and tomographic images. Computer Optics 2021; 45(2): 261-266. DOI: 10.18287/2412-6179-CO-793.
- Liu X, He P, Chen W, Gao J. Multi-task deep neural networks for natural language understanding. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1901.11504>.
 
- Hu R, Singh A. Unit: Multimodal  multitask learning with a unified transformer. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.10772>.
 
- Liang PP, Liu Z, Zadeh AB,  Morency L-P. Multimodal language analysis with recurrent multistage fusion. Proc  2018 Conf on Empirical Methods in Natural Language Processing 2018: 150-161.
 
- Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1908.03557>.
 
- Das A, Wahi JS, Li S. Detecting  hate speech in multi-modal memes. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2012.14891>.
 
- Savchenko A, Alekseev A, Kwon  S, Tutubalina E, Myasnikov E, Nikolenko S. Ad lingua: Text classification improves  symbolism prediction in image advertisements. Proc 28th Int Conf on  Computational Linguistics 2020: 1886-1892. DOI:  10.18653/v1/2020.coling-main.171.
 
- Jaegle A, Gimeno F, Brock A, Zisserman A, Vinyals O, Carreira J. Perceiver: General perception with iterative attention. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.03206>.
 
- Jaegle A, Borgeaud S, Alayrac J-B, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2107.14795>.
 
- Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. Preprint. 2019. Source: <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>.
 
- Alayrac J-B, Donahue J, Luc P,  Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R,  Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J,  Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R,  Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for  few-shot learning. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2204.14198>.
 
- Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, de Las Casas D, Hendricks LA, Welbl J, Clark A, Hennigan T, Noland E, Millican K, van den Driessche G, Damoc B, Guy A, Osindero S, Simonyan K, Elsen E, Rae JW, Vinyals O, Sifre L. Training compute-optimal large language models. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2203.15556>.
 
- Nagrani A, Yang S, Arnab A, Jansen A, Schmid C, Sun C. Attention  bottlenecks for multimodal fusion. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2107.00135>.
 
- Komkov S, Dzabraev M, Petiushko A. Mutual modality learning for  video action classification. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2011.02543>.
 
- Lu K, Grover A, Abbeel P, Mordatch I. Pretrained transformers as  universal computation engines. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.05247>.
 
- Houlsby  N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, Attariyan  M, Gelly S. Parameter-efficient transfer learning for nlp. Proc 36th Int Conf on Machine Learning (PMLR '19)  2019: 2790-2799.
 
- Pfeiffer J, Kamath A, Rücklé A, Cho K, Gurevych I. Adapterfusion: Non-destructive task  composition for transfer learning. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2005.00247>.
 
- Tay Y, Zhao Z, Bahri D, Metzler D, Juan D-C. Hypergrid transformers: Towards a single model for multiple tasks. Int Conf on Learning Representations (ICLR 2021) 2021: 1-14. Source: <https://openreview.net/pdf?id=hiq1rHO8pNT>.
 
- Pilault J, Elhattami A, Pal C. Conditionally adaptive multi-task  learning: Improving transfer learning in nlp  using fewer parameters & less data. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2009.09139>.
 
- Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron  G, Gimenez M, Sulsky Y, Kay J, Springenberg JT, Eccles T, Bruce J, Razavi A,  Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, de Freitas N. A  generalist agent. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2205.06175>.
 
- Maillard J, Karpukhin V, Petroni F, Yih W-t, Ouguz B, Stoyanov V, Ghosh G. Multi-task retrieval for knowledge-intensive tasks. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2101.00117>.
 
- Gabeur V, Sun C, Alahari K, Schmid C. Multi-modal transformer for  video retrieval. 16th European Conf on Computer Vision (ECCV 2020) 2020:  214-229.
 
- Dzabraev M, Kalashnikov M, Komkov S, Petiushko A. Mdmmt: Multidomain multimodal  transformer for video retrieval. Proc IEEE/CVF Conf on Computer Vision and  Pattern Recognition 2021: 3354-3363.
 
- Ahmad WU, Tushar MdGR, Chakraborty S, Chang K-W. AVATAR: A parallel corpus for java-python program translation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.11590>.
 
- Python Tokenizer. 2021. Source: <https://docs.python.org/3/library/tokenize.html>.
 
- Puri R, Kung DS, Janssen G, Zhang W, Domeniconi G, Zolotov V, Dolby  J, Chen J, Choudhury M, Decker L, Thost V, Buratti L, Pujar S, Ramji S, Finkler  U, Malaika S, Reiss F. Codenet: A  large-scale ai for code dataset for learning a diversity of coding tasks. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2105.12655>.
 
- Ren S, Guo D, Lu S, Zhou L, Liu S, Tang D, Sundaresan N, Zhou M,  Blanco A, Ma S. Codebleu: a  method for automatic evaluation of code synthesis. arXiv preprint. 2020.  Source: <https://arxiv.org/abs/2009.10297>.
 
- IAM handwriting database. 2021. Source: <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>.
 
- HTRdataset. 2021. Source: <https://github.com/sberbank-ai/htrdatasets>.
 
- Gu X, Lin T-Y, Kuo W, Cui Y. Open-vocabulary object detection via  vision and language knowledge distillation. arXiv preprint. 2021. Source:  <https://arxiv.org/abs/2104.13921>.
 
- Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S,  Kalantidis Y, Li L-J, Shamma DA, Bernstein MS, Li F-F. Visual genome:  Connecting language and vision using crowdsourced dense image annotations.  arXiv preprint. 2016. Source: <https://arxiv.org/abs/1602.07332>.
 
- Thomee B, Shamma DA, Friedland G, Elizalde B,  Ni K, Poland D, Borth D,  Li L-J. Yfcc100m: The new data in multimedia research. arXiv preprint.  2015. Source: <https://arxiv.org/abs/1503.01817>.
 
- FusionBrain Concept. 2021. Source: <https://colab.research.google.com/drive/1YAkxWG0dRKPtqy9CZxFPvCNCCXvMGr65?usp=sharing>.
 
- Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and  risks of foundation models. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.07258>.
 
- Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep  bidirectional transformers for language understanding. arXiv preprint. 2019.  Source: <https://arxiv.org/abs/1810.04805>.
 
- Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O,  Stoyanov V, Zettlemoyer L. Bart:  Denoising sequence-to-sequence pre-training for natural language generation,  translation, and comprehension. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1910.13461>.
 
- Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y,  Li W, Liu PJ. Exploring the limits of transfer learning with a unified  text-to-text transformer. arXiv preprint. 2020. Source: <https://arxiv.org/abs/1910.10683>.
 
- Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P,  Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger  G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M,  Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford  A, Sutskever I, Amodei D. Language models are few-shot learners. arXiv  preprint. 2020. Source: <https://arxiv.org/abs/2005.14165>.
 
- Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G,  Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable  visual models from natural language supervision. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.00020>.
 
- Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M,  Sutskever I. Zero-shot text-to-image generation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.12092>.
 
- Rubinstein R, Davidson W. The cross-entropy method for combinatorial  and continuous optimization. Methodol Comput Appl Probab 1999; 1: 127-190.
 
- Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proc 23rd Int Conf on Machine Learning (ICML'06) 2006: 369-376.
 
- Shonenkov A, Karachev D, Novopoltsev M, Potanin M, Dimitrov D. Stackmix and blot augmentations for handwritten  text recognition. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.11667>.
 
- de Buy Wenniger GM, Schomaker L, Way A. No padding please: Efficient neural handwriting recognition. 2019 Int Conf on Document Analysis and Recognition (ICDAR) 2019: 355-362.
 
- Michael J, Labahn R, Grüning T, Zollner J. Evaluating sequence-to-sequence models for handwritten text recognition. 2019 Int Conf on Document Analysis and Recognition (ICDAR) 2019: 1286-1293.
 
- Potanin M, Dimitrov D, Shonenkov A, Bataev V, Karachev D,  Novopoltsev M. Digital peter:  Dataset, competition and handwriting recognition methods. arXiv preprint. 2021.  Source: <https://arxiv.org/abs/2103.09354>.
 
- He K, Zhang X, Ren S, Sun J. Deep residual learning for image  recognition. arXiv preprint. 2015. Source: <https://arxiv.org/abs/1512.03385>.
 
- Kamath A, Singh M, LeCun Y, Synnaeve G, Misra I, Carion N. Mdetr – modulated detection for  end-to-end multi-modal understanding. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2104.12763>.
 
- Rezatofighi  H, Tsoi N, Gwak JY, Sadeghian A, Reid I, Savarese S. Generalized intersection  over union: A metric and a loss for bounding box regression. 2019 IEEE/CVF Conf  on Computer Vision and Pattern Recognition (CVPR) 2019: 658-666.
 
- Marti UV, Bunke H. The iam-database:  an english sentence database for  offline handwriting recognition. Int J Doc Anal Recognit 2002; 5: 39-46.
 
- Goyal  Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the V in VQA matter:  Elevating the role of image understanding in visual question answering. arXiv  preprint. 2016. Source: <https://arxiv.org/abs/1612.00837>.
 
- Ahmad WU, Chakraborty S, Ray B, Chang K-W. Unified pre-training for  program understanding and generation. arXiv preprint. 2021. Source:  <https://arxiv.org/abs/2103.06333>.
 
- Chaudhary K, Bali R. Easter2.0: Improving convolutional models for  handwritten text recognition. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2205.14879?context=cs.AI>.
 
- Henderson P, Hu J, Romoff J, Brunskill E, Jurafsky D, Pineau J. Towards the systematic reporting of the energy and carbon footprints of machine learning. J Mach Learn Res 2020; 21(248): 1-43.
 
- Patterson D, Gonzalez J, Le Q, Liang C, Munguia L-M, Rothchild D, So  D, Texier M, Dean J. Carbon emissions and large neural network training. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2104.10350>.
 
- Lacoste A, Luccioni A, Schmidt V, Dandres T. Quantifying the carbon  emissions of machine learning. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1910.09700>.
 
- Cowls J, Tsamados A, Taddeo M, Floridi L. The AI gambit – leveraging artificial intelligence to  combat climate change: Opportunities, challenges, and recommendations. AI Soc  2021; 18: 1-25.
 
- FusionBrain challenge. 2021. Source: <https://github.com/ai-forever/fusion_brain_aij2021>.
 
- DS Works. 2021. Source: <https://dsworks.ru/champ/fb5778a8-94e9-46de-8bad-aa2c83a755fb>.
 
- Cho J, Lei J, Tan H, Bansal M. Unifying vision-and-language tasks  via text generation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.02779>.
 
- Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2012.12877>.
  
- Li M, Lv T, Chen J, Cui L, Lu Y, Florencio D, Zhang  C, Li Z, Wei F. Trocr:  Transformer-based optical character recognition with pre-trained models. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2109.10282>.
         
         
  