  
Many heads but one brain: FusionBrain – a single multimodal multitask architecture and a competition
 D.D. Bakshandaeva 1,4, D.V. Dimitrov 1,2,6, V.S. Arkhipkin 1, A.V. Shonenkov 2, M.S. Potanin 2, D.K. Karachev 2, A.V. Kuznetsov 1,2,3, A.D. Voronov 2, A.A. Petiushko 2, V.F. Davydova 1, E.V. Tutubalina 1,2,5
1 Sber AI, 121170, Moscow, Russia, Kutuzovsky prospekt, 32, building 2;
2 Artificial Intelligence Research Institute, 105064, Moscow, Russia, Nizhniy Susalnyy pereulok, 5;
3 Samara National Research University, 443086, Samara, Russia, Moskovskoye Shosse, 34;
4 University of Helsinki, 00014, Helsinki, Finland, Yliopistonkatu, 3;
5 National Research University Higher School of Economics, 109028, Moscow, Russia, Pokrovsky Bulvar, 11;
6 Moscow State University, 119991, Moscow, Russia, Kolmogorova, 1
 
PDF, 1489 kB
DOI: 10.18287/2412-6179-CO-1220
Pages: 185-195.
Full text of article: Russian language.
 
Abstract:
Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called FusionBrain, the first competition aimed at creating a universal architecture that can process different modalities (in this case, images, texts, and code) and solve multiple vision-and-language tasks. The FusionBrain Challenge combines the following tasks: Code2code Translation, Handwritten Text Recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal multitask architecture as a baseline solution, with a frozen foundation model at its core, trained in Fusion mode as well as in Single-task mode. The proposed Fusion approach proves to be competitive with, and more energy-efficient than, the task-specific approach.
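To make the baseline design concrete, below is a minimal PyTorch sketch of the "many heads but one brain" pattern described in the abstract: a single frozen transformer core shared across tasks, with only small task-specific input projections and output heads being trained. All module sizes, feature dimensions, and task names here are illustrative assumptions, not the authors' exact baseline (which builds on a pretrained foundation model; a randomly initialised encoder stands in for it here).

import torch
import torch.nn as nn

class FusionStyleModel(nn.Module):
    """Illustrative sketch: one frozen shared core, several trainable heads."""

    def __init__(self, d_model=256, vocab_size=1000, num_vqa_answers=100):
        super().__init__()
        # Shared core ("one brain"): frozen, as in the Fusion-mode baseline.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=4)
        for p in self.core.parameters():
            p.requires_grad = False  # only the heads below are trained

        # Trainable task-specific heads ("many heads").
        self.text_embed = nn.Embedding(vocab_size, d_model)  # code/text input
        self.image_proj = nn.Linear(2048, d_model)           # image features
        self.lm_head = nn.Linear(d_model, vocab_size)        # token generation
        self.vqa_head = nn.Linear(d_model, num_vqa_answers)  # answer classes
        self.detection_head = nn.Linear(d_model, 4)          # box regression

    def forward(self, task, tokens=None, image_feats=None):
        if task in ("code2code", "htr"):
            # Text-only tasks: embed tokens, run the core, predict tokens.
            h = self.core(self.text_embed(tokens))
            return self.lm_head(h)
        if task == "vqa":
            # Concatenate projected image features with question tokens.
            x = torch.cat([self.image_proj(image_feats),
                           self.text_embed(tokens)], dim=1)
            return self.vqa_head(self.core(x).mean(dim=1))
        if task == "zsod":
            # Zero-shot detection: predict normalized boxes per position.
            x = torch.cat([self.image_proj(image_feats),
                           self.text_embed(tokens)], dim=1)
            return self.detection_head(self.core(x)).sigmoid()
        raise ValueError(f"unknown task: {task}")

model = FusionStyleModel()
logits = model("vqa", tokens=torch.randint(0, 1000, (2, 16)),
               image_feats=torch.randn(2, 10, 2048))
print(logits.shape)  # torch.Size([2, 100])

In this setup, Fusion mode corresponds to training all heads jointly against the shared frozen core, while Single-task mode would train one head in isolation; the shared-core design is what makes the Fusion approach more parameter- and energy-efficient than maintaining a separate full model per task.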
Keywords:
multimodality, multitask, bilinguality, foundation models, FusionBrain challenge.
Citation:
Bakshandaeva D, Dimitrov D, Arkhipkin V, Shonenkov A, Potanin M, Karachev D, Kuznetsov A, Voronov A, Petiushko A, Davydova V, Tutubalina E. Many heads but one brain: FusionBrain – a single multimodal multitask architecture and a competition. Computer Optics 2023; 47(1): 185-195. DOI: 10.18287/2412-6179-CO-1220.
Acknowledgements:
We would like to thank Sber and SberCloud for granting GPU resources to us for experimenting with different architectures and to the participants for training their models, and for supporting the FusionBrain Challenge in general.
References:
- Sludnova AA, Shutko VV, Gaidel AV, Zelter PM, Kapishnikov AV, Nikonorov AV. Identification of pathological changes in the lungs using an analysis of radiological reports and tomographic images. Computer Optics 2021; 45(2): 261-266. DOI: 10.18287/2412-6179-CO-793.
- Liu X, He P, Chen W, Gao J. Multi-task deep neural networks for natural language understanding. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1901.11504>.
 
- Hu R, Singh A. Unit: Multimodal  multitask learning with a unified transformer. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.10772>.
 
- Liang PP, Liu Z, Zadeh AB,  Morency L-P. Multimodal language analysis with recurrent multistage fusion. Proc  2018 Conf on Empirical Methods in Natural Language Processing 2018: 150-161.
 
- Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1908.03557>.
 
- Das A, Wahi JS, Li S. Detecting  hate speech in multi-modal memes. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2012.14891>.
 
- Savchenko A, Alekseev A, Kwon  S, Tutubalina E, Myasnikov E, Nikolenko S. Ad lingua: Text classification improves  symbolism prediction in image advertisements. Proc 28th Int Conf on  Computational Linguistics 2020: 1886-1892. DOI:  10.18653/v1/2020.coling-main.171.
 
- Jaegle A, Gimeno F, Brock A, Zisserman A, Vinyals O, Carreira J. Perceiver: General perception with iterative attention. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.03206>.
 
- Jaegle A, Borgeaud S, Alayrac J-B, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2107.14795>.
 
- Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. Preprint. 2019. Source: <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>.
 
- Alayrac J-B, Donahue J, Luc P,  Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R,  Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J,  Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R,  Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for  few-shot learning. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2204.14198>.
 
- Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, de Las Casas D, Hendricks LA, Welbl J, Clark A, Hennigan T, Noland E, Millican K, van den Driessche G, Damoc B, Guy A, Osindero S, Simonyan K, Elsen E, Rae JW, Vinyals O, Sifre L. Training compute-optimal large language models. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2203.15556>.
 
- Nagrani A, Yang S, Arnab A, Jansen A, Schmid C, Sun C. Attention  bottlenecks for multimodal fusion. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2107.00135>.
 
- Komkov S, Dzabraev M, Petiushko A. Mutual modality learning for  video action classification. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2011.02543>.
 
- Lu K, Grover A, Abbeel P, Mordatch I. Pretrained transformers as  universal computation engines. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.05247>.
 
- Houlsby  N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, Attariyan  M, Gelly S. Parameter-efficient transfer learning for nlp. Proc 36th Int Conf on Machine Learning (PMLR '19)  2019: 2790-2799.
 
- Pfeiffer J, Kamath A, Rücklé A, Cho K, Gurevych I. Adapterfusion: Non-destructive task  composition for transfer learning. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2005.00247>.
 
- Tay Y, Zhao Z, Bahri D, Metzler D, Juan D-C. Hypergrid transformers: Towards a single model for multiple tasks. Int Conf on Learning Representations (ICLR 2021) 2021: 1-14. Source: <https://openreview.net/pdf?id=hiq1rHO8pNT>.
 
- Pilault J, Elhattami A, Pal C. Conditionally adaptive multi-task  learning: Improving transfer learning in nlp  using fewer parameters & less data. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2009.09139>.
 
- Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron  G, Gimenez M, Sulsky Y, Kay J, Springenberg JT, Eccles T, Bruce J, Razavi A,  Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, de Freitas N. A  generalist agent. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2205.06175>.
 
- Maillard J, Karpukhin V, Petroni F, Yih W-t, Ouguz B, Stoyanov V, Ghosh G. Multi-task retrieval for knowledge-intensive tasks. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2101.00117>.
 
- Gabeur V, Sun C, Alahari K, Schmid C. Multi-modal transformer for  video retrieval. 16th European Conf on Computer Vision (ECCV 2020) 2020:  214-229.
 
- Dzabraev M, Kalashnikov M, Komkov S, Petiushko A. Mdmmt: Multidomain multimodal  transformer for video retrieval. Proc IEEE/CVF Conf on Computer Vision and  Pattern Recognition 2021: 3354-3363.
 
- Ahmad WU, Tushar MdGR, Chakraborty S, Chang K-W. AVATAR: A parallel corpus for java-python program translation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.11590>.
 
- Python Tokenizer. 2021. Source: <https://docs.python.org/3/library/tokenize.html>.
 
- Puri R, Kung DS, Janssen G, Zhang W, Domeniconi G, Zolotov V, Dolby  J, Chen J, Choudhury M, Decker L, Thost V, Buratti L, Pujar S, Ramji S, Finkler  U, Malaika S, Reiss F. Codenet: A  large-scale ai for code dataset for learning a diversity of coding tasks. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2105.12655>.
 
- Ren S, Guo D, Lu S, Zhou L, Liu S, Tang D, Sundaresan N, Zhou M,  Blanco A, Ma S. Codebleu: a  method for automatic evaluation of code synthesis. arXiv preprint. 2020.  Source: <https://arxiv.org/abs/2009.10297>.
 
- IAM handwriting database. 2021. Source: <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>.
 
- HTRdataset. 2021. Source: <https://github.com/sberbank-ai/htrdatasets>.
 
- Gu X, Lin T-Y, Kuo W, Cui Y. Open-vocabulary object detection via  vision and language knowledge distillation. arXiv preprint. 2021. Source:  <https://arxiv.org/abs/2104.13921>.
 
- Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S,  Kalantidis Y, Li L-J, Shamma DA, Bernstein MS, Li F-F. Visual genome:  Connecting language and vision using crowdsourced dense image annotations.  arXiv preprint. 2016. Source: <https://arxiv.org/abs/1602.07332>.
 
- Thomee B, Shamma DA, Friedland G, Elizalde B,  Ni K, Poland D, Borth D,  Li L-J. Yfcc100m: The new data in multimedia research. arXiv preprint.  2015. Source: <https://arxiv.org/abs/1503.01817>.
 
- FusionBrain Concept. 2021. Source: <https://colab.research.google.com/drive/1YAkxWG0dRKPtqy9CZxFPvCNCCXvMGr65?usp=sharing>.
 
- Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and  risks of foundation models. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.07258>.
 
- Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep  bidirectional transformers for language understanding. arXiv preprint. 2019.  Source: <https://arxiv.org/abs/1810.04805>.
 
- Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O,  Stoyanov V, Zettlemoyer L. Bart:  Denoising sequence-to-sequence pre-training for natural language generation,  translation, and comprehension. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1910.13461>.
 
- Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y,  Li W, Liu PJ. Exploring the limits of transfer learning with a unified  text-to-text transformer. arXiv preprint. 2020. Source: <https://arxiv.org/abs/1910.10683>.
 
- Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P,  Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger  G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M,  Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford  A, Sutskever I, Amodei D. Language models are few-shot learners. arXiv  preprint. 2020. Source: <https://arxiv.org/abs/2005.14165>.
 
- Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G,  Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable  visual models from natural language supervision. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2103.00020>.
 
- Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M,  Sutskever I. Zero-shot text-to-image generation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.12092>.
 
- Rubinstein R, Davidson W. The cross-entropy method for combinatorial  and continuous optimization. Methodol Comput Appl Probab 1999; 1: 127-190.
 
- Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proc 23rd Int Conf on Machine Learning (ICML'06) 2006: 369-376.
 
- Shonenkov A, Karachev D, Novopoltsev M, Potanin M, Dimitrov D. Stackmix and blot augmentations for handwritten  text recognition. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2108.11667>.
 
- de Buy Wenniger GM, Schomaker L, Way A. No padding please: Efficient neural handwriting recognition. 2019 Int Conf on Document Analysis and Recognition (ICDAR) 2019: 355-362.
 
- Michael J, Labahn R, Grüning T, Zollner J. Evaluating sequence-to-sequence models for handwritten text recognition. 2019 Int Conf on Document Analysis and Recognition (ICDAR) 2019: 1286-1293.
 
- Potanin M, Dimitrov D, Shonenkov A, Bataev V, Karachev D,  Novopoltsev M. Digital peter:  Dataset, competition and handwriting recognition methods. arXiv preprint. 2021.  Source: <https://arxiv.org/abs/2103.09354>.
 
- He K, Zhang X, Ren S, Sun J. Deep residual learning for image  recognition. arXiv preprint. 2015. Source: <https://arxiv.org/abs/1512.03385>.
 
- Kamath A, Singh M, LeCun Y, Synnaeve G, Misra I, Carion N. Mdetr – modulated detection for  end-to-end multi-modal understanding. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2104.12763>.
 
- Rezatofighi  H, Tsoi N, Gwak JY, Sadeghian A, Reid I, Savarese S. Generalized intersection  over union: A metric and a loss for bounding box regression. 2019 IEEE/CVF Conf  on Computer Vision and Pattern Recognition (CVPR) 2019: 658-666.
 
- Marti UV, Bunke H. The iam-database:  an english sentence database for  offline handwriting recognition. Int J Doc Anal Recognit 2002; 5: 39-46.
 
- Goyal  Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the V in VQA matter:  Elevating the role of image understanding in visual question answering. arXiv  preprint. 2016. Source: <https://arxiv.org/abs/1612.00837>.
 
- Ahmad WU, Chakraborty S, Ray B, Chang K-W. Unified pre-training for  program understanding and generation. arXiv preprint. 2021. Source:  <https://arxiv.org/abs/2103.06333>.
 
- Chaudhary K, Bali R. Easter2.0: Improving convolutional models for  handwritten text recognition. arXiv preprint. 2022. Source: <https://arxiv.org/abs/2205.14879?context=cs.AI>.
 
- Henderson P, Hu J, Romoff J, Brunskill E, Jurafsky D, Pineau J. Towards the systematic reporting of the energy and carbon footprints of machine learning. J Mach Learn Res 2020; 21(248): 1-43.
 
- Patterson D, Gonzalez J, Le Q, Liang C, Munguia L-M, Rothchild D, So  D, Texier M, Dean J. Carbon emissions and large neural network training. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2104.10350>.
 
- Lacoste A, Luccioni A, Schmidt V, Dandres T. Quantifying the carbon  emissions of machine learning. arXiv preprint. 2019. Source: <https://arxiv.org/abs/1910.09700>.
 
- Cowls J, Tsamados A, Taddeo M, Floridi L. The AI gambit – leveraging artificial intelligence to  combat climate change: Opportunities, challenges, and recommendations. AI Soc  2021; 18: 1-25.
 
- FusionBrain challenge. 2021. Source: <https://github.com/ai-forever/fusion_brain_aij2021>.
 
- DS Works. 2021. Source: <https://dsworks.ru/champ/fb5778a8-94e9-46de-8bad-aa2c83a755fb>.
 
- Cho J, Lei J, Tan H, Bansal M. Unifying vision-and-language tasks  via text generation. arXiv preprint. 2021. Source: <https://arxiv.org/abs/2102.02779>.
 
- Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. arXiv preprint. 2020. Source: <https://arxiv.org/abs/2012.12877>.
  
- Li M, Lv T, Chen J, Cui L, Lu Y, Florencio D, Zhang  C, Li Z, Wei F. Trocr:  Transformer-based optical character recognition with pre-trained models. arXiv  preprint. 2021. Source: <https://arxiv.org/abs/2109.10282>.
         
         
  