(44-2) 19 * << * >> * Russian * English * Content * All Issues
Reducing computational costs in deep learning
on almost linearly separable training data
I.M. Kulikovsvkikh 1,2,3
1 Samara National Research University,
443086, Russia, Samara, Moskovskoe Shosse 34,
2 Faculty of Electrical Engineering and Computing, University of Zagreb,
10000, Croatia, Zagreb, Unska 3,
3 Rudjer Boskovic Institute, 10000, Croatia, Zagreb, Bijenicka cesta 54
PDF, 1294 kB
Full text of article: Russian language.
Previous research in deep learning indicates that iterations of the gradient descent, over separable data converge toward the L2 maximum margin solution. Even in the absence of explicit regularization, the decision boundary still changes even if the classification error on training is equal to zero. This feature of the so-called “implicit regularization” allows gradient methods to use more aggressive learning rates that result in substantial computational savings. However, even if the gradient descent method generalizes well, going toward the optimal solution, the rate of convergence to this solution is much slower than the rate of convergence of a loss function itself with a fixed step size. The present study puts forward the generalized logistic loss function that involves the optimization of hyperparameters, which results in a faster convergence rate while keeping the same regret bound as the gradient descent method. The results of computational experiments on MNIST and Fashion MNIST benchmark datasets for image classification proved the viability of the proposed approach to reducing computational costs and outlined directions for future research.
implicit regularization, gradient method, convergence rate, linear separability, image classification.
Kulikovskikh IM. Reducing computational costs in deep learning on almost linearly separable training data. Computer Optics 2020; 44(2): 282-289. DOI: 10.18287/2412-6179-CO-645.
This work was supported by the Russian Federation President's grant (Project No. MK-6218.2018.9), the Ministry of Education and Science of the Russian Federation (Project No. 074-U01), RFBR (Project No. 18-37-00219), and the Centre of Excellence project “DATACROSS”, co-financed by the Croatian Government and the European Union through the European Regional Development Fund - the Competitiveness and Cohesion Operational Programme (KK.01.1.1.01.0009).
LeCun Y, Bengio Y. Deep learning. Nature 2015; 521(7553): 436-444. DOI: 10.1038/nature14539.
- Goodfellow I, Bengio Y, Courville Y. Deep learning. Cambridge, London: The MIT Press; 2016. ISBN: 978-0-262-03561-3.
- Neyshabur B, Tomioka R, Srebro N. In search of the real inductive bias: On the role of implicit regularization in deep learning. Source: <https://arxiv.org/abs/1412.6614>.
- Soudry D, Hoffer E, Nacson MS, Gunasekar S, Srebro N. The implicit bias of gradient descent on separable data. J Mach Learn Res 2018; 19: 1-57.
- Zhang C, Bengio S, Recht M, Vinyals O. Understanding deep learning requires rethinking generalization. Source: <https://arxiv.org/abs/1611.03530>.
- Hoffer E, Hubara I, Soudry D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Source: <https://arxiv.org/abs/1705.08741>.
- Nacson MS, Lee J, Gunasekar S, Srebro N, Soudry D. Convergence of gradient descent on separable data. 2019 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019; PMLR 89: 3420-3428.
- Gunasekar S, Lee J, Soudry D, Srebro N. Characterizing implicit bias in terms of optimization geometry. 2018 35th International Conference on Machine Learning (ICML) 2018; PMLR 80: 1832-1841.
- Ma C, Wang K, Chi Y, Chen Y. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. 2018 35th International Conference on Machine Learning (ICML) 2018; PMLR 80: 3345-3354.
- Kingma DP, Ba JL. Adam: A method for stochastic optimization. Source: <https://arxiv.org/abs/1412.6980>.
- Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 2011; 12: 2121-2159.
- Zeiler MD. ADADELTA: An adaptive learning rate method. Source: <https://arxiv.org/abs/1212.5701>.
- Kim HS, Kang JH, Park WM, Ko SH, Cho YH, Yu DS, Song YS, Choi JW. Convergence analysis of optimization algorithms. Source: <https://arxiv.org/abs/1707.01647>.
- Ruder S. An overview of gradient descent optimization algorithms. Source: <https://arxiv.org/abs/1609.04747>..
- Wilson AC, Roelofs R, Stern M, Srebro N, Recht B. The marginal value of adaptive gradient methods in machine learning. 2017 31st Conference on Neural Information Processing Systems (NIPS) 2017: 1-11.
- Vorontsov KV. Mathematical methods for supervised learning (machine learning theory) [In Russian]. Source: <http:// www.machinelearning.ru/wiki/images/6/6d/Voron-ML-1.pdf>.
- Castaneda ARS, Torres ER, Goris NAV, González MM, Reyes JB, González VGS, et al. New formulation of the Gompertz equation to describe the kinetics of untreated tumors. PLoS ONE 2019; 14(11): e0224978.
- Kulikovskikh I, Prokhorov S, Lipić T, Legović T, Šmuc T. BioGD: Bio-inspired robust gradient descent. PLoS ONE 2019; 14(7): e0219004.
- Kulikovskikh I, Prokhorov S, Legović T, Šmuc T. An SGD-based meta-learner with “growing” descent. J Phys: Conf Ser 2019; 1368: 052008.
- Savchenko AV. Maximum-likelihood dissimilarities in image recognition with deep neural networks. Computer Optics 2017; 41(3): 422-430. DOI: 10.18287/2412-6179-2017-41-3-422-430.
- An S, Boussaid F, Bennamoun M. How can deep rectifier networks achieve linear separability and preserve distances? 2015 32nd International Conference on Machine Learning (ICML) 2015; PMLR 375: 514-523.
- Bergstra J, Bengio Y. Random search for hyperparameter optimization. J Mach Learn Res 2012; 13: 281-305.
© 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: firstname.lastname@example.org ; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20