In March 2017, I joined the Probabilistic Numerics group, where we tackle some of the core aspects underlying the field of Machine Learning from a probabilistic perspective.
A central component of Machine Learning is the training step, which involves finding an optimal parameter configuration w.r.t. a loss function. Because this optimization is computationally expensive, cheaper first-order methods (e.g. SGD) are often preferred over more accurate second-order methods (e.g. Newton, CG). I hope to narrow this trade-off between speed and accuracy by extracting and transferring information that helps second-order methods converge faster, thus reducing the overall computational cost.
Prior to joining the MPI, I finished my studies in Engineering Physics at Chalmers University of Technology in Göteborg, Sweden. The programme consisted of a B.Sc. in Physics and an M.Sc. in Complex Adaptive Systems, of which the first year was spent at TU Delft.
Pre-conditioning is a well-known concept that can significantly improve the convergence of optimization algorithms. For noise-free problems, where good pre-conditioners are not known a priori, iterative linear algebra methods offer one way to efficiently construct them. For the stochastic optimization problems that dominate contemporary machine learning, however, this approach is not readily available. We propose an iterative algorithm inspired by classic iterative linear solvers that uses a probabilistic model to actively infer a pre-conditioner in situations where Hessian-projections can only be constructed with strong Gaussian noise. The algorithm is empirically demonstrated to efficiently construct effective pre-conditioners for stochastic gradient descent and its variants. Experiments on problems of comparably low dimensionality show improved convergence. In very high-dimensional problems, such as those encountered in deep learning, the pre-conditioner effectively becomes an automatic learning-rate adaptation scheme, which we also empirically show to work well.
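To illustrate why a pre-conditioner matters, here is a minimal, noise-free sketch on a two-dimensional quadratic with a diagonal Hessian. It is not the probabilistic algorithm described above: the pre-conditioner is simply assumed to be an approximate inverse Hessian with a 10% error, and the names `gradient`, `descend`, `a_diag` and `approx_inverse` are hypothetical.

```python
def gradient(a_diag, w):
    """Gradient of f(w) = 0.5 * sum(a_i * w_i**2), i.e. a diagonal Hessian."""
    return [a * x for a, x in zip(a_diag, w)]


def descend(a_diag, w0, precond, step, tol=1e-8, max_iter=100_000):
    """Iterate w <- w - step * P * grad(w); return the number of updates needed."""
    w = list(w0)
    for iteration in range(max_iter):
        g = gradient(a_diag, w)
        if max(abs(x) for x in g) < tol:
            return iteration
        w = [x - step * p * gx for x, p, gx in zip(w, precond, g)]
    return max_iter


a_diag = [1.0, 100.0]  # Hessian eigenvalues, condition number 100
w0 = [1.0, 1.0]

# Plain gradient descent: identity pre-conditioner, step bounded by the largest eigenvalue.
plain_iters = descend(a_diag, w0, precond=[1.0, 1.0], step=1.0 / 100.0)

# Pre-conditioned descent: P is an assumed, slightly wrong inverse Hessian.
approx_inverse = [1.0 / (1.1 * a) for a in a_diag]
precond_iters = descend(a_diag, w0, precond=approx_inverse, step=1.0)

print(plain_iters, precond_iters)
```

Even an imperfect pre-conditioner removes most of the ill-conditioning, so the pre-conditioned run needs far fewer updates than the plain one. In the stochastic setting the Hessian information is noisy, which is why the pre-conditioner has to be inferred rather than read off as it is here.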
Solving symmetric positive definite linear problems is a fundamental computational task in machine learning. The exact solution, famously, is cubically expensive in the size of the matrix. To alleviate this problem, several linear-time approximations, such as spectral and inducing-point methods, have been suggested and are now in wide use. These are low-rank approximations that choose the low-rank space a priori and do not refine it over time. While this allows linear cost in the data-set size, it also causes a finite, uncorrected approximation error. Authors from numerical linear algebra have explored ways to iteratively refine such low-rank approximations, at a cost of a small number of matrix-vector multiplications. This idea is particularly interesting in the many situations in machine learning where one has to solve a sequence of related symmetric positive definite linear problems. From the machine learning perspective, such deflation methods can be interpreted as transfer learning of a low-rank approximation across a time-series of numerical tasks. We study the use of such methods for our field. Our empirical results show that, on regression and classification problems of intermediate size, this approach can interpolate between low computational cost and numerical precision.
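The idea of carrying information across a sequence of related symmetric positive definite systems can be sketched with a much simpler stand-in than deflation: warm-starting conjugate gradients from the previous solution. The sketch below is an assumption-laden illustration, not the method studied above. The test matrix (a small tridiagonal Laplacian), the diagonal shift of 0.01, and all function names are hypothetical.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_norm(matrix, rhs, x):
    """Euclidean norm of rhs - matrix @ x."""
    r = [b - ax for b, ax in zip(rhs, matvec(matrix, x))]
    return dot(r, r) ** 0.5


def conjugate_gradient(matrix, rhs, x0, tol=1e-10, max_iter=1000):
    """Solve matrix @ x = rhs for an SPD matrix; return (x, iterations)."""
    x = list(x0)
    residual = [b - ax for b, ax in zip(rhs, matvec(matrix, x))]
    direction = list(residual)
    rs_old = dot(residual, residual)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, iteration
        a_dir = matvec(matrix, direction)
        alpha = rs_old / dot(direction, a_dir)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        residual = [ri - alpha * ai for ri, ai in zip(residual, a_dir)]
        rs_new = dot(residual, residual)
        direction = [ri + (rs_new / rs_old) * di for ri, di in zip(residual, direction)]
        rs_old = rs_new
    return x, max_iter


n = 6
# First task: tridiagonal SPD matrix (2 on the diagonal, -1 next to it).
first = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
# Second, related task: the same matrix with a small diagonal shift.
second = [[first[i][j] + (0.01 if i == j else 0.0) for j in range(n)] for i in range(n)]
rhs = [1.0] * n
zeros = [0.0] * n

x_first, _ = conjugate_gradient(first, rhs, zeros)

# Starting residual on the second task, from scratch and from the first solution.
cold_start_residual = residual_norm(second, rhs, zeros)
warm_start_residual = residual_norm(second, rhs, x_first)

x_cold, cold_iters = conjugate_gradient(second, rhs, zeros)
x_warm, warm_iters = conjugate_gradient(second, rhs, x_first)
print(cold_start_residual, warm_start_residual, cold_iters, warm_iters)
```

Warm-starting only transfers the previous solution, so it shrinks the starting residual on the second task. Deflation methods go further by also transferring an approximate low-rank subspace from one task to the next.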