The main directions of current research:

Distributed optimization

Recent theoretical advances in decentralized distributed convex optimization

Decentralized optimization over time-varying graphs

In particular, under similarity conditions and with variance reduction: Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity

Optimal Algorithms for Decentralized Stochastic Variational Inequalities

Projects that can be done in this area are described below (there are many more; only a few examples are given here):

Optimal bounds for variational inequalities (VI) in the Federated Learning (FL) architecture

As a first step, we strongly recommend working through the article The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication, especially the derivation of formulas (12) and (13). In essence, what is meant there is an accelerated method with batching, run with two different batch sizes: 1 and the largest batch size at which batching still gives an effect.
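For intuition only, here is a minimal sketch (not the method analyzed in the cited paper) of a Nesterov-type accelerated stochastic gradient method in which the only knob that is varied is the batch size; the toy quadratic objective, the noise model, and all names are our own illustrative assumptions.

import numpy as np

# Minimal sketch: accelerated SGD where only the batch size B changes.
# Toy objective f(x) = 0.5 * ||A x - b||^2 with additive Gaussian noise
# in the stochastic gradient (all of this is illustrative).

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((100, d)) / 10
b = rng.standard_normal(100)
L = np.linalg.norm(A.T @ A, 2)          # smoothness constant of f

def stochastic_grad(x, batch_size, noise=0.1):
    """Average of `batch_size` noisy gradient samples."""
    g = A.T @ (A @ x - b)
    samples = g + noise * rng.standard_normal((batch_size, d))
    return samples.mean(axis=0)

def accelerated_sgd(batch_size, iters=200):
    x = y = np.zeros(d)
    for k in range(iters):
        x_next = y - (1.0 / L) * stochastic_grad(y, batch_size)
        y = x_next + (k / (k + 3)) * (x_next - x)   # Nesterov momentum
        x = x_next
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

for B in (1, 64):                        # batch size 1 vs. a "large" batch
    print(B, accelerated_sgd(B))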

The task itself is to transfer the results of The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication, in terms of optimal methods, to saddle-point problems and VIs, building on the method from Solving variational inequalities with stochastic mirror-prox algorithm and its batched version.
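In the same hedged spirit, here is a minimal sketch of a stochastic mirror-prox (extragradient) step for a variational inequality with a batched operator estimate; the Euclidean prox is used for simplicity and the toy bilinear operator is our own choice, not the setting of the cited papers.

import numpy as np

# Minimal sketch of stochastic mirror-prox (extragradient) for a VI with
# operator F, using a batched estimate of F. The toy operator corresponds
# to the bilinear saddle problem min_x max_y x^T M y (illustrative only).

rng = np.random.default_rng(1)
d = 10
M = rng.standard_normal((d, d)) / d

def F(z):                                # z = (x, y), F(z) = (M y, -M^T x)
    x, y = z[:d], z[d:]
    return np.concatenate([M @ y, -M.T @ x])

def F_batched(z, batch_size, noise=0.1):
    samples = F(z) + noise * rng.standard_normal((batch_size, 2 * d))
    return samples.mean(axis=0)

def mirror_prox(batch_size, step=0.1, iters=500):
    z = np.ones(2 * d)
    for _ in range(iters):
        z_half = z - step * F_batched(z, batch_size)       # extrapolation step
        z = z - step * F_batched(z_half, batch_size)       # update step
    return np.linalg.norm(F(z))

print(mirror_prox(batch_size=1), mirror_prox(batch_size=32))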

A more subtle task is to obtain a lower bound for VIs in the FL architecture, similar to how it was done in The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication. In other architectures this has been done, for example, in Distributed saddle-point problems: lower bounds, near-optimal and robust algorithms.

For bilinear saddle-point problems (with quadratic composites), and even for the more general class of VIs with a linear vector field, it seems that in stochastic settings one can obtain results like those in Is Local SGD Better than Minibatch SGD? (Section 3). It would be great to check this.
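For orientation, a minimal sketch of the two baselines compared in Is Local SGD Better than Minibatch SGD?, run on a toy quadratic with synthetic noise; the objective, step sizes, and budgets are illustrative assumptions, not the setting of the paper.

import numpy as np

# Local SGD: each of M workers does K local steps per round, then the
# iterates are averaged. Minibatch SGD: one shared iterate, K*M stochastic
# gradients averaged per communication round.

rng = np.random.default_rng(2)
d, M, K, R, step = 5, 8, 10, 50, 0.05    # dim, workers, local steps, rounds

def grad(x, noise=0.5):                  # stochastic gradient of 0.5*||x||^2
    return x + noise * rng.standard_normal(d)

def local_sgd():
    x = np.ones(d)
    for _ in range(R):
        locals_ = []
        for _ in range(M):
            z = x.copy()
            for _ in range(K):
                z -= step * grad(z)
            locals_.append(z)
        x = np.mean(locals_, axis=0)     # communicate: average local iterates
    return np.linalg.norm(x)

def minibatch_sgd():
    x = np.ones(d)
    for _ in range(R):
        g = np.mean([grad(x) for _ in range(M * K)], axis=0)
        x -= step * K * g                # step scaled by K to match the local budget (a modeling choice)
    return np.linalg.norm(x)

print(local_sgd(), minibatch_sgd())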

Optimal bounds for convex optimization problems under interpolation conditions in the FL architecture

In 2021, an article appeared that showed for the first time how accelerated methods combine with interpolation conditions in terms of batching: An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

However, this has not yet been done in the FL architecture, although without interpolation conditions it has (optimal methods for smooth convex optimization problems in the FL architecture): The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication. The task is to combine the results of these two works.
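For reference, by the interpolation condition one usually means (in our notation, which may differ slightly from the cited paper) that a minimizer of the expected objective also minimizes every realization, so that the variance of the stochastic gradient vanishes at the optimum:

f(x) = \mathbb{E}_{\xi} f(x;\xi), \qquad \nabla f(x^*;\xi) = 0 \ \text{for almost all } \xi \quad\Longrightarrow\quad \mathbb{E}_{\xi}\,\|\nabla f(x^*;\xi)\|^2 = 0.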

A relevant article on the topic: Faster Convergence of Local SGD for Over-Parameterized Models

Variance reduction, proximal envelopes, and similarity in a non-Euclidean prox setup, with an application to the Wasserstein barycenter problem

In Dmitry Kovalev's 2022 work Optimal Algorithms for Decentralized Stochastic Variational Inequalities, an optimal decentralized algorithm with variance reduction was proposed for solving unconstrained (strongly convex-concave) saddle-point problems. As is well known, the problem of computing the Wasserstein barycenter (see Decentralized Distributed Optimization for Saddle Point Problems, as well as Numerical Methods for Large-Scale Optimal Transport) can be reduced to exactly such a saddle-point problem, and it is convenient to solve it in a distributed (including decentralized) manner because of the high computational cost. Moreover, a single node can, of course, store several images (several components of the original large sum).
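Schematically (in our notation; the cited papers use related but more detailed reformulations), the reduction goes through the standard LP duality for discrete optimal transport:

W(p, q) = \min_{X \ge 0,\; X\mathbf{1} = p,\; X^{\top}\mathbf{1} = q} \langle C, X \rangle = \max_{y_j + z_k \le C_{jk}} \big[\langle y, p\rangle + \langle z, q\rangle\big],

so the barycenter problem \min_{p \in \Delta_n} \frac{1}{m}\sum_{i=1}^{m} W(p, q_i) becomes \min_{p \in \Delta_n} \max_{\{(y_i, z_i)\}} \frac{1}{m}\sum_{i=1}^{m} \big[\langle y_i, p\rangle + \langle z_i, q_i\rangle\big] with the constraints (y_i)_j + (z_i)_k \le C_{jk}, i.e., a bilinear (convex-concave, but not strongly so) saddle-point problem in p and the dual variables, with a simplex constraint on p.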

Thus, we arrive at a convex-concave saddle-point problem of decentralized optimization of sum type, in which each node stores a function that is itself a sum of functions. Decentralized algorithms that take this sum structure into account, under the assumption that each term of the sum can be accessed separately (for example, that the gradient of each term can be computed), are called decentralized methods with variance reduction. The results of the article by Kovalev et al. are not suitable for the problem of computing the Wasserstein barycenter (CWB) for several reasons at once:

1) in the saddle-point representation, the CWB problem is not strongly convex-concave (it is only convex-concave);

2) in the saddle-point representation, the CWB problem has constraints;

3) the scheme of Kovalev et al. assumes that the Euclidean norm is chosen for both the max and the min variables in the optimal method, whereas in the CWB problem (as follows from the work of Rogozin et al.) it is advantageous to choose the 1-norm for the group of min variables, since these variables are constrained to a simplex (a sketch of the corresponding prox step is given right after this list).
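For illustration, a minimal sketch of the non-Euclidean prox step referred to in point 3: for simplex-constrained variables the natural choice is a mirror step with the entropy prox-function (the 1-norm setup) rather than a Euclidean projection. The code and notation are ours, not taken from the cited papers.

import numpy as np

# One mirror-descent step w.r.t. the KL divergence (entropy prox) on the
# probability simplex; the iterate stays on the simplex by construction.

def entropic_mirror_step(p, g, eta):
    w = p * np.exp(-eta * g)
    return w / w.sum()

p = np.full(4, 0.25)                    # uniform start on the simplex
g = np.array([1.0, -0.5, 0.2, 0.0])     # some (sub)gradient in the min variables
print(entropic_mirror_step(p, g, eta=0.5))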

It is planned to extend the results of Kovalev et al. to a more general class of convex-concave problems with constraints and not necessarily Euclidean prox structures, so that they can be applied to the CWB problem in the saddle-point representation.

Variance reduction in a non-Euclidean prox setup (but in the non-distributed case) was previously worked out in Stochastic Variance Reduction for Variational Inequality Methods for saddle-point problems, and in A unified variance-reduced accelerated gradient method for convex optimization for ordinary minimization.

In Alexander Beznosikov's work Distributed Saddle-Point Problems Under Data Similarity (see also Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity, a variant without an extra logarithm, though in the centralized case), an efficient distributed (including decentralized) method was proposed for unconstrained strongly convex-concave saddle-point problems of sum type in which the terms are similar functions. As noted above, in the saddle-point representation of the CWB problem the terms themselves have the form of sums of functions. If the CWB problem comes from data analysis and each image/measure (whose barycenter is sought) is sampled independently from the same distribution, then each node stores a sum of statistically independent functions drawn from the same distribution. This means that the functions stored at the nodes are statistically similar, with the degree of similarity inversely proportional to the square root of the number of terms stored at a node.
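In our notation, this similarity can be written schematically as follows (a hedged statement; the exact assumptions differ between the cited papers): if node i stores f_i = \frac{1}{n}\sum_{j=1}^{n} f_{ij} with the f_{ij} sampled i.i.d. from the same distribution, then with high probability

\big\| \nabla^2 f_i(z) - \nabla^2 f(z) \big\| \le \delta, \qquad \delta = O\!\big(1/\sqrt{n}\big), \qquad f = \frac{1}{m}\sum_{i=1}^{m} f_i,

and methods that exploit this \delta-similarity can make the number of communications depend on \delta rather than on the (typically much larger) Lipschitz constant of the gradient.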

Thus, the structure of the problem can be exploited in a different way: instead of variance reduction (which saves oracle calls at the nodes as much as possible, i.e., the number of gradient computations of the terms at each node, but does not save on the number of communications between nodes), one can use the similarity of the functions stored at the nodes and save substantially on the number of communications. Unfortunately, a direct application of the results of Beznosikov et al. to the CWB problem in the saddle-point representation does not work, for exactly the same three reasons indicated above. It is planned to extend the results of Beznosikov et al. to a more general class of convex-concave problems with constraints and not necessarily Euclidean prox structures, so that they can be applied to the CWB problem in the saddle-point representation.

In both projects, the most difficult part seems to be the generalization to non-Euclidean prox structures and to the presence of constraints. All of this will require substantially new ideas and technique: in the articles of Kovalev, Beznosikov, and co-authors the Euclidean norm is used in an essential way, and their arguments no longer go through for a general norm.

Tensor methods and their applications

Exploiting higher-order derivatives in convex optimization methods

In this direction, the plan is to develop the following idea: quadratic stochastic optimization problems in the federated learning architecture can be solved very efficiently in terms of the required number of communications, while for tensor methods the auxiliary problem that has to be solved at each iteration is similar to the subproblem of Newton's method, i.e., it is quadratic. Hence there is hope to reduce the number of communications when solving stochastic distributed optimization problems with tensor methods.
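For reference, the auxiliary problem in question is the minimization of the regularized p-th order Taylor model, written here schematically in our notation (conventions for the regularization constant vary between papers); for p = 2 this is a quadratic model with cubic regularization, i.e., a Newton-type step:

x_{k+1} = \arg\min_{y} \Big\{ \sum_{i=1}^{p} \frac{1}{i!}\, D^{i} f(x_k)[y - x_k]^{i} + \frac{M_p}{(p+1)!}\, \|y - x_k\|^{p+1} \Big\}.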

In 2022, optimal tensor methods (without extra logarithms in the complexity estimates) were obtained in this area by laboratory staff: The first optimal acceleration of high-order methods in smooth convex optimization. However, many questions related to this approach remain open: is such a method primal-dual? Is it possible to build a batched stochastic variant on top of it? Etc.

Gradient-free optimization methods

An accelerated method for derivative-free smooth stochastic convex optimization

One-Point Gradient-Free Methods for Smooth and Non-Smooth Saddle-Point Problems

One-Point Feedback for Composite Optimization with Applications to Distributed and Federated Learning

The Power of First-Order Smooth Optimization for Black-Box Non-Smooth Problems

Gradient-Free Optimization for Non-Smooth Saddle Point Problems under Adversarial Noise

Randomized gradient-free methods in convex optimization

Recent research in this area focuses not only on obtaining methods with optimal bounds on the number of oracle calls, but also on methods with optimal bounds on the number of sequential iterations and on methods that are maximally robust to the level of deterministic (adversarial) noise in the function values. Often these turn out to be the same methods, optimal with respect to all criteria at once.

Despite considerable recent progress, there are still open questions here. For example, for smooth problems, lower bounds on the maximum admissible level of (adversarial) noise under which the problem can still be solved with a given accuracy are unknown. Also unknown, in the general case, are large-deviation probability bounds for stochastic optimization problems with sub-Gaussian noise in a non-Euclidean prox setup. Moreover, since a stochastic gradient model can be formed from the computed function values in different ways (Gradient-free prox methods with an imprecise oracle for nonsmooth convex stochastic optimization problems on a simplex; A gradient estimator via L1-randomization for online zero-order optimization with two point feedback; Randomized gradient-free methods in convex optimization), there is additionally the task of studying different randomizations in order to identify the best one according to the criteria listed above.
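As an illustration of the kind of randomization being compared, here is a minimal sketch of the two-point feedback estimator with a direction drawn uniformly from the Euclidean sphere; the l1/simplex and one-point randomizations from the papers above change only how the direction is drawn and how the finite difference is scaled. The toy objective and all parameters are our assumptions.

import numpy as np

# Two-point gradient-free estimator: d * (f(x + tau*e) - f(x - tau*e)) / (2*tau) * e,
# with e uniform on the unit sphere; used inside a plain descent loop.

rng = np.random.default_rng(3)

def two_point_estimator(f, x, tau):
    d = x.size
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)
    return d * (f(x + tau * e) - f(x - tau * e)) / (2 * tau) * e

f = lambda x: 0.5 * np.dot(x, x)          # toy smooth objective
x = np.ones(5)
for _ in range(200):                      # gradient-free descent
    x -= 0.1 * two_point_estimator(f, x, tau=1e-3)
print(np.linalg.norm(x))                  # should be close to 0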

Structural minimization

Accelerated methods for saddle-point problems

On Accelerated Methods for Saddle-Point Problems with Composite Structure

Accelerated Primal-Dual Gradient Method for Smooth and Convex-Concave Saddle-Point Problems with Bilinear Coupling

The First Optimal Algorithm for Smooth and Strongly-Convex-Strongly-Concave Minimax Optimization

In ordinary minimization it is customary to describe the objective function (as a representative of some large class of functions) by constants that characterize its behavior jointly over all of its arguments. In many applications, however (especially when solving saddle-point problems), different blocks of variables arise naturally, and the properties of the objective function with respect to these blocks differ significantly. For example, this shows up in multi-stage traffic-flow models when searching for equilibria.

Of course, a researcher can fall back on the worst constants and use well-known methods, but how much can one gain by exploiting such an asymmetric (block) structure? Coordinate (component-wise) methods provide a partial answer to this question, but in the general case it remains open. Recently, quite a lot of progress has been made in this direction for saddle-point problems with different strong convexity and strong concavity constants and different Lipschitz constants of the gradient. For problems with a min-min structure, however, there are fewer results.
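As far as we understand these works, the kind of refinement meant here looks schematically as follows: for a smooth strongly-convex-strongly-concave saddle-point problem with block constants L_x, L_{xy}, L_y and \mu_x, \mu_y, the optimal number of gradient computations is, up to constants and logarithmic factors,

\sqrt{\frac{L_x}{\mu_x} + \frac{L_{xy}^2}{\mu_x \mu_y} + \frac{L_y}{\mu_y}}\;\log\frac{1}{\varepsilon},

instead of the symmetric bound \sqrt{L/\mu}\,\log(1/\varepsilon) with L = \max\{L_x, L_{xy}, L_y\} and \mu = \min\{\mu_x, \mu_y\}.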

Moreover, for both min-max and min-min problems, questions about lower bounds are still open. In particular, for small-dimensional convex optimization problems, lower bounds are obtained using a resisting oracle (see "3. Methods with linear convergence, II"; it is better, though, to start from the very first section of Lecture 1, where everything is clearer in the one-dimensional case).

For large-dimensional problems, lower bounds are obtained with the help of the "worst function in the world"; see, for example, the hints to exercises 1.3 and 2.1 of the MCCME textbook, which can be found in Optimization Methods, MIPT. In the work "Solving strongly convex-concave composite saddle problems with a small dimension of one of the groups of variables", min-max problems are studied in which one of the groups of variables has a small dimension and the other, on the contrary, a large one. Upper bounds are obtained there.

It would be interesting to try to obtain lower bounds by combining the two constructions; it seems that the resulting lower-bound example will contain new interesting ideas. As a further development of this project, it would be interesting to think about lower bounds for (non-smooth) min-min problems in which one of the groups of variables has a small dimension. Upper bounds are available in the papers "Solving convex min-min problems with smoothness and strong convexity for one of the groups of variables and small dimension of the other" and Solving smooth min-min and min-max problems by mixed oracle algorithms.

Furthermore, both lower and upper bounds are usually stated for the number of calls to a single gradient oracle, but in problems with different blocks of variables one can speak of different oracles. For instance, in a saddle-point problem we can count gradients in the min-variables and gradients in the max-variables separately. And if the problem also contains "unfriendly" composites (whose gradients also have to be computed precisely because of this unfriendliness), then there are many different oracles and, as a result, many criteria. In the presence of several criteria, optimal methods and lower bounds have to be discussed carefully. It seems interesting to construct an appropriate theory of optimal methods for problems with structure.
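Schematically (in our notation), the composite saddle-point problem meant here has the form

\min_{x} \max_{y} \; \big[\, h(x) + F(x, y) - g(y) \,\big],

where calls to \nabla h, \nabla_x F, \nabla_y F and \nabla g can be counted separately, each with its own cost and its own Lipschitz constant, so that "optimality" becomes a genuinely multi-criteria notion.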
