Variance Reduction Methods for Convex Distributed Optimization

I. Theory

1. Intuition behind variance reduction methods

We consider the unconstrained optimization problem of minimizing a finite sum of $L$-smooth convex functions, such as an empirical risk $R_n$:

$$\min_{x\in \mathbb{R}^d} \ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

In a large dataset, there is often redundancy among the terms $f_i$ of the sum. The gradient $\nabla f_i$ of a single term can therefore be a good approximation of the batch gradient $\nabla f$. SGD takes advantage of this property to reduce the per-iteration cost of descent algorithms. One of its limits, however, is the often large variance of the gradient estimates $\nabla f_i$. Can we reduce the variance of the gradient estimates used at each step?

Variance reduction methods that use gradient aggregation are based on the following observation:

Consider two random variables $X$ and $Y$, and define $Z := X - Y + \mathbb{E}[Y]$. We have

$$\mathbb{E}[Z] = \mathbb{E}[X] \qquad \textrm{and} \qquad \mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y).$$

Take $X = \nabla f_i$.

A consequence of this observation is that, if we can find a $Y$ that is positively correlated with $X$, we can build a new stochastic direction $Z$ that has the same expectation as $X$ (and hence remains an unbiased estimate of the batch gradient) but a smaller variance than $\nabla f_i$.

We have access to such a $Y$: from one iteration to the next, the descent directions typically do not differ much, so gradients evaluated at nearby past iterates are likely to be positively correlated with the current stochastic gradient.
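As a quick numerical sanity check of this observation, here is a small NumPy experiment with purely illustrative distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is the noisy quantity we want to estimate; Y is positively correlated with X.
X = rng.normal(loc=1.0, scale=1.0, size=100_000)
Y = 0.8 * X + rng.normal(scale=0.5, size=100_000)

Z = X - Y + Y.mean()          # Y.mean() stands in for E[Y]

print(X.mean(), Z.mean())     # ~ same expectation (about 1.0)
print(X.var(), Z.var())       # Var(Z) < Var(X) thanks to the positive correlation
```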

SVRG and SAGA are two different ways of building $Y$.

SAGA

At iteration $k$ of the SAGA algorithm, the following update is executed:

$$\begin{cases} i \textrm{ chosen randomly in } \{1, \dots, n\}\\ x_{k+1} \leftarrow x_{k} - \gamma\big(\nabla f_i(x_k) - \alpha_i + \bar{\alpha}\big) \\ \alpha_i \leftarrow \nabla f_i(x_k) \end{cases}$$

where $(\alpha_i)_{i=1}^n$ are gradients evaluated at previous iterates and $\bar{\alpha}=\frac{1}{n}\sum^n_{i=1}\alpha_i$.

Here,

$$\begin{cases} X = \nabla f_i(x_k) \\ Y = \alpha_i \\ \mathbb{E}[Y] = \frac{1}{n}\sum^n_{i=1}\alpha_i = \bar{\alpha} \end{cases}$$
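A minimal NumPy sketch of this update rule (illustrative only, not the my_lib implementation; grad_i(x, i) is assumed to return $\nabla f_i(x)$):

```python
import numpy as np

def saga(grad_i, x0, n, gamma=0.01, n_iter=10_000, rng=None):
    """Minimal SAGA sketch: one stochastic gradient per iteration."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    # Table of stored gradients alpha_i and their running average alpha_bar.
    alpha = np.array([grad_i(x, i) for i in range(n)])
    alpha_bar = alpha.mean(axis=0)
    for _ in range(n_iter):
        i = rng.integers(n)
        g = grad_i(x, i)
        # Variance-reduced direction: grad f_i(x) - alpha_i + alpha_bar.
        x -= gamma * (g - alpha[i] + alpha_bar)
        # Keep alpha_bar consistent in O(d), then refresh the stored gradient.
        alpha_bar += (g - alpha[i]) / n
        alpha[i] = g
    return x
```

Note the memory cost: the table of stored gradients takes $O(nd)$ space (for linear models it can be reduced to $O(n)$, since each $\nabla f_i$ is a scalar multiple of $X_i$).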

SVRG

At iteration $k$ of the SVRG algorithm, we perform the following:

$$\begin{cases} \textrm{Compute the batch gradient } \nabla f(x_k)\\ \textrm{Initialise } y_0 = x_k\\ \textrm{for } m \textrm{ in } 0 \dots M{-}1: \begin{cases} i_m \textrm{ chosen randomly in } \{1, \dots, n\} \\ y_{m+1} \leftarrow y_{m} - \gamma\big(\nabla f_{i_m}(y_m) - \nabla f_{i_m}(x_k) + \nabla f(x_k)\big) \end{cases}\\ x_{k+1} \leftarrow y_{M} \end{cases}$$

Here,

$$\begin{cases} X = \nabla f_{i_m}(y_m) \\ Y = \nabla f_{i_m}(x_k) \\ \mathbb{E}[Y] = \frac{1}{n}\sum^n_{i=1}\nabla f_i(x_k) = \nabla f(x_k) \end{cases}$$
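A minimal NumPy sketch of this scheme, under the same conventions as the SAGA sketch above (full_grad(x) is assumed to return the batch gradient $\nabla f(x)$):

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, gamma=0.01, n_outer=30, M=None, rng=None):
    """Minimal SVRG sketch: one full gradient per outer iteration, M stochastic steps inside."""
    rng = np.random.default_rng() if rng is None else rng
    M = n if M is None else M          # inner-loop length, commonly taken of order n
    x = x0.copy()
    for _ in range(n_outer):
        mu = full_grad(x)              # batch gradient at the snapshot x_k
        y = x.copy()
        for _ in range(M):
            i = rng.integers(n)
            # Variance-reduced direction: grad f_i(y_m) - grad f_i(x_k) + grad f(x_k).
            y -= gamma * (grad_i(y, i) - grad_i(x, i) + mu)
        x = y                          # x_{k+1} <- y_M
    return x
```

Unlike SAGA, SVRG stores no gradient table, but it pays for this with a periodic full-gradient pass and an extra stochastic gradient evaluation per inner step.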

2. Comparison with other optimization algorithms

Let's see how these methods compare in terms of worst-case complexity, i.e. the number of gradient evaluations $k$ necessary to ensure $\mathbb{E}[f(x_k) - f^*] \le \epsilon$ (for the objective studied in Part II, the logistic loss is $L$-smooth and the $L_2$ regularisation term makes it $\mu$-strongly convex):

|  | Convex objective $(\mu = 0)$ | Strongly convex objective $(\mu > 0)$ |
|---|---|---|
| GD | $O\big(\frac{n}{\epsilon}\big)$ | $O\big(n\frac{L}{\mu}\log\frac{1}{\epsilon}\big)$ |
| SGD | $O\big(\frac{1}{\epsilon^2}\big)$ | $O\big(\frac{1}{\mu\epsilon}\big)$ |
| SAG, SAGA, SVRG | $O\big(\frac{\sqrt{n}}{\epsilon}\big)$ | $O\big(\big(n+\frac{L}{\mu}\big)\log\frac{1}{\epsilon}\big)$ |
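To make these rates concrete, take the (purely illustrative) values $n = 10^5$, $L/\mu = 10^2$, $\mu = 10^{-2}$ and $\epsilon = 10^{-4}$, so that $\log\frac{1}{\epsilon} \approx 9$. GD then needs roughly $n\frac{L}{\mu}\log\frac{1}{\epsilon} \approx 9\times 10^{7}$ gradient evaluations, SGD roughly $\frac{1}{\mu\epsilon} = 10^{6}$, and SAG/SAGA/SVRG roughly $\big(n+\frac{L}{\mu}\big)\log\frac{1}{\epsilon} \approx 9\times 10^{5}$. Tightening the target to $\epsilon = 10^{-6}$ multiplies the SGD count by $100$ but increases the variance-reduced count by a factor of only about $1.5$.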

Although SAG, SAGA and SVRG enjoy fast convergence rates, they are not always superior to SGD: SAG and SAGA must store a table of $n$ past gradients, SVRG periodically recomputes the full batch gradient, and during the first epochs plain SGD often makes comparable progress per gradient evaluation.

II. Implementation

We study the minimization of the empirical risk $R_n$ for a logistic regression model:

$$\min_{w\in \mathbb{R}^d} \ R_n(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) + r(w), \quad \textrm{where } \begin{cases} f_i(w) = \textrm{LogisticLoss}(y_i, w^T X_i) \\ r(w) = \lambda_1 \|w\|_1 + \frac{1}{2} \lambda_2 \|w\|_2^2 \end{cases}$$
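As a reference point, here is a minimal NumPy sketch of this objective and of the gradient of its smooth part, assuming labels $y_i \in \{-1, +1\}$ (the non-smooth $L_1$ term is typically handled with a proximal / soft-thresholding step rather than a plain gradient):

```python
import numpy as np

def empirical_risk(w, X, y, l1=0.02, l2=0.1):
    """R_n(w) for labels y in {-1, +1}: logistic loss + elastic-net penalty."""
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))
    return loss + l1 * np.abs(w).sum() + 0.5 * l2 * (w @ w)

def smooth_grad(w, X, y, l2=0.1):
    """Gradient of the smooth part (logistic loss + L2 term) only."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))   # per-sample factor -y_i / (1 + e^{y_i w^T X_i})
    return X.T @ coeff / len(y) + l2 * w
```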

What this implementation brings:

- a loading and preprocessing utility for the student dataset,
- a LogisticRegression class with GD, SGD, SAG, SAGA, SVRG and parallel hogwildSGD solvers,
- evaluation helpers: learning-curve plots and accuracy scores.

1. Data

datasets.load_student_data(file, split=0.25)

This function does all the preprocessing required on the dataset and splits it into training and testing data (split is the proportion of testing data).

from datasets import load_student_data

X_train, X_test, y_train, y_test = load_student_data('data/student-mat.csv', split=0.25)
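For illustration, here is a plausible sketch of what such a loader might do, assuming the semicolon-separated UCI Student Performance file, a pass/fail target derived from the final grade G3, and dropping the intermediate grades G1 and G2; the actual preprocessing in the datasets module may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def load_student_data_sketch(path, split=0.25):
    """Hypothetical preprocessing; the real load_student_data may differ."""
    df = pd.read_csv(path, sep=';')                     # UCI file is ';'-separated
    y = (df.pop('G3') >= 10).astype(int).values         # pass/fail label (assumption)
    X = pd.get_dummies(df.drop(columns=['G1', 'G2']))   # one-hot encode categoricals
    return train_test_split(X.values, y, test_size=split, random_state=0)
```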

2. Logistic Regression

class linear_model.LogisticRegression(solver='GD', l1=0.02, l2=0.1, max_iter=50)

Parameters:

- solver: one of 'GD', 'SGD', 'SAG', 'SAGA', 'SVRG' or 'hogwildSGD',
- l1: weight $\lambda_1$ of the $L_1$ regularisation term,
- l2: weight $\lambda_2$ of the $L_2$ regularisation term,
- max_iter: iteration budget (epochs for 'GD' and 'SVRG', individual stochastic steps for the other solvers).

Attributes:

- the fitted weight vector and the per-epoch empirical risk recorded during training (used by learning_curve below).

Methods:

- fit(X, y): train the model with the selected solver,
- score(X, y): return the classification accuracy on (X, y).

from my_lib.linear_model import LogisticRegression

epochs = 30
n = len(y_train)

clf0 = LogisticRegression(solver='GD', l1=0.02, l2=0.1, max_iter=epochs)
clf1 = LogisticRegression(solver='SGD', l1=0.02, l2=0.1, max_iter=epochs*n)
clf2 = LogisticRegression(solver='SAG', l1=0.02, l2=0.1, max_iter=epochs*n)
clf3 = LogisticRegression(solver='SAGA', l1=0.02, l2=0.1, max_iter=epochs*n)
clf4 = LogisticRegression(solver='SVRG', l1=0.02, l2=0.1, max_iter=epochs)

clf0.fit(X_train, y_train)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)
clf4.fit(X_train, y_train)

3. Evaluation

visuals.learning_curve(Name1=clf1, Name2=clf2)

This function takes any number of keyword arguments Name=clf, where Name is a display name and clf is a fitted classifier. It plots the evolution of the empirical risk with respect to the number of epochs.
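A minimal sketch of how such a helper could be written (the risk_history_ attribute used below is a hypothetical name for the per-epoch risk values recorded during fit, not necessarily the one used in my_lib):

```python
import matplotlib.pyplot as plt

def learning_curve_sketch(**classifiers):
    """Plot the empirical risk against epochs for any number of fitted models."""
    for name, clf in classifiers.items():
        # risk_history_ is a hypothetical attribute holding R_n after each epoch.
        plt.plot(clf.risk_history_, label=name)
    plt.xlabel("epoch")
    plt.ylabel("empirical risk")
    plt.legend()
    plt.show()
```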

from my_lib.visuals import learning_curve

learning_curve(GD=clf0, SGD=clf1, SAG=clf2, SAGA=clf3, SVRG=clf4)

(Figure: empirical risk vs. epochs for the GD, SGD, SAG, SAGA and SVRG solvers.)

for clf, name in zip([clf0, clf1, clf2, clf3, clf4], ['GD', 'SGD', 'SAG', 'SAGA', 'SVRG']):
    print(name + ':' \
          + "\tAccuracy on training set: {0:0.1f}%".format(100*clf.score(X_train, y_train)))
    print("\tAccuracy on testing set:  {0:0.1f}%".format(100*clf.score(X_test, y_test)))
GD:	Accuracy on training set: 90.5%
	Accuracy on testing set:  81.8%
SGD:	Accuracy on training set: 93.2%
	Accuracy on testing set:  87.9%
SAG:	Accuracy on training set: 93.9%
	Accuracy on testing set:  87.9%
SAGA:	Accuracy on training set: 93.9%
	Accuracy on testing set:  87.9%
SVRG:	Accuracy on training set: 93.9%
	Accuracy on testing set:  87.9%

4. Parallelization

The benefit of parallelization on a personal computer and for a relatively small dataset is not obvious: in this setting, the synchronisation overhead is enough to make the parallel version of SGD slower than the serial one.

Note 1: The number of jobs is chosen automatically depending on the number of detected CPUs.

Note 2: The hogwildSGD solver uses the joblib library.
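For reference, here is a minimal sketch of a Hogwild!-style parallel SGD loop built on joblib threads (illustrative only: the function name, grad_i and the worker layout are assumptions, not the my_lib implementation):

```python
import numpy as np
from joblib import Parallel, delayed

def hogwild_sgd(grad_i, w, n_samples, gamma=0.01, epochs=30, n_workers=4):
    """Hogwild!-style SGD sketch: threads update the shared vector w without locks."""
    steps_per_worker = epochs * n_samples // n_workers

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_worker):
            i = rng.integers(n_samples)
            # Unsynchronised read-modify-write on shared memory: the core of Hogwild!.
            w[:] -= gamma * grad_i(w, i)

    Parallel(n_jobs=n_workers, require="sharedmem")(
        delayed(worker)(seed) for seed in range(n_workers)
    )
    return w
```

On a dataset this small, thread-management overhead easily dominates the per-step work, which is consistent with the timings below.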

from my_lib.tools import Time

epochs = 30
n = len(y_train)

clf5 = LogisticRegression(solver='SGD', l1=0.02, l2=0.1, max_iter=epochs*n)
clf6 = LogisticRegression(solver='hogwildSGD', l1=0.02, l2=0.1, max_iter=epochs*n)

with Time() as runtime:
    clf5.fit(X_train, y_train)
print("Computation time of serial SGD {0:0.3f} seconds".format(runtime()))

with Time() as runtime:
    clf6.fit(X_train, y_train)
print("Computation time of parallel SGD {0:0.3f} seconds".format(runtime()))

learning_curve(SGD=clf5, parallelSGD=clf6)
Computation time of serial SGD 0.497 seconds
Detected CPU:  4
Computation time of parallel SGD 1.167 seconds

(Figure: empirical risk vs. epochs for serial SGD and parallel SGD.)
