We consider the unconstrained optimization problem of minimizing a finite sum of $L$-smooth convex functions, such as an empirical risk:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w).$$
In a large dataset, there is often redundancy among the terms of the sum. The gradient $\nabla f_i$ of a single term can therefore be a good approximation of the batch gradient $\nabla F$. The SGD method takes advantage of this property to reduce the per-iteration complexity of descent algorithms: at each step, it follows $-\nabla f_{i_k}(w_k)$ for a uniformly drawn index $i_k$ instead of the full gradient. One of the limits of SGD, however, is the often large variance of the gradient estimates. Is it possible to reduce the variance of the gradient estimates used at each step?
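To make the cost trade-off concrete, here is a minimal sketch of SGD on a least-squares finite sum (a toy example, not the library's implementation; all names are illustrative):

```python
import numpy as np

def sgd_least_squares(A, b, step=0.01, n_iters=1000, seed=0):
    """Minimize F(w) = (1/n) * sum_i (a_i @ w - b_i)^2 with one-sample gradient steps."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)                      # draw one index uniformly
        grad_i = 2.0 * (A[i] @ w - b[i]) * A[i]  # gradient of the i-th term only
        w -= step * grad_i                       # O(d) work per step, vs O(n d) for GD
    return w
```

Each step is $n$ times cheaper than a full-gradient step, at the price of a noisy descent direction.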
Variance reduction methods that use gradient aggregation are based on the following observation:
Consider two random variables $X$ and $Y$, and define $Z = X - Y + \mathbb{E}[Y]$. We have:

- $\mathbb{E}[Z] = \mathbb{E}[X]$, therefore $Z$ is an unbiased estimate of $\mathbb{E}[X]$.
- $\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$, therefore, if $X$ and $Y$ are positively correlated and if $\mathrm{Cov}(X, Y)$ is big enough, $\mathrm{Var}(Z)$ is reduced compared to $\mathrm{Var}(X)$.

Take $X = \nabla f_{i_k}(w_k)$, the stochastic gradient used by SGD at iteration $k$.

A consequence of this observation is that, if we find a $Y$ that is positively correlated with $\nabla f_{i_k}(w_k)$, we can build a new stochastic vector $g_k = \nabla f_{i_k}(w_k) - Y + \mathbb{E}[Y]$ which still equals the batch gradient in expectation and which has a smaller variance than $\nabla f_{i_k}(w_k)$.

We have access to such a $Y$: indeed, it is likely that from one iteration to the next, the descent directions do not differ much. It follows that a gradient of $f_{i_k}$ evaluated at a previous iterate is likely to be positively correlated with $\nabla f_{i_k}(w_k)$.
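A quick numerical check of this observation (a toy example, not part of the library): we compare the variance of the plain stochastic gradient $X = \nabla f_i(w_k)$ with that of the corrected vector $Z = X - Y + \mathbb{E}[Y]$, where $Y = \nabla f_i(w_{\mathrm{prev}})$ is the gradient of the same term at a nearby previous iterate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_term(w, i):
    """Gradient of the i-th term f_i(w) = (a_i @ w - b_i)^2."""
    return 2.0 * (A[i] @ w - b[i]) * A[i]

w_k = rng.normal(size=d)                   # current iterate
w_prev = w_k + 0.01 * rng.normal(size=d)   # nearby previous iterate
mean_Y = np.mean([grad_term(w_prev, i) for i in range(n)], axis=0)  # E[Y]

X = np.array([grad_term(w_k, i) for i in range(n)])                            # realizations of X
Z = np.array([grad_term(w_k, i) - grad_term(w_prev, i) for i in range(n)]) + mean_Y

print("total variance of X:", X.var(axis=0).sum())  # large
print("total variance of Z:", Z.var(axis=0).sum())  # orders of magnitude smaller
```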
SVRG and SAGA are two different ways of building such a $Y$.
At iteration $k$ of the SAGA algorithm, the following update is executed:

$$w_{k+1} = w_k - \gamma \left[ \nabla f_{i_k}(w_k) - \nabla f_{i_k}(\phi_{i_k}) + \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\phi_i) \right],$$

where the $\nabla f_i(\phi_i)$ are gradients evaluated at previous iterates $\phi_i$ and stored in a table, and $\phi_{i_k}$ is then updated to $w_k$.

Here $Y = \nabla f_{i_k}(\phi_{i_k})$, whose expectation over the uniformly drawn index $i_k$ is the stored average $\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\phi_i)$.
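A minimal sketch of this update on the least-squares toy problem from above (illustrative names, not the library's code):

```python
import numpy as np

def saga_least_squares(A, b, step=0.001, n_iters=5000, seed=0):
    """SAGA on F(w) = (1/n) * sum_i (a_i @ w - b_i)^2, with zero-initialized memory."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    table = np.zeros((n, d))      # one stored gradient per data point
    avg = table.mean(axis=0)      # running average of the stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = 2.0 * (A[i] @ w - b[i]) * A[i]  # fresh gradient of term i
        g_old = table[i].copy()                 # stored gradient of term i
        w -= step * (g_new - g_old + avg)       # variance-reduced step
        avg += (g_new - g_old) / n              # keep the average up to date in O(d)
        table[i] = g_new
    return w
```

Note that SAGA never recomputes a full batch gradient; the price is the $O(nd)$ memory for the gradient table.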
At iteration $k$ of the SVRG algorithm, we perform the following: first, compute the full gradient $\nabla F(\tilde{w})$ at a snapshot point $\tilde{w} = w_k$; then, for $t = 0, \dots, m-1$, execute the inner update

$$w_{t+1} = w_t - \gamma \left[ \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \nabla F(\tilde{w}) \right],$$

and set $w_{k+1} = w_m$.

Here $Y = \nabla f_{i_t}(\tilde{w})$ and $\mathbb{E}[Y] = \nabla F(\tilde{w})$.
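The corresponding sketch (again illustrative, with an inner loop of length $m = n$):

```python
import numpy as np

def svrg_least_squares(A, b, step=0.001, n_epochs=30, seed=0):
    """SVRG on F(w) = (1/n) * sum_i (a_i @ w - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        w_snap = w.copy()                             # snapshot point
        full_grad = 2.0 * A.T @ (A @ w_snap - b) / n  # batch gradient at the snapshot
        for _ in range(n):                            # inner loop, m = n
            i = rng.integers(n)
            g_cur = 2.0 * (A[i] @ w - b[i]) * A[i]         # gradient at current point
            g_snap = 2.0 * (A[i] @ w_snap - b[i]) * A[i]   # gradient at the snapshot
            w -= step * (g_cur - g_snap + full_grad)       # variance-reduced step
    return w
```

SVRG trades SAGA's $O(nd)$ gradient table for one full-gradient pass per epoch.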
Let's see how these methods compare in terms of expected worst-case complexity, i.e. the number of gradient evaluations necessary to ensure $\mathbb{E}[F(w_k)] - F(w^\star) \leq \epsilon$ (note that our objective is $\mu$-strongly convex thanks to the $\ell_2$ regularisation term; $\kappa = L/\mu$ denotes the condition number):
| | Convex objective | Strongly convex objective |
|---|---|---|
| GD | $O(n/\epsilon)$ | $O(n \kappa \log(1/\epsilon))$ |
| SGD | $O(1/\epsilon^2)$ | $O(\kappa/\epsilon)$ |
| SAG, SAGA, SVRG | $O(n/\epsilon)$ | $O((n + \kappa)\log(1/\epsilon))$ |
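To see why the fast rates do not automatically win, here is a back-of-the-envelope comparison of the strongly convex column (the problem sizes are hypothetical):

```python
import numpy as np

def gradient_evals(n, kappa, eps):
    """Strongly convex complexities from the table above, up to constants."""
    log_term = np.log(1.0 / eps)
    return {"GD": n * kappa * log_term,
            "SGD": kappa / eps,
            "SAG/SAGA/SVRG": (n + kappa) * log_term}

# Low accuracy on a big dataset: SGD needs by far the fewest evaluations.
print(gradient_evals(n=1_000_000, kappa=100.0, eps=1e-1))
# High accuracy: the variance-reduced methods take over.
print(gradient_evals(n=1_000_000, kappa=100.0, eps=1e-8))
```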
Although SAG, SAGA and SVRG present fast rates of convergence, they are not always superior to SGD:
We study the minimization of the empirical risk for a logistic regression model:

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^T w}\right) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|_2^2.$$
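For reference, a sketch of the smooth part of this objective and its gradient (illustrative helpers, not the library's internals; labels are assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def smooth_risk(w, X, y, l2=0.1):
    """Mean logistic loss plus (l2/2) * ||w||^2; the l1 term is left to a proximal step."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * l2 * (w @ w)

def smooth_grad(w, X, y, l2=0.1):
    """Gradient of the smooth part of the objective."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # y_i times d/dm of log(1 + e^{-m})
    return X.T @ coeffs / len(y) + l2 * w
```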
What this implementation brings: a small library whose interface is modeled on Sklearn's.

`datasets.load_student_data(file, split=0.25)`

This function does all the preprocessing required on the dataset and splits it into training and testing data (`split` is the proportion of testing data).
```python
from datasets import load_student_data

X_train, X_test, y_train, y_test = load_student_data('data/student-mat.csv', split=0.25)
```
`class linear_model.LogisticRegression(solver='GD', l1=0.02, l2=0.1, max_iter=50)`

Parameters:

- `l1` and `l2`: used to specify the $\ell_1$ and $\ell_2$ regularisation penalties
- `solver`: optimization method, to be chosen among `'GD'`, `'SGD'`, `'SAG'`, `'SAGA'`, `'SVRG'`
- `max_iter`: number of iterations (i.e. number of descent steps). Note that for GD or SVRG one iteration is one epoch, i.e. one pass through the data, while for SGD, SAG and SAGA one iteration uses only one data point.

Attributes:

- `coef_`: coefficients of the features in the decision function.

Methods:

- `fit(X, y)`: fit the model according to the given training data.
- `predict(X)`: predict class labels for samples in X.
- `decision_function(X)`: predict confidence scores for samples in X.
- `score(X, y)`: return the mean accuracy on the given test data and labels.
```python
from my_lib.linear_model import LogisticRegression

epochs = 30
n = len(y_train)

clf0 = LogisticRegression(solver='GD', l1=0.02, l2=0.1, max_iter=epochs)
clf1 = LogisticRegression(solver='SGD', l1=0.02, l2=0.1, max_iter=epochs*n)
clf2 = LogisticRegression(solver='SAG', l1=0.02, l2=0.1, max_iter=epochs*n)
clf3 = LogisticRegression(solver='SAGA', l1=0.02, l2=0.1, max_iter=epochs*n)
clf4 = LogisticRegression(solver='SVRG', l1=0.02, l2=0.1, max_iter=epochs)

clf0.fit(X_train, y_train)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)
clf4.fit(X_train, y_train)
```
`visuals.learning_curve(Name1=clf1, Name2=clf2)`

This function takes any number of keyword arguments `Name=clf`, where `Name` is a display name and `clf` is a fitted classifier. It plots the evolution of the empirical risk with respect to the number of epochs.
```python
from my_lib.visuals import learning_curve

learning_curve(GD=clf0, SGD=clf1, SAG=clf2, SAGA=clf3, SVRG=clf4)
```
```python
for clf, name in zip([clf0, clf1, clf2, clf3, clf4],
                     ['GD', 'SGD', 'SAG', 'SAGA', 'SVRG']):
    print(name + ':'
          + "\tAccuracy on training set: {0:0.1f}%".format(100 * clf.score(X_train, y_train)))
    print("\tAccuracy on testing set: {0:0.1f}%".format(100 * clf.score(X_test, y_test)))
```
```
GD:     Accuracy on training set: 90.5%
        Accuracy on testing set: 81.8%
SGD:    Accuracy on training set: 93.2%
        Accuracy on testing set: 87.9%
SAG:    Accuracy on training set: 93.9%
        Accuracy on testing set: 87.9%
SAGA:   Accuracy on training set: 93.9%
        Accuracy on testing set: 87.9%
SVRG:   Accuracy on training set: 93.9%
        Accuracy on testing set: 87.9%
```
On a personal computer and with a relatively small dataset, the advantage of parallelization is not obvious. In fact, in this setting, the synchronisation overhead is enough to make the parallel version of SGD slower than the serial one.
Note 1: the number of jobs is chosen automatically depending on the number of detected CPUs.

Note 2: the `hogwildSGD` solver uses the joblib library.
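For intuition, here is a minimal sketch of the Hogwild idea behind this solver: several workers run SGD on a shared parameter vector without any locking, so updates may use slightly stale reads. This is a toy illustration with Python threads (not the library's joblib-based implementation, and the GIL limits any real speedup here):

```python
import numpy as np
from threading import Thread

def hogwild_sgd(A, b, step=0.001, n_iters=5000, n_workers=4):
    """Lock-free SGD on F(w) = (1/n) * sum_i (a_i @ w - b_i)^2."""
    n, d = A.shape
    w = np.zeros(d)   # shared state, updated without locks

    def worker(w, seed):
        rng = np.random.default_rng(seed)
        for _ in range(n_iters // n_workers):
            i = rng.integers(n)
            g = 2.0 * (A[i] @ w - b[i]) * A[i]  # may read a slightly stale w
            w -= step * g                       # racy in-place update of the shared array

    threads = [Thread(target=worker, args=(w, s)) for s in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```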
```python
from my_lib.tools import Time

epochs = 30
n = len(y_train)

clf5 = LogisticRegression(solver='SGD', l1=0.02, l2=0.1, max_iter=epochs*n)
clf6 = LogisticRegression(solver='hogwildSGD', l1=0.02, l2=0.1, max_iter=epochs*n)

with Time() as runtime:
    clf5.fit(X_train, y_train)
print("Computation time of serial SGD {0:0.3f} seconds".format(runtime()))

with Time() as runtime:
    clf6.fit(X_train, y_train)
print("Computation time of parallel SGD {0:0.3f} seconds".format(runtime()))

learning_curve(SGD=clf5, parallelSGD=clf6)
```
```
Computation time of serial SGD 0.497 seconds
Detected CPU: 4
Computation time of parallel SGD 1.167 seconds
```