Effective Comparison of Unconstrained Optimization Algorithms

Burcu Koca
Mar 14, 2022

Most machine learning problems are, ultimately, optimization problems. In this blog we will review and introduce some optimization algorithms. Let’s start with the definitions and a clear explanation.

Optimization and Machine Learning

The basic idea behind much of machine learning is to build models that map input data to output data. We derive the parameters of these models by training on given data. We can view the whole of machine learning as a process of finding the optimal model for the given data.

Optimization:

In general, the task of optimization is to maximize or minimize a function F(x). This function is called the objective, cost, or loss function. We will look at the unconstrained problem: find x that minimizes F(x) with x ∈ R, that is, with no constraint on x.

Maxima & Minima

Points where F’(x) = 0 are called critical points. At a critical point:

If F’’(x) < 0, it is a local maximum.

If F’’(x) > 0, it is a local minimum.

If F’’(x) = 0, the test is inconclusive; the point could be a saddle point.

The lowest and highest values of F(x) over the whole domain are called the global minimum and global maximum.

These definitions are for scalar x.
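As a quick scalar example, take F(x) = x³ − 3x: F’(x) = 3x² − 3 = 0 at x = ±1, and F’’(1) = 6 > 0 makes x = 1 a local minimum, while F’’(−1) = −6 < 0 makes x = −1 a local maximum.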

For multivariate x, we analyze critical points with the Hessian matrix. The Hessian is the square matrix of second-order partial derivatives of a scalar-valued function.
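To make the multivariate test concrete, here is a minimal sketch (my addition, not from the original post) that classifies a critical point by the signs of the Hessian’s eigenvalues using NumPy:

import numpy as np

def classify_critical_point(hessian):
    # Eigenvalues of the (symmetric) Hessian matrix
    eigvals = np.linalg.eigvalsh(hessian)
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point"
    return "inconclusive (some eigenvalues are zero)"

# Example: Hessian of F(x, y) = x**2 - y**2 at its critical point (0, 0)
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point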

This blog provides a basic theoretical understanding of unconstrained optimization methods and also includes Python implementations of them.

Some Optimization Methods

Newton-Raphson Method:

Also known as Newton’s method, it starts from an initial guess x[0] and uses information about the function and its derivatives at that point to find a better guess. The assumption behind this method is that our target function f(x), the one we want to optimize, is twice differentiable and that f’’(x) (in the multivariate case, the Hessian) is not zero, so it can be inverted. Here, we will be working with a smooth polynomial.
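Concretely, in the multivariate case the update we will implement below is X_new = X − H(X)⁻¹ ∇f(X), where ∇f is the gradient and H is the Hessian of f evaluated at the current point X.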

Let’s code the algorithm now:

import numpy as np

def example(x, y):
    # Objective: f(x, y) = 2x^3 + 4xy^3 - 10xy + y^3
    return 2*x**3 + 4*x*y**3 - 10*x*y + y**3

def Grad_ex(x, y):
    # First-order partial derivatives of the objective
    g1 = 6*x**2 + 4*y**3 - 10*y
    g2 = 12*x*y**2 - 10*x + 3*y**2
    return np.array([g1, g2])

def Hessian_ex(x, y):
    # Second-order partial derivatives of the objective
    h11 = 12*x
    h12 = 12*y**2 - 10
    h21 = 12*y**2 - 10
    h22 = 24*x*y + 6*y
    return np.array([[h11, h12], [h21, h22]])

Here we define the example function, its gradient (the first-order derivatives at a point), and its Hessian (the second-order derivatives at a point).
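Before running the optimizer, it can help to sanity-check the hand-written derivatives. Here is a small sketch (my addition, not part of the original code) that compares Grad_ex from the block above against central finite differences of example:

def numerical_gradient(f, x, y, h=1e-6):
    # Central finite differences of f at (x, y)
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([gx, gy])

print(Grad_ex(5, 5))                      # analytical gradient at (5, 5)
print(numerical_gradient(example, 5, 5))  # should closely match the analytical values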

def Newton_Raphson_Optimize(Grad, Hess, x, y, epsilon=0.01, nMax=100):
    # Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 10  # any value larger than epsilon, so the loop starts
    X = np.array([x, y])

    # Loop as long as the error is greater than epsilon
    while np.linalg.norm(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, y)
        iter_count = np.append(iter_count, i)

        X_prev = X
        # Newton step: X_new = X - H(X)^(-1) @ grad(X)
        X = X - np.linalg.inv(Hess(x, y)) @ Grad(x, y)
        error = X - X_prev
        x, y = X[0], X[1]
        print(X)                # current iterate
        print(iter_count[-1:])  # current iteration number
    return X, iter_x, iter_y, iter_count

root, iter_x, iter_y, iter_count = Newton_Raphson_Optimize(Grad_ex, Hessian_ex, 5, 5)

We added an nMax (maximum number of iterations) parameter so that the algorithm doesn’t run indefinitely if it doesn’t converge. We used [5, 5] as the initial point.

We get this result:

[3.61231102 3.21814255]
[1.]
[25.41905575 -6.62369213]
[2.]
[13.38091735 -4.89780992]
[3.]
[ 7.10299606 -3.62111505]
[4.]
[ 3.75262886 -2.69425608]
[5.]
[ 1.90126539 -2.04368383]
[6.]
[ 0.80459909 -1.61900739]
[7.]
[ 0.02642911 -1.41327176]
[8.]
[-0.55684802 -1.60376387]
[9.]
[-0.36640994 -1.60978734]
[10.]
[-0.36631292 -1.62004825]
[11.]
[-0.36632721 -1.61995635]
[12.]

The algorithm converged in only 12 iterations!

Newton’s Method w/ 12 iterations

Gradient Descent:

If you have worked with machine learning models, you’ve probably seen this one already. Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.

Step 1: Pick a random point w in the domain of the function; this is the starting point.

Step 2: While the gradient hasn’t converged:

Step 2a: Compute the negative gradient at w, i.e., the direction opposite to the gradient.

Step 2b: Update w using the result of 2a, and go back to Step 2.

Step 3: Success, you’ve found a (local) minimum!

The goal of the gradient descent algorithm is to minimize the given function (say, a cost function). To achieve this goal, it performs two steps iteratively:

1. Compute the gradient (slope), the first-order derivative of the function at the current point.

2. Make a step (move) in the direction opposite to the gradient, i.e., against the direction in which the slope increases, by alpha (the step size) times the gradient at that point, as written out below.
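In symbols, with step size gamma (the alpha above), the update at each iteration is X_new = X − gamma * ∇F(X); this is exactly what the line X = X - gamma * Grad(x, y) does in the code below.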

The algorithm becomes:

def Gradient_Descent(Grad, x, y, gamma=0.00125, epsilon=0.01, nMax=1000000):
    # Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 10  # any value larger than epsilon, so the loop starts
    X = np.array([x, y])

    # Loop as long as the error is greater than epsilon
    while np.linalg.norm(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, y)
        iter_count = np.append(iter_count, i)

        X_prev = X
        # Gradient step: move against the gradient, scaled by the step size gamma
        X = X - gamma * Grad(x, y)
        error = X - X_prev
        x, y = X[0], X[1]

    print(X)
    return X, iter_x, iter_y, iter_count

root, iter_x, iter_y, iter_count = Gradient_Descent(Grad_ex, 5, 5)

You can tweak gamma (the learning rate), epsilon (the convergence tolerance), and nMax (the maximum number of iterations). We used the same initial point, [5, 5], as with Newton’s method.

This is the result:

[1.51632092 0.86177917]
GD Method w/ 71 iterations

Gradient descent took 71 iterations whereas Newton’s method took only 12!

Newton’s method places stronger differentiability requirements on the function than gradient descent. If the second derivative of the function is undefined at the point we are searching for, we can still apply gradient descent, but not Newton’s method. Gradient Descent is a powerful algorithm, but in real-world scenarios with ever-growing datasets it has two major limitations:

  • Calculating derivatives for the entire dataset is time-consuming,
  • The memory required is proportional to the size of the dataset.

These computations are typically done in memory, so the larger the dataset, the greater the memory requirements. This is where Stochastic Gradient Descent comes in!

Stochastic Gradient Descent (SGD):

Stochastic, in plain terms, means “random”. SGD addresses the limitations of Gradient Descent above and performs much better on large-scale datasets.

SGD cycles through the training data, choosing which sample index i[k] to use at step k in one of several ways (here shown for samples indexed 5 to 8):

  1. Cyclic rule (incremental gradient descent): i[k] = k mod N, e.g. i[k] = 5, 6, 7, 8, 5, 6, 7, 8, …
  2. Random permutation (reshuffled every N rounds): e.g. i[k] = 6, 7, 8, 5, 7, 6, 8, 5, 8, 6, 7, 5, …
  3. Stochastic gradient descent (chosen uniformly at random): i[k] ~ Uniform{5, 6, 7, 8, …, N}, e.g. i[k] = 7, 6, 8, 6, 7, 8, 8, 6, 7, …

The weights are updated at each iteration using the gradient of the single sampled term; on average, this gives the full gradient.
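In symbols, for an objective that is an average over N training samples, F(w) = (1/N) Σ f_i(w), each SGD step uses only the gradient of the sampled term: w_new = w − gamma * ∇f_{i[k]}(w). Because i[k] is chosen (roughly) uniformly, the expected step direction equals the full gradient ∇F(w), which is why SGD still makes progress on average.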

SGD in Python:

def stochastic_gradient_descent(x_1, x_2, learning_rate=0.01, iterations=50):
    # x_1, x_2 are arrays of candidate starting values
    n = len(x_1)
    for i in range(iterations):
        # pick a random value for x1 and x2
        rand_ind = np.random.randint(0, n)
        rand_ind2 = np.random.randint(0, n)
        x_1_i = x_1[rand_ind]
        x_2_i = x_2[rand_ind2]
        # update x_1_i, x_2_i using the learning rate and the partial derivatives
        x_1_i = x_1_i - learning_rate*(6*x_1_i**2 + 4*x_2_i**3 - 10*x_2_i)
        x_2_i = x_2_i - learning_rate*(12*x_1_i*x_2_i**2 - 10*x_1_i + 3*x_2_i**2)
        # f is the function to be minimized
        f = 2*x_1_i**3 + 4*x_1_i*x_2_i**3 - 10*x_2_i*x_1_i + x_2_i**3
        # keep the x1, x2 values for which the function value falls between 0 and 1
        if 0 <= f <= 1:
            x1.append(x_1_i)
            x2.append(x_2_i)
        # record the function value at every iteration
        f_function_history.append(f)
    return f_function_history

x_1 = np.arange(1, 20)
x_2 = np.arange(1, 20)
f_function_history = []
x1 = []
x2 = []
stochastic_gradient_descent(x_1, x_2)

As you can see, the function value did not decrease steadily the way it does with ordinary Gradient Descent; sometimes it increased and sometimes it decreased, because the points are selected randomly.
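If you want to visualize this noisy behaviour yourself, a minimal sketch (my addition; the original plot was produced separately) is to plot the recorded history with matplotlib:

import matplotlib.pyplot as plt

# f_function_history was filled by the stochastic_gradient_descent call above
plt.plot(f_function_history)
plt.xlabel("Iteration")
plt.ylabel("f value")
plt.title("SGD: function value per iteration")
plt.show()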

Comments

Looking at recent trends in unconstrained optimization, we can see that all of the methods considered in this blog are still in active use.

We’ve talked about two types of optimization algorithms, Newton’s method and Gradient Descent, for minimizing the objective function.

Let’s summarize some of their properties.

  1. The first difference: each Newton iteration is more complex than a GD iteration, since it requires computing and inverting the Hessian, while GD does not require a matrix inversion.

2. GD needs a choice of learning rate, whereas pure Newton’s method has no step-size parameter to set.

3. GD needs more iterations, whereas Newton’s method needs fewer iterations.

4. Every GD iteration is cheaper than a Newton iteration, so if n (the number of features) is small, Newton’s method may be preferable, and if n is large, the much lower cost per iteration makes GD preferable. (In some cases, SGD can reduce computation time even further and may escape local minima more easily, so SGD is often preferred over full-batch Gradient Descent for optimizing a learning algorithm.)


Burcu Koca

Industrial Engineer, interested in Machine Learning, Artificial Intelligence, and Data Science.