Using the Gradient Descent Algorithm in Machine Learning

Manish Tongia
Published in Heartbeat · 6 min read · Nov 8, 2021

Optimisation is an important aspect of machine learning: nearly every machine learning algorithm incorporates an optimisation algorithm. Gradient descent is an iterative optimisation algorithm that finds a minimum of an objective function — a local minimum in general, and the global minimum when the function is convex. Its variants trade off accuracy against computation time, and it is widely used for function minimisation in machine learning.

Gradient descent isn’t easy to understand if you’re just getting started. The purpose of this article is to elaborate on gradient descent and give you an understanding of how it works.

In this article, we will quickly understand what a cost function is, the basics of gradient descent, how to choose the learning parameter, and how overshooting affects gradient descent. Let’s get started!

What is a Cost Function?

A cost function is a function that measures the performance of a model against a given set of data: it quantifies the error between predicted and expected values and expresses that error as a single number.

To evaluate the cost function, we form a hypothesis with initial parameters. Then, aiming to reduce the cost, we apply the gradient descent algorithm over the data to modify those parameters. This is its mathematical representation:
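The formula image from the original post is missing here. For a regression hypothesis h_θ(x) over m training examples, a commonly used cost function is the mean squared error:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2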

What is Gradient Descent?

This is the million-dollar question!

Imagine you are playing a game in which players are at the top of a mountain and they have to get to the lowest point, the lake, on the mountain. In addition, the players are blindfolded. What is the best approach to reach the lake?

Think about this for a moment before reading on.

To determine where the land descends, the best approach is to feel the ground and find where it slopes downward. Then take a step downhill, and keep iterating this process until you reach the lowest point.

Finding the lowest point in a hilly landscape.

Gradient descent is an algorithm for finding a local minimum of a function through iterative optimisation. It works by taking steps proportional to the negative of the gradient (moving against the gradient) at the current point. Taking steps in the direction of the gradient instead makes the function approach a local maximum; that procedure is called gradient ascent.

Gradient descent was first proposed by Cauchy in 1847. It is also known as steepest descent.

A gradient descent algorithm aims to minimize a given function (like cost) and the algorithm iteratively takes two steps to achieve this objective:

  1. Compute the gradient (slope), i.e. the first-order derivative of the function at the current point.
  2. Take a step (move) in the direction opposite to the gradient, moving by an amount alpha times the gradient at that point.

This yields the update rule θ ← θ − α · ∇J(θ). Alpha (α) is called the learning rate; it is a tuning parameter for the optimisation that determines how large each step is.
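The two steps above can be sketched on a toy function (the function and values here are illustrative, not from the original article). Gradient descent on f(x) = x², whose derivative is 2x, walks from the starting point down to the minimum at x = 0:

```python
alpha = 0.1   # learning rate
x = 2.0       # starting point

for _ in range(100):
    grad = 2 * x          # step 1: first-order derivative of f(x) = x**2
    x = x - alpha * grad  # step 2: move opposite to the gradient

# x is now very close to the minimum at 0
```

Each iteration multiplies x by (1 − 2·alpha) = 0.8, so the iterate shrinks geometrically toward the minimum.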


Gradient Descent Algorithm Plotting

If we have one parameter (theta), we can graph the cost along the y-axis and theta along the x-axis. If there are two parameters, we can use a 3-D plot, with cost on one axis and the two parameters (thetas) on the other two.

cost along z-axis and parameters(thetas) along x-axis and y-axis
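As a sketch of how such a plot could be produced in the one-parameter case, the following computes the cost over a grid of theta values for a made-up dataset generated with theta = 2 (the data and numbers are illustrative, not from the article); the resulting (thetas, costs) pairs are exactly what one would hand to a plotting library:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X                        # toy labels generated with theta = 2

thetas = np.linspace(0.0, 4.0, 81)                     # grid along the x-axis
costs = [np.mean((t * X - y) ** 2) / 2 for t in thetas]  # cost along the y-axis

best = thetas[int(np.argmin(costs))]  # grid point with the lowest cost
```

Plotting `costs` against `thetas` produces the bowl-shaped curve described above, with its bottom at the true parameter value.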

You can also visualize it using contours: a two-dimensional representation of a three-dimensional plot, with the parameters along both axes and the response shown as rings. The response has the same value everywhere along a given ring and increases with distance from the center.

Gradient descent using Contour Plot

Alpha — The Learning Rate

We know which direction we want to take, now we need to decide the size of the step we should take.

Note: A careful selection is required to reach local minima.

  • We might overshoot the minima and bounce without reaching it if the learning rate is too high.
  • A small learning rate could cause training to be too lengthy.

In the above graphs:

  a) The learning rate is optimal, and the minimum is reached.
  b) The learning rate is too low: it takes more time to converge to the minimum.
  c) The learning rate overshoots the optimal value, but then converges (1/C < η < 2/C).
  d) The learning rate is very high: it overshoots and diverges, moving away from the minimum, and performance declines as a result.

Note: As you move towards the local minima, the gradient decreases and so does the step size. Therefore, the learning rate (alpha) can remain constant over the optimisation process and does not need to be varied iteratively.
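The effect of the learning rate can be sketched on f(x) = x² (an illustrative function, not from the article). For this function the update is x ← (1 − 2α)x, so the iteration converges only when 0 < α < 1; a larger alpha makes each step overshoot farther than the last:

```python
def gd(alpha, steps=50, x=1.0):
    """Run gradient descent on f(x) = x**2 from a fixed start."""
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient of f(x) = x**2 is 2x
    return x

small = gd(0.1)   # |1 - 2*0.1| = 0.8 < 1: converges toward 0
big = gd(1.1)     # |1 - 2*1.1| = 1.2 > 1: overshoots and diverges
```

With alpha = 0.1 the iterate shrinks to nearly zero; with alpha = 1.1 its magnitude grows by 20% per step and the optimisation blows up, matching case (d) above.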

Local Minima

Cost functions may contain many local minima. Depending on the initial point (i.e. the initial parameters, theta) and the learning rate, gradient descent may settle on any of them. Therefore, the optimisation may converge to different points for different learning rates and starting points.

Cost function convergence with varying starting points
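A small sketch makes this concrete. The function f(x) = x⁴ − 3x² + x (an illustrative non-convex function, not from the article) has two local minima, near x ≈ −1.30 and x ≈ 1.13, and gradient descent settles in a different one depending on where it starts:

```python
def f_grad(x):
    # derivative of f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, alpha=0.01, steps=2000):
    for _ in range(steps):
        x = x - alpha * f_grad(x)
    return x

left = descend(-2.0)   # settles in the minimum near x = -1.30
right = descend(2.0)   # settles in the minimum near x = 1.13
```

The left minimum happens to be the deeper (global) one here, yet a run started at x = 2 never finds it: plain gradient descent only follows the slope downhill from its starting point.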

Gradient Descent Implemented in Python

def train(X, y, W, B, alpha, max_iters):
    '''Performs GD on all training examples.
    X: Training data set,
    y: Labels for training data,
    W: Weights vector,
    B: Bias variable,
    alpha: The learning rate,
    max_iters: Maximum GD iterations.'''
    m = X.shape[0]            # No. of training examples
    for i in range(max_iters):
        dW = 0                # Weights gradient accumulator
        dB = 0                # Bias gradient accumulator
        for j in range(m):
            # 1. Iterate over all examples,
            # 2. Compute the gradients of the weights and bias
            #    (here assuming a linear model: prediction = X[j].dot(W) + B),
            # 3. Accumulate them in dW and dB.
            error = X[j].dot(W) + B - y[j]
            dW += error * X[j]
            dB += error
        W = W - alpha * (dW / m)  # Update the weights
        B = B - alpha * (dB / m)  # Update the bias
    return W, B   # Return the updated weights and bias
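As a quick sanity check, here is a self-contained vectorized sketch of the same update rule, assuming a linear model (predictions X·W + B, which is one way to fill in the commented gradient steps), fitted on made-up data with a known weight and bias:

```python
import numpy as np

def train_vectorized(X, y, W, B, alpha, max_iters):
    """Batch gradient descent for a linear model y ≈ X.dot(W) + B."""
    m = X.shape[0]
    for _ in range(max_iters):
        error = X.dot(W) + B - y      # predictions minus labels, all examples at once
        dW = X.T.dot(error)           # accumulated weight gradients
        dB = error.sum()              # accumulated bias gradient
        W = W - alpha * (dW / m)      # update the weights
        B = B - alpha * (dB / m)      # update the bias
    return W, B

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0               # noiseless data: true weight 3, true bias 1

W, B = train_vectorized(X, y, W=np.zeros(1), B=0.0, alpha=0.5, max_iters=500)
# W and B converge to the true parameters (3.0 and 1.0)
```

Because the data is noiseless, the cost has a single minimum at the generating parameters, and the recovered W and B should match them closely.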

In the next article, we will discuss the Python implementation of Gradient Descent and its variants in detail.

End Notes

Once we’ve tuned the learning rate (alpha) to a suitable value, we iterate the update rule until we reach a local minimum.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
