A Descent Into Gradient Descent
Upon first learning about gradient descent, I assumed I understood how it worked. While my cursory understanding allowed me to plug in an SGDClassifier() model and get an optimal value, I really couldn’t explain the magic behind it. Given that it serves as the backbone of many machine learning algorithms, I decided a cursory understanding was probably not enough. This post is meant to explain gradient descent a bit more deeply than the simple metaphors usually used to describe it.
Those metaphors belie a more complex underbelly, with cogs that seem to spin despite missing parts (this will, hopefully, make sense later). They often go something like the following: pretend you are blindfolded and need to reach the peak of a mountain or the bottom of a valley. How do you find your way? You shuffle your feet, feeling for the slope of the floor, and move in the appropriate direction (up or down) until you reach the peak or valley. Now, for gradient descent, just replace the peak or valley with the optimal value of a given loss function, the floor’s slope with, well, the slope, and your steps with the step size (quite the apt metaphor). While, admittedly, this metaphor captures how gradient descent works, there is much more going on behind the scenes.
A good understanding of gradient descent must include an understanding of derivatives. In calculus, the derivative of a function “measures the sensitivity to change of the function value (output value) with respect to a change in its argument (input value)” (https://en.wikipedia.org/wiki/Derivative). This is often referred to as the “instantaneous rate of change”. Although technically you can’t have an instantaneous change, since change is measured from one instant to another, the idea still stands that derivatives measure how things change. To see where derivatives come in, consider the slope of a straight line (itself a measurement of change). We can determine the rate of change by taking two points and dividing the difference in their vertical values by the difference in their horizontal values, otherwise known in middle school classrooms everywhere as rise over run. This works great for a line, but what if we wanted to determine the rate of change of a curve, whose rate of change fluctuates depending on where on the curve you are? You might start by taking the “secant” through two points, as shown in the graph below:
Now what if you wanted to draw the secant through two lower points on the curve? The “slope” of that line would be flatter. This is where we move into the realm of calculus and derivatives, and by extension gradient descent. Using the derivative, we can find the rate of change at any given point via the tangent line. In layman’s terms, a tangent line is simply a straight line that just barely touches the curve, as shown below:
A bit more formally, the tangent line can be thought of as taking the two points of the secant shown earlier and sliding one point along the curve (changing its x value) until the distance between the points is almost 0, essentially creating a single point. It never technically becomes a point, since you could keep shrinking the distance indefinitely, but the distance becomes so small that the secant’s slope gives us an excellent estimate of the slope at that point. At a basic level, this limiting process is how the derivative gives us the “instantaneous rate of change” at a given point on a curve.
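To make that limiting idea concrete, here is a minimal sketch. The example curve f(x) = x² is my own choice for illustration, not from the post; it shows the secant’s slope homing in on the tangent’s slope as the two points move together:

```python
def f(x):
    return x ** 2  # example curve; its true derivative is 2x

def secant_slope(func, x, h):
    # rise over run between the points (x, func(x)) and (x + h, func(x + h))
    return (func(x + h) - func(x)) / h

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, secant_slope(f, 2.0, h))
# prints slopes 5.0, 4.1, 4.01, 4.001: approaching 4, the derivative at x = 2
```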
The slope of the tangent line at a given point is the indicator gradient descent uses to determine the direction and size of the step to take when searching for the parameter values that minimize loss. We can plot the value of the parameter we want to optimize on the x-axis and the loss on the y-axis (like the example below). As the parameter value changes, and given enough points, the plotted points trace out a convex curve as they approach and pass the optimal value. When the slope at a plotted point becomes horizontal and equals 0, we have reached the minimum loss, or the bottom of the valley from the metaphor earlier.
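To see this curve in numbers, here is a small sketch that evaluates the Sum of Squared Residuals loss (introduced properly below) across a range of slope values, with the intercept held fixed. The data points and the fixed intercept are assumptions of mine, chosen so the true fit is y = 3 + 2x:

```python
xs = [1, 2, 3]
ys = [5, 7, 9]           # illustrative data; not from the post
intercept = 3.0          # held fixed so we can watch one parameter at a time

def ssr(slope):
    # Sum of Squared Residuals as a function of the slope parameter alone
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

for slope in [0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]:
    print(f"slope={slope:.1f}: SSR={ssr(slope):.2f}")
# SSR falls, bottoms out at slope = 2 (where the curve's tangent is flat),
# then rises again: the convex curve described above.
```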
The world of derivatives is deep and deserves a whole blog post of its own, but this should give us a basic idea of what’s going on in gradient descent. The following example walks through gradient descent for a linear regression model where the loss function being minimized is the Sum of Squared Residuals, but the same steps can be generalized and applied to any loss function. The big picture of gradient descent is to iteratively change the parameters of a model in order to get the best values. With that in mind, we will lay out a roadmap provided by Josh Starmer of StatQuest and then explain each step (StatQuest: Gradient Descent, Step-by-Step):
1) Take the derivative of the Loss Function for each parameter in it. In fancy Machine Learning Lingo, take the Gradient of the Loss Function
2) Pick random values for the parameters
3) Plug the parameter values into the derivatives (ahem, the Gradient)
4) Calculate the Step Sizes: Step Size = Slope * Learning Rate
5) Calculate the New Parameters: New Parameter = Old Parameter - Step Size
6) Now Go back to Step 3 and repeat until Step Size is very small, or you reach the Maximum Number of Steps
Step 1: Take the derivative of the Loss Function for each parameter in it. In fancy Machine Learning Lingo, take the Gradient of the Loss Function
Let’s start by saying we want to optimize our intercept and slope values. We can generalize our loss formula for the Sum of Squared Residuals (SSR) to account for our unknown best intercept and slope. Where SSR is the sum of (observed - predicted)², we can rewrite it as SSR = (y_value - (intercept + slope * x_value))², summed over every data point. If we take the point circled in red below (x = 1, y = 5), its term in the formula is: (5 - (intercept + slope * 1))²
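The SSR formula translates directly into code. In this sketch, the point (x=1, y=5) is the one from the example above; the other data points are hypothetical, added so the sum has something to add up:

```python
data = [(1, 5), (2, 8), (3, 10)]  # (1, 5) from the example; rest illustrative

def ssr(intercept, slope):
    # sum of (observed - predicted)^2 over every data point
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)

print(ssr(0.0, 1.0))  # the loss for a candidate intercept of 0 and slope of 1
```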
Now we can take the derivative of the SSR, first with respect to the intercept, then with respect to the slope. Again, we do this for each point and then sum the results to get the derivative of the SSR (with respect to whichever parameter we’re after). When you have multiple derivatives of the same function (one for the intercept, one for the slope), the collection is referred to as a gradient, hence Gradient Descent. So, working with respect to the intercept, we can plug in the first point: (5 - (intercept + slope * 1))². Using the chain rule, we move the exponent to the front: 2(5 - (intercept + slope * 1)). Then we multiply by the derivative of the inside function. Treating the slope term as a constant (the derivative of a constant is 0), the inside function 5 - intercept - slope * 1 differentiates to -1 with respect to the intercept. Our formula then looks like 2(5 - (intercept + slope * 1)) * -1, which simplifies to -2(5 - (intercept + slope * 1)).
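In code, the gradient from this derivation might look like the following sketch (same illustrative data as above; the helper names are mine):

```python
data = [(1, 5), (2, 8), (3, 10)]  # same illustrative points as above

def d_ssr_d_intercept(intercept, slope):
    # the derivation above: -2 * (residual) for each point, summed
    return sum(-2 * (y - (intercept + slope * x)) for x, y in data)

def d_ssr_d_slope(intercept, slope):
    # with respect to the slope, the chain rule brings down -x instead of -1
    return sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)

# the gradient is simply both partial derivatives evaluated together
print(d_ssr_d_intercept(0.0, 1.0), d_ssr_d_slope(0.0, 1.0))
```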
Step 2: Pick random values for the parameters
This step is simply picking random values for the slope and intercept to give us a starting regression line, which we will then optimize using gradient descent.
Step 3: Plug the parameter values into the derivatives (ahem, the Gradient)
Using the random values you chose, plug them into each point’s term in the derivatives of the SSR. The two numbers you get are the slopes of the loss curve with respect to the intercept and the slope. As mentioned above, the slope of the loss curve at the current parameter values tells us how big a change we need to make to find our best values.
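A sketch of this step, reusing the illustrative data from above and writing the two partial derivatives out inline:

```python
import random

data = [(1, 5), (2, 8), (3, 10)]   # same illustrative points as before

# step 2: random starting values for the parameters
intercept = random.uniform(-1, 1)
slope = random.uniform(-1, 1)

# step 3: plug the current parameters into each partial derivative
grad_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
grad_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)
print(grad_intercept, grad_slope)  # the two slopes that drive the next step
```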
Step 4: Calculate the Step Sizes: Step Size = Slope * Learning Rate
Now that we have our slopes, we can determine the step sizes by multiplying each slope by the learning rate. Many people consider the learning rate to be the most important hyperparameter in gradient descent. The learning rate is a number we choose that dictates how big a step we take when updating the parameters at each iteration. The goal with the learning rate is finding a good balance: too high and the model will overshoot and land on a suboptimal value; too low and the steps will be so small that training takes too long or gets stuck before finding the optimal value.
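To see that balance in numbers, here is a toy sketch minimizing the stand-in loss f(w) = w² (derivative 2w), which is my assumption for illustration rather than the post’s SSR, with three different learning rates:

```python
def run(learning_rate, steps=20):
    w = 5.0                            # arbitrary starting parameter
    for _ in range(steps):
        gradient = 2 * w               # slope of the toy loss at the current w
        w -= gradient * learning_rate  # step size = slope * learning rate
    return w

print(run(0.01))  # too low: after 20 steps, still far from the minimum at 0
print(run(0.1))   # balanced: converges close to 0
print(run(1.1))   # too high: every step overshoots and the value explodes
```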
Step 5: Calculate the New Parameters: New Parameter = Old Parameter - Step Size
The new regression line comes from the updated parameters: the old intercept and the old slope, each minus its corresponding step size.
Step 6: Now go back to Step 3 and repeat until Step Size is very small, or you reach the Maximum Number of Steps
This last step is pretty self-explanatory: keep looping until the steps become negligibly small or you hit the maximum number of steps. Putting all six steps together looks something like the sketch below.
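Here is a minimal end-to-end sketch of the six steps, fitting the intercept and slope by gradient descent on the SSR. The data points, starting values, learning rate, and stopping thresholds are all illustrative choices of mine:

```python
data = [(1, 5), (2, 8), (3, 10)]   # same illustrative points as before

intercept, slope = 0.0, 1.0        # step 2: starting values (random in practice)
learning_rate = 0.01
max_steps = 10_000

for _ in range(max_steps):         # step 6: repeat up to a maximum number of steps
    # step 3: plug the current parameters into the gradient from step 1
    grad_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    grad_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)

    # step 4: Step Size = Slope * Learning Rate
    step_intercept = grad_intercept * learning_rate
    step_slope = grad_slope * learning_rate

    # step 5: New Parameter = Old Parameter - Step Size
    intercept -= step_intercept
    slope -= step_slope

    # step 6: stop once the steps become very small
    if abs(step_intercept) < 1e-6 and abs(step_slope) < 1e-6:
        break

print(intercept, slope)  # approaches the least-squares fit for these points
```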
To reiterate, this same process can be applied to any loss function; you just substitute the intercept and slope with whatever parameters you want to optimize. I know this post didn’t cover everything there is to know about gradient descent (like how exactly to calculate a derivative, or how to find the optimal learning rate), but hopefully walking through the process gave some insight into what it is we are calculating. If you feel so inclined, you can also visit some of the links below for more info on the topics discussed.