Gradient Descent vs Stochastic Gradient Descent: Some Observations of Behaviour

Anyone who does any machine learning at all has heard of the gradient descent and stochastic gradient descent algorithms. What interests me about machine learning is understanding how some of these algorithms (and as a consequence the parameters they are estimating) behave in different scenarios.

The following are single observations from a random data generating process:

y(i) = x(i) +e(i)

where e is a random variable distributed in one of a variety of ways:

  • e is distributed N(0,1). This is a fairly standard experimental set up and should be considered “best case scenario” for most estimators.
  • e is distributed t(4). This is a data generation process that’s very fat tailed (leptokurtotic). It’s a fairly common feature in fields like finance. All the required moments for the ordinary least squares estimator to have all its usual properties exist, but only just. (If you don’t have a stats background, think of this as the edge of reasonable cases to test this estimator against.)
  • e is distributed as centred and standardised chi squared (2). In this case, the necessary moments exist and we have ensured e has a zero mean. But the noise isn’t just fat-tailed, it’s also skewed. Again, not a great scenario for most estimators, but a useful one to test against.

The independent variable x is intended to be deterministic, but in this case is generated as N(1,1). To get all the details of what I did, the code is here. Put simply, I estimated the ordinary least squares estimators for the intercept and coefficient of the process (e.g. the true intercept here is zero and the true coefficient is 1). I did this using stochastic gradient descent and gradient descent to find the minima of the likelihood and generate the estimates.

Limits: the following are just single  examples of what the algorithms do in each case. It’s not a full test of algorithm performance: we would repeat each of these thousands of times and then take overall performance measures to really know how these algorithms perform under these conditions. I just thought it would be interesting to see what happens. If I get some more time I’d develop this into a full simulation experiment if I find something worth exploring in detail.

Here’s what I found:

N is pretty large (10 000). Everything is super (for SGD)!

Realistically, unless dimensionality is huge, closed form ordinary least squares (OLS) estimation methods would do just fine here, I suspect. However, the real point here is to test differences between SGD and GD.

I used a random initialisation for the algorithm and you can see how quickly it got to the true parameter. In situations like this where the algorithm uses only a few iterations, if you are averaging SGD estimators then allowing for a burn in and letting it run beyond the given tolerance level may be very wise. In this case, averaging the parameters over the entire set of iterations would be a worse result than just taking the last estimator.

Under each scenario, the stochastic gradient descent algorithm performed brilliantly and much faster than the gradient descent equivalent which struggled. The different error distributions mattered very little here.

For reference, it took around 37 000 iterations for the gradient descent algorithm to reach its end point:

For the other experiments, I’ll just stick to a maximum of 10 000 iterations.

N is largish (N = 1000). Stochastic gradient descent doesn’t do too badly.

Now, to be fair, unless I was trying to estimate something with huge dimensionality (and this hasn’t been tested here anyway), I’d just use standard OLS estimating procedures for estimating this model in real life.

SGD takes marginally longer to get where it’s going, but the generating process made no material difference, although our example for chi squared was out slightly- this is one example.

N is smallish (N = 100): SGD gets brittle, but it gets somewhere close. GD has no clue.

Again, in real life, this is a moment for simpler methods but the interesting point here is that SGD takes a lot longer and fails to reach the exact true value, especially for the constant term (again, one example in each case, not a thorough investigation).

Here, GD has no clue and it’s taking SGD thousands more iterations to reach a less palatable conclusion. In practical machine learning contexts, it’s highly unlikely this is a sensible estimation method at these sample sizes: just use the standard OLS techniques.

I’m not destructive in general, but sometimes I like to break things.

I failed to achieve SGD armageddon by either weird error distributions (that were within the realms of reasonable for the estimator) or smallish sample sizes. So I decided to give it the doomsday scenario: smallish N and an error distribution that does not fulfil the requirements that the simple linear regression model needs to work. I tried a t(1) distribution for the error- this thing doesn’t even have a finite variance.

SGD and GD are utterly messed up (and I suspect the closed form solution may not be a whole lot better here either). Algorithmic armageddon, achieved. In practical terms, you can probably safely file this under “never going to happen”.

Leave a Comment