Last week, we saw how to code a simple network from scratch, using nothing but torch tensors. Predictions, loss, gradients, weight updates – all these things we computed ourselves. Today, we make a significant change: namely, we spare ourselves the cumbersome calculation of gradients and have torch do it for us. Before that, though, let’s get some background.
Automatic differentiation with autograd
torch uses a module called autograd to

- record operations performed on tensors, and
- store what will have to be done to obtain the corresponding gradients, once we enter the backward pass.
These prospective actions are stored internally as functions, and when it’s time to compute the gradients, these functions are applied in order: Application starts from the output node, and calculated gradients are successively propagated back through the network. This is a form of reverse mode automatic differentiation.
Autograd basics
As users, we can see a bit of the implementation. As a prerequisite for
this “recording” to happen, tensors have to be created with
requires_grad = TRUE. For example:
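```r
library(torch)

# a 2x2 tensor of ones, flagged for gradient tracking
x <- torch_ones(2, 2, requires_grad = TRUE)
```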
To be clear, x is now a tensor with respect to which gradients have to be calculated – normally, a tensor representing a weight or a bias, not the input data¹. If we subsequently perform some operation on that tensor, assigning the result to y,
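```r
y <- x$mean()
```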
we find that y now has a non-empty grad_fn that tells torch how to
compute the gradient of y with respect to x:
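```r
y$grad_fn
```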
```
MeanBackward0
```
Actual computation of gradients is triggered by calling backward()
on the output tensor.
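```r
y$backward()
```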
After backward() has been called, x has a non-null field termed
grad that stores the gradient of y with respect to x:
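```r
x$grad
```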
```
torch_tensor
 0.2500  0.2500
 0.2500  0.2500
[ CPUFloatType{2,2} ]
```
With longer chains of computations, we can take a glance at how torch
builds up a graph of backward operations. Here is a slightly more
complex example – feel free to skip if you’re not the type who just
has to peek into things for them to make sense.
Digging deeper
We build up a simple graph of tensors, with inputs x1 and x2 being
connected to output out by intermediaries y and z.
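Concretely, a setup like the following reproduces the outputs shown below:

```r
x1 <- torch_ones(2, 2, requires_grad = TRUE)
x2 <- torch_tensor(1.1, requires_grad = TRUE)

# y and z are intermediate results; out is the final, scalar output
y <- x1 * (x2 + 2)
z <- y$pow(2) * 3
out <- z$mean()
```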
To save memory, intermediate gradients are normally not stored. Calling retain_grad() on a tensor lets us deviate from this default. Let’s do that here, for the sake of demonstration:
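```r
y$retain_grad()
z$retain_grad()
```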
Now we can go backwards through the graph and inspect torch’s action
plan for backprop, starting from out$grad_fn, like so:
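```r
out$grad_fn
```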
```
MeanBackward0
```
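```r
# one step up: how to backpropagate through the multiplication by 3
# in z <- y$pow(2) * 3
out$grad_fn$next_functions
```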
```
[[1]]
MulBackward1
```
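```r
# how to backpropagate through pow() in z <- y$pow(2) * 3
out$grad_fn$next_functions[[1]]$next_functions
```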
```
[[1]]
PowBackward0
```
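```r
# how to backpropagate through the multiplication in y <- x1 * (x2 + 2)
out$grad_fn$next_functions[[1]]$next_functions[[1]]$next_functions
```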
```
[[1]]
MulBackward0
```
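```r
# here we reach a leaf, x1 (AccumulateGrad), as well as the
# addition in x2 + 2
out$grad_fn$next_functions[[1]]$next_functions[[1]]$next_functions[[1]]$next_functions
```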
```
[[1]]
torch::autograd::AccumulateGrad

[[2]]
AddBackward1
```
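```r
# following the addition one step further, we arrive at the other leaf, x2
out$grad_fn$next_functions[[1]]$next_functions[[1]]$next_functions[[1]]$next_functions[[2]]$next_functions
```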
```
[[1]]
torch::autograd::AccumulateGrad
```
If we now call out$backward(), all tensors in the graph will have
their respective gradients calculated.
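```r
out$backward()

# inspect the gradients, in this order: z, y, x2, x1
z$grad
y$grad
x2$grad
x1$grad
```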
```
torch_tensor
 0.2500  0.2500
 0.2500  0.2500
[ CPUFloatType{2,2} ]
torch_tensor
 4.6500  4.6500
 4.6500  4.6500
[ CPUFloatType{2,2} ]
torch_tensor
 18.6000
[ CPUFloatType{1} ]
torch_tensor
 14.4150  14.4150
 14.4150  14.4150
[ CPUFloatType{2,2} ]
```
After this nerdy excursion, let’s see how autograd makes our network simpler.
The simple network, now using autograd
Thanks to autograd, we say good-bye to the tedious, error-prone
process of coding backpropagation ourselves. A single method call does
it all: loss$backward().
With torch keeping track of operations as required, we don’t even have
to explicitly name the intermediate tensors any more. We can code
forward pass, loss calculation, and backward pass in just three lines:
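```r
# forward pass: hidden layer with ReLU activation, then output layer
y_pred <- x$mm(w1)$add(b1)$clamp(min = 0)$mm(w2)$add(b2)
# loss: sum of squared errors
loss <- (y_pred - y)$pow(2)$sum()
# backward pass: compute gradients for all tensors with requires_grad = TRUE
loss$backward()
```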
Here is the complete code. We’re at an intermediate stage: We still manually compute the forward pass and the loss, and we still manually update the weights. Due to the latter, there is something I need to explain. But I’ll let you check out the new version first:
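(In the sketch below, the simulated data, network dimensions, learning rate, and iteration count are illustrative choices.)

```r
library(torch)

### generate training data --------------------------------------------------

# input dimensionality (number of input features)
d_in <- 3
# output dimensionality (number of predicted features)
d_out <- 1
# number of observations in training set
n <- 100

# create random toy data
x <- torch_randn(n, d_in)
y <- x[, 1, drop = FALSE] * 0.2 - x[, 2, drop = FALSE] * 1.3 -
  x[, 3, drop = FALSE] * 0.5 + torch_randn(n, 1)

### initialize weights ------------------------------------------------------

# dimensionality of hidden layer
d_hidden <- 32

# weights connecting input to hidden layer
w1 <- torch_randn(d_in, d_hidden, requires_grad = TRUE)
# weights connecting hidden to output layer
w2 <- torch_randn(d_hidden, d_out, requires_grad = TRUE)

# hidden layer bias
b1 <- torch_zeros(1, d_hidden, requires_grad = TRUE)
# output layer bias
b2 <- torch_zeros(1, d_out, requires_grad = TRUE)

### training loop -----------------------------------------------------------

learning_rate <- 1e-4

for (t in 1:200) {

  ### -------- forward pass --------
  y_pred <- x$mm(w1)$add(b1)$clamp(min = 0)$mm(w2)$add(b2)

  ### -------- compute loss --------
  loss <- (y_pred - y)$pow(2)$sum()
  if (t %% 10 == 0) cat("Epoch: ", t, "   Loss: ", loss$item(), "\n")

  ### -------- backpropagation --------
  # compute the gradient of loss with respect to all tensors that
  # have requires_grad = TRUE
  loss$backward()

  ### -------- update weights --------
  # wrap in with_no_grad() because this is a part we do NOT want
  # recorded for automatic gradient computation
  with_no_grad({
    w1$sub_(learning_rate * w1$grad)
    w2$sub_(learning_rate * w2$grad)
    b1$sub_(learning_rate * b1$grad)
    b2$sub_(learning_rate * b2$grad)

    # zero the gradients after every pass, as they would accumulate otherwise
    w1$grad$zero_()
    w2$grad$zero_()
    b1$grad$zero_()
    b2$grad$zero_()
  })
}
```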
As explained above, after some_tensor$backward(), all tensors preceding it in the graph² will have their grad fields populated.
We make use of these fields to update the weights. But now that
autograd is “on”, whenever we execute an operation we don’t want
recorded for backprop, we need to explicitly exempt it: This is why we
wrap the weight updates in a call to with_no_grad().
While this is something you may file under “nice to know” – after all,
once we arrive at the last post in the series, this manual updating of
weights will be gone – the idiom of zeroing gradients is here to
stay: Values stored in grad fields accumulate; whenever we’re done
using them, we need to zero them out before reuse.
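To watch the accumulation happen, here is a minimal, made-up illustration:

```r
x <- torch_ones(1, requires_grad = TRUE)

(2 * x)$backward()
x$grad            # 2

(2 * x)$backward()
x$grad            # 4: the second gradient was added to the first

x$grad$zero_()    # reset before the gradients are needed again
```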
Outlook
So where do we stand? We started out coding a network completely from
scratch, making use of nothing but torch tensors. Today, we got
significant help from autograd.
But we’re still manually updating the weights – and aren’t deep learning frameworks known to provide abstractions (“layers”, or “modules”) on top of tensor computations …?
We address both issues in the follow-up installments. Thanks for reading!