In `PyTorch`, we need to set the gradients to zero before starting backpropagation because PyTorch *accumulates the gradients* on subsequent backward passes. This is convenient while training RNNs. So, the default action is to accumulate (i.e. sum) the gradients on every `loss.backward()` call.
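
This accumulation is easy to see in isolation. Below is a minimal sketch (the tensor and names are just for illustration): calling `.backward()` a second time without zeroing in between adds the new gradients to the ones already stored in `.grad`.

```
import torch

x = torch.ones(3, requires_grad=True)

loss1 = (x * 2).sum()
loss1.backward()
print(x.grad)    # tensor([2., 2., 2.])

loss2 = (x * 2).sum()
loss2.backward()
print(x.grad)    # tensor([4., 4., 4.])  -- summed, not overwritten

x.grad.zero_()   # reset before the next backward pass
print(x.grad)    # tensor([0., 0., 0.])
```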

Because of this, when you start your training loop, you should ideally zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a mix of the old gradient (which you have already used to update the parameters) and the newly computed gradient, and would therefore point in some direction other than the intended direction towards the *minimum* (or *maximum*, in the case of maximization objectives).

Here is a simple example:

```
import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # registered with this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so .backward() can be called without arguments
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
```

Alternatively, if you're doing *vanilla gradient descent* and updating the parameters by hand, then:

```
learning_rate = 0.01  # step size for the manual update

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (.grad is None until the first backward pass, hence the check)
    if W.grad is not None:
        W.grad.zero_()
    if b.grad is not None:
        b.grad.zero_()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update the parameters in place without tracking the update in autograd
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
```

**Note**: The *accumulation* (i.e. *sum*) of gradients happens when `.backward()` is called on the `loss` tensor.
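
Since the summing happens on every `.backward()` call, it can also be used on purpose, e.g. to accumulate gradients over several small batches and take a single optimizer step. A rough sketch, where `model`, `criterion`, and `loader` are assumed placeholders rather than anything defined above:

```
accumulation_steps = 4  # assumed number of mini-batches per update

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(loader):
    loss = criterion(model(inputs), labels)
    loss.backward()                    # gradients are summed into .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()               # update with the accumulated gradients
        optimizer.zero_grad()          # reset before the next accumulation window
```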