Working Notes - Learning PyTorch - Day 1

Apr 11, 2022 17:06 · 2459 words · 12 minute read Machine Learning AI

Summary

Covered boilerplate for getting started, tensors, gradients.
Poked in some detail at how gradients work.
Played with linear regression, nonlinear regression, both with gradient descent.
Built an MNIST classifier following a PyTorch tutorial. Experimented with the shape and size of the NN.
Got PyTorch running on a machine with a GPU.

Boilerplate

Import:

import torch

Need to use a special float type for the elements of tensors, and need to know which device the compute graph will run on:

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

Tensors

Act like numpy arrays at time-of-compute. Support most of the same operations.
Build a compute graph along the way. This can be run on various target devices.
Initialized much like numpy arrays (e.g. x = torch.linspace(0,1, 200, device=device, dtype=dtype))
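A minimal sketch of the numpy-like behaviour described above, assuming the CPU device from the boilerplate section (the variable y is only for illustration):

```python
import torch

dtype = torch.float
device = torch.device("cpu")

# Create a tensor much as one would with numpy.linspace
x = torch.linspace(0, 1, 200, device=device, dtype=dtype)

# Element-wise arithmetic and reductions read like numpy
y = 2 * x + 1
print(y.shape)
print(y.mean())
print(y.device, y.dtype)
```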

Gradients

PyTorch implements automatic differentiation.
Derivatives are propagated through the tensor compute graph.
Tensors have a requires_grad boolean attribute.
- If this is True, PyTorch tracks gradients on all operations involving them.
- If it’s False, that information isn’t tracked.
Differentiation is primarily done through backward propagation, implemented by the Q.backward(...) method.
- Here Q is some tensor.
- This calculates the derivative of Q with respect to the leaf tensors of the graph, which is accumulated in the .grad attribute of those tensors.
  - Because this is an accumulation, the .grad of those leaves needs to be zeroed out explicitly if this is called repeatedly.
The gradient is always of a scalar with respect to the leaf tensors.
- Q.backward() is an invalid call unless Q is a scalar.
- So if Q is not a scalar we have to supply a tensor R to dot it against to form a scalar.
  - Q.backward(gradient=R) is a valid call if R has the same shape as Q, and computes the gradient of the inner product Q.R with respect to the leaf tensors.
- This means that the .grad attribute of the leaf tensors always has the same shape as those tensors.
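A short sketch of the two points above that are easiest to trip over: the gradient argument weights the entries of Q, and .grad accumulates across calls. The tensor R and the names g1, g2, g3 are made up for illustration.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
R = torch.tensor([1., 10.])

# Non-scalar Q: backward differentiates the scalar (Q * R).sum()
Q = a**2
Q.backward(gradient=R)
g1 = a.grad.clone()  # 2 * R * a = [4., 60.]

# A second backward pass adds to .grad rather than replacing it
Q = a**2
Q.backward(gradient=R)
g2 = a.grad.clone()  # [8., 120.]

# Reset before reuse
a.grad = None
(a**2).sum().backward()
g3 = a.grad.clone()  # [4., 6.]

print(g1, g2, g3)
```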

Toy Examples:

Construct $Q = a^2$ where $a$ has rank-1 and compute $\partial \sum(Q)/\partial a$. Outputs $[2a_0,2a_1]$.

import torch

a = torch.tensor([2., 3.], requires_grad=True)
Q = a**2

Jacobian_vector = torch.tensor([1., 1.])  # float, to match the dtype of Q
Q.backward(gradient=Jacobian_vector)
print(a.grad)

Construct $Q=a_0^2+a_1^2$ and compute $\partial Q/\partial a$. Outputs $[2a_0,2a_1]$.

import torch

a = torch.tensor([2., 3.], requires_grad=True)
Q = a[0]**2 + a[1]**2

Q.backward()
print(a.grad)

Linear Regression Example

Linear regression is used to solve $A.x=b$, which is equivalent to minimizing $|A.x-b|^2$. So let’s do this with gradient descent:

import torch

b = torch.tensor([1.,1], requires_grad=False)
A = torch.tensor([[0.,1],[1,0]], requires_grad=False)
x = torch.tensor([0.,0], requires_grad=True)
gradient_damp = 0.5

for i in range(10):
	loss = ((torch.einsum('ij,j->i',A,x)-b)**2).sum() # Construct loss
	loss.backward() # Backprop
	with torch.no_grad(): # Don't track gradients in this step
		x -= gradient_damp * x.grad # Gradient descent
		x.grad = None # Zero out for next iteration

loss = ((torch.einsum('ij,j->i',A,x)-b)**2).sum() # Construct loss
print(loss)
print(x)

This (correctly) returns the solution $x=[1,1]$ with zero loss. I used einsum because I find it easier to explicitly write out my indexing than trying to remember how e.g. dot or matmul work.

Note that we had to tell PyTorch not to track gradients during the updates on x. This is because we don’t want to track the gradient of the new x with respect to the old x (which no longer exists in memory once it is overwritten in place). If we don’t specifically call this out we get a RuntimeError, because x is a leaf tensor that requires gradients and PyTorch can see that an in-place op on it doesn’t make any sense in terms of the compute graph.

Also note that we didn’t need to require gradients on b or A, because those aren’t parameters in our model.
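A small sketch of that failure mode, using a made-up one-line loss rather than the regression above. The flag raised is just for illustration.

```python
import torch

x = torch.tensor([0., 0.], requires_grad=True)
loss = ((x - 1)**2).sum()
loss.backward()

try:
    x -= 0.5 * x.grad  # in-place update of a leaf that requires grad
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

with torch.no_grad():
    x -= 0.5 * x.grad  # allowed: autograd is not recording here

print(raised, x)
```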

Returning to the example above, it happens that the first gradient step is all we need. If we print x in the intermediate steps we get:

tensor([0., 0.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)
tensor([1., 1.], requires_grad=True)

This is a quirk of the damping factor we picked. If we use gradient_damp=0.1 we get

tensor([0., 0.], requires_grad=True)
tensor([0.2000, 0.2000], requires_grad=True)
tensor([0.3600, 0.3600], requires_grad=True)
tensor([0.4880, 0.4880], requires_grad=True)
tensor([0.5904, 0.5904], requires_grad=True)
tensor([0.6723, 0.6723], requires_grad=True)
tensor([0.7379, 0.7379], requires_grad=True)
tensor([0.7903, 0.7903], requires_grad=True)
tensor([0.8322, 0.8322], requires_grad=True)
tensor([0.8658, 0.8658], requires_grad=True)

which is gradually converging to the correct answer, but isn’t there yet.
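Both printouts can be reproduced by hand. For this particular A (a permutation matrix, so $A^T A$ is the identity) and b, the gradient of the loss is $2(x - [1,1])$, so each step multiplies the distance to the solution by 1 - 2*gradient_damp. A sketch of that closed form, which holds only for this A and b:

```python
# Closed-form iterate for this A and b, starting from x = 0:
# each component is 1 - (1 - 2*damp)**k after k steps
results = {}
for damp in (0.5, 0.1):
    factor = 1 - 2 * damp
    results[damp] = [round(1 - factor**k, 4) for k in range(10)]
    print(damp, results[damp])
```

With a damping of 0.5 the factor is exactly zero, which is why one step suffices. With 0.1 the error shrinks by a factor of 0.8 per step.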

Nonlinear Regression

Let’s try to fit a mix of sine waves to some data. First generate the data:

import torch

# Make some data
N = 1000
x = 2 * torch.pi * torch.rand(N, requires_grad=False)
y = 1 * torch.sin(1 * x) #+ 2 * torch.sin(0.7*x)

Next let’s construct a model:

# Model parameters: a0 * sin(a1 * x)
# Correct final output is [1, 1]
params = torch.tensor([1.3, 1], requires_grad=True)
# model = params[0] * torch.sin(params[1] * x)

We can then do gradient descent:

import torch

# Make some data
N = 1000
x = 2 * torch.pi * torch.rand(N, requires_grad=False)
y = 1 * torch.sin(1 * x) #+ 2 * torch.sin(0.7*x)

# Model parameters: a0 * sin(a1 * x)
# Correct final output is [1, 1]
params = torch.tensor([1.3, 1], requires_grad=True)

# Fit the model with gradient descent
gradient_size = 0.1

for i in range(40):
	model = params[0] * torch.sin(params[1] * x)
	loss = ((y - model)**2).sum() # Construct loss
	loss.backward() # Backprop
	print(loss,params.grad/sum(params.grad**2)**0.5)
	with torch.no_grad(): # Don't track gradients in this step
		params -= gradient_size * params.grad / (1 + sum(params.grad**2)**0.5) # Gradient descent
		params.grad = None # Zero out for next iteration

model = params[0] * torch.sin(params[1] * x) # Re-evaluate the model with the final parameters
loss = ((y - model)**2).sum() # Construct loss
print(loss)
print(params)

Note that because the gradients are large in this case (due to the large amount of data going into the loss) I’ve normalized it so we take a step of at most size gradient_size in parameter space each time.

This prints output like:

tensor(44.5198, grad_fn=<SumBackward0>) tensor([ 0.8813, -0.4726])
tensor(37.3011, grad_fn=<SumBackward0>) tensor([0.2550, 0.9669])
tensor(40.3255, grad_fn=<SumBackward0>) tensor([ 0.2711, -0.9625])
tensor(27.7070, grad_fn=<SumBackward0>) tensor([0.1999, 0.9798])
tensor(32.0618, grad_fn=<SumBackward0>) tensor([ 0.2271, -0.9739])
tensor(21.7022, grad_fn=<SumBackward0>) tensor([0.1519, 0.9884])
tensor(27.0113, grad_fn=<SumBackward0>) tensor([ 0.1877, -0.9822])
tensor(18.0849, grad_fn=<SumBackward0>) tensor([0.1108, 0.9938])
tensor(24.0175, grad_fn=<SumBackward0>) tensor([ 0.1539, -0.9881])
tensor(15.9776, grad_fn=<SumBackward0>) tensor([0.0767, 0.9971])
tensor(22.2906, grad_fn=<SumBackward0>) tensor([ 0.1262, -0.9920])
tensor(14.7849, grad_fn=<SumBackward0>) tensor([0.0491, 0.9988])
tensor(21.3164, grad_fn=<SumBackward0>) tensor([ 0.1041, -0.9946])
tensor(14.1264, grad_fn=<SumBackward0>) tensor([0.0274, 0.9996])
tensor(20.7759, grad_fn=<SumBackward0>) tensor([ 0.0869, -0.9962])
tensor(13.7711, grad_fn=<SumBackward0>) tensor([0.0107, 0.9999])
tensor(20.4790, grad_fn=<SumBackward0>) tensor([ 0.0738, -0.9973])
tensor(13.5837, grad_fn=<SumBackward0>) tensor([-0.0020,  1.0000])
tensor(20.3159, grad_fn=<SumBackward0>) tensor([ 0.0640, -0.9979])
tensor(13.4877, grad_fn=<SumBackward0>) tensor([-0.0115,  0.9999])
tensor(20.2253, grad_fn=<SumBackward0>) tensor([ 0.0568, -0.9984])
tensor(13.4409, grad_fn=<SumBackward0>) tensor([-0.0185,  0.9998])
...
tensor(13.4534, grad_fn=<SumBackward0>) # loss
tensor([0.9829, 0.9430], requires_grad=True) # parameters

Initially gradient descent works well, then it gets close to the correct answer and the gradient with respect to the sine frequency (second parameter) comes to dominate, causing behaviour where it flips back and forth between two nearly-correct frequencies. It might take a more sophisticated method (e.g. momentum) to do better here.
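As a rough sketch of the momentum idea, assuming torch.optim.SGD with momentum=0.9, a mean rather than summed loss, and an untuned learning rate of 0.05. None of these settings come from the run above, and the helper loss_fn is made up for illustration.

```python
import torch

torch.manual_seed(0)

# Same kind of data as above
N = 1000
x = 2 * torch.pi * torch.rand(N)
y = torch.sin(x)

params = torch.tensor([1.3, 1.0], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.05, momentum=0.9)  # assumed settings

def loss_fn():
    model = params[0] * torch.sin(params[1] * x)
    return ((y - model)**2).mean()  # mean keeps the gradients O(1)

initial_loss = loss_fn().item()
for i in range(300):
    optimizer.zero_grad()
    loss = loss_fn()
    loss.backward()
    optimizer.step()

final_loss = loss_fn().item()
print(initial_loss, final_loss)
print(params)
```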

Datasets

PyTorch comes with functions for downloading commonly-used datasets. For instance the following downloads MNIST (images of handwritten digits, with corresponding labels) and separates it into training and test sets:

import torch
from torch.utils.data import DataLoader, Dataset  # DataLoader is used for batching below
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

The API here seems to mostly be dataset-specific, but many datasets share common options (e.g. FashionMNIST and MNIST have the same interface).

Inside, the MNIST dataset behaves like a list of tuples of the form (image tensor, image label), where the label is an integer and the tensor is a three-dimensional array of shape (1,28,28): one channel of 28x28 pixels.

Example: Simple Neural Network for MNIST

(Following https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html)

This combines some linear layers with some ReLU layers, then outputs a score for each possible label (which we’ll interpret as a probability).

import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
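            # Notice the dimension convention: input is on the left, output is on the right.
            nn.Linear(28*28, 512), # 28x28 is the MNIST image size
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(), # element-wise operation
            nn.Linear(512, 10),
            # Softmax turns this from logits into probabilities.
            # Caveat: nn.CrossEntropyLoss (used below) expects raw logits and applies
            # log-softmax itself, so with this layer softmax is effectively applied twice.
            nn.Softmax(dim=1),
        )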

    def forward(self, x):
        x = self.flatten(x)
        probabilities = self.linear_relu_stack(x)
        return probabilities

Neural networks need to be pushed to the device (though I don’t fully understand what’s happening here…)

# Make an instance of the neural network, and move it to the device
device = torch.device("cpu")
model = NeuralNetwork().to(device)
print(model)
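As far as I can tell, .to(device) moves the module's parameters and buffers onto the target device. A small sketch on a toy layer (the nn.Linear(4, 2) is made up for illustration, and the device falls back to the CPU when no GPU is present):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(4, 2)  # parameters are created on the CPU by default
print(next(layer.parameters()).device)

layer = layer.to(device)  # moves the parameters and buffers to `device`
print(next(layer.parameters()).device)
```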

Now set up for training:

### Setup for training

# hyperparams
epochs = 5 # number of times we iterate over the data
learning_rate = 1e-3 # gradient damping
batch_size = 64 # number of samples to use in a batch

# We'll train in batches
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# Loss
loss_fn = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

and construct the training and testing functions:

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad() # Zeros out gradients (remember .backward *accumulates*)
        loss.backward() # calculate new gradients
        optimizer.step() # update parameter weights with gradient descent

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad(): # No reason to compute gradients in testing
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
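To see what the three optimizer calls in train_loop amount to, here is a small sketch checking that optimizer.step() for plain SGD matches the hand-written update w -= learning_rate * w.grad used in the regression examples. The toy layer and random data are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 1e-3

layer = nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=learning_rate)

X = torch.rand(8, 3)
y = torch.rand(8, 1)

optimizer.zero_grad()
loss = ((layer(X) - y)**2).mean()
loss.backward()

# What plain gradient descent should do to the weights, computed by hand
expected = layer.weight.detach() - learning_rate * layer.weight.grad

optimizer.step()
matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```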

Then we train!

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)

Here’s the first Epoch of output:

loss: 2.302563  [    0/60000]
loss: 2.301651  [ 6400/60000]
loss: 2.302801  [12800/60000]
loss: 2.302033  [19200/60000]
loss: 2.302732  [25600/60000]
loss: 2.302280  [32000/60000]
loss: 2.302579  [38400/60000]
loss: 2.302456  [44800/60000]
loss: 2.301929  [51200/60000]
loss: 2.302089  [57600/60000]
Test Error: 
 Accuracy: 15.4%, Avg loss: 2.302021

It’s interesting that the testing accuracy keeps going up through training (from 15% to 30%) even though the loss never budges.
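A possible explanation for the flat loss: nn.CrossEntropyLoss applies log-softmax to its input itself, so with nn.Softmax already at the end of the model the loss is computed on probabilities rather than logits. If that is what is happening here, the loss is squeezed into a narrow band. The sketch below feeds hand-made "probabilities" to the loss (the label 3 is arbitrary):

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
y = torch.tensor([3])  # an arbitrary true label

# Uniform probabilities, as from an untrained network
uniform = torch.full((1, 10), 0.1)
loss_uniform = loss_fn(uniform, y).item()

# Perfectly confident and correct probabilities: the best the model can do
one_hot = torch.zeros(1, 10)
one_hot[0, 3] = 1.0
loss_best = loss_fn(one_hot, y).item()

print(loss_uniform, math.log(10))            # both about 2.303
print(loss_best, math.log(9 + math.e) - 1)   # both about 1.461
```

So under this assumption the loss starts near ln(10) and can never drop below roughly 1.461, however accurate the predictions become.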

Experiment: Randomizing the model parameters

The model parameters are stored in a state dictionary, which is accessed through getter and setter methods. So we can randomize the model parameters via e.g.

state_dict = model.state_dict()
for key in state_dict.keys():
	state_dict[key] = torch.rand(state_dict[key].shape)
model.load_state_dict(state_dict)

Doing this leads to much worse training behaviour. I’m not sure why, but the loss never gets down as low as 2.3 and the test accuracy never rises above 10.3%.
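One possible reason, offered as a guess rather than something I verified on MNIST: torch.rand draws from the uniform distribution on [0, 1), so every weight and bias is positive and much larger than PyTorch's default initialization. The activations then grow layer by layer and the final softmax saturates. A sketch on a random stand-in image (make_stack is a made-up helper that mirrors the network above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_stack():
    return nn.Sequential(
        nn.Linear(28 * 28, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 10), nn.Softmax(dim=1),
    )

x = torch.rand(1, 28 * 28)  # stand-in for a flattened image

default_model = make_stack()  # PyTorch's default initialization

rand_model = make_stack()     # randomized as in the snippet above
state_dict = rand_model.state_dict()
for key in state_dict.keys():
    state_dict[key] = torch.rand(state_dict[key].shape)
rand_model.load_state_dict(state_dict)

with torch.no_grad():
    p_default = default_model(x)
    p_rand = rand_model(x)

print(p_default.max().item())  # close to uniform over the 10 classes
print(p_rand.max().item())     # essentially all mass on one class
```

A saturated softmax has almost no gradient, which would be consistent with the training getting stuck.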

Experiment: Faster learning and more Epochs

What happens if we up the learning rate to 1e-2 and provide 10 epochs?

Epoch 1
loss: 2.297943  [57600/60000]
 Accuracy: 34.6%, Avg loss: 2.297844 
Epoch 5
loss: 1.916193  [57600/60000]
 Accuracy: 69.6%, Avg loss: 1.911707 
Epoch 10
loss: 1.654531  [57600/60000]
 Accuracy: 83.4%, Avg loss: 1.647756

Even from epoch 9 -> 10 things were still improving, which is cool.

How about a rate of 1e-1?

Epoch 10
loss: 1.520834  [57600/60000]
 Accuracy: 94.9%, Avg loss: 1.515659

Well that’s cool.

How about a rate of 1e0? [Surely this will fail?!]

Epoch 1
loss: 1.529209  [57600/60000]
 Accuracy: 93.0%, Avg loss: 1.532691 
Epoch 7
loss: 1.491918  [57600/60000]
 Accuracy: 97.5%, Avg loss: 1.486677 
Epoch 8
loss: 1.477088  [57600/60000]
 Accuracy: 96.9%, Avg loss: 1.492298 
Epoch 10
loss: 1.493404  [57600/60000]
 Accuracy: 97.6%, Avg loss: 1.485000

Some non-monotonicity, but still pretty good!

Experiment: Fewer layers

What if we take out some of the layers? (And leave the crazy learning rate of 1e0!)

self.linear_relu_stack = nn.Sequential(
    # Notice the dimension convention: input is on the left, output is on the right.
    nn.Linear(28*28, 512), # 28x28 is the MNIST image size
    nn.ReLU(), # element-wise operation
    nn.Linear(512, 10),
    nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
)

Well:

Epoch 1
loss: 1.564381  [57600/60000]
 Accuracy: 93.4%, Avg loss: 1.532022 
Epoch 5
loss: 1.498894  [57600/60000]
 Accuracy: 97.0%, Avg loss: 1.493585 
Epoch 10
loss: 1.478126  [57600/60000]
 Accuracy: 97.8%, Avg loss: 1.484777

Well that’s fun. The simpler model does better.

Experiment: Fewer layers and Narrower/Wider Layers

How about we take those layers out and make the remaining ones narrower?

nn.Linear(28*28, 100), # 28x28 is the MNIST image size
nn.ReLU(), # element-wise operation
nn.Linear(100, 10),
nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.

Well…

Epoch 1
loss: 1.563033  [57600/60000]
 Accuracy: 92.8%, Avg loss: 1.536991 
Epoch 5
loss: 1.519344  [57600/60000]
 Accuracy: 96.6%, Avg loss: 1.497399 
Epoch 10
loss: 1.491755  [57600/60000]
 Accuracy: 97.3%, Avg loss: 1.488801

So that’s a little worse. How about if we bump the width back up to 200?

Epoch 1
loss: 1.541748  [57600/60000]
 Accuracy: 93.0%, Avg loss: 1.535781 
Epoch 5
loss: 1.507868  [57600/60000]
 Accuracy: 96.4%, Avg loss: 1.500574 
Epoch 10
loss: 1.482715  [57600/60000]
 Accuracy: 97.5%, Avg loss: 1.487760

Still not quite as good as the wider layers. Does that mean we can do better with even wider ones than we started with? Say 1024?

Epoch 1
loss: 1.557399  [57600/60000]
 Accuracy: 93.6%, Avg loss: 1.529462 
Epoch 5
loss: 1.507075  [57600/60000]
 Accuracy: 97.1%, Avg loss: 1.491251 
Epoch 10
loss: 1.479086  [57600/60000]
 Accuracy: 97.6%, Avg loss: 1.486674

Hmmmm. That’s a little worse than when we had 512.
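For a sense of scale, here is a sketch that counts the trainable parameters of the two-layer stack at each width tried above. The helper count_parameters is made up for illustration.

```python
import torch.nn as nn

def count_parameters(width):
    stack = nn.Sequential(
        nn.Linear(28 * 28, width),
        nn.ReLU(),
        nn.Linear(width, 10),
        nn.Softmax(dim=1),
    )
    return sum(p.numel() for p in stack.parameters())

for width in (100, 200, 512, 1024):
    print(width, count_parameters(width))
```

The count is 795 * width + 10, so it grows linearly with the width, while the accuracy differences between these runs are a few tenths of a percent.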

GPU

I tried this on an instance with an A100, but got a PyTorch error saying it didn’t support the A100. Some googling suggested that

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

should work. I had some trouble getting this to work and ended up backing up and making a conda environment following this Stack Overflow answer:

conda create -n meta_learning_a100 python=3.9
conda activate meta_learning_a100

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

and that worked.
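A quick sanity check I could have run at this point, sketched here with standard torch.cuda queries. It only prints information and degrades gracefully on a machine without a GPU.

```python
import torch

print(torch.__version__)           # the installed build, e.g. one ending in +cu111
print(torch.version.cuda)          # CUDA version the build targets (None for CPU-only builds)
print(torch.cuda.is_available())

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))  # (major, minor) compute capability
    print(torch.cuda.get_arch_list())           # architectures this build was compiled for
```

If the GPU's compute capability is missing from the architecture list, the build does not support that card, which I assume is what the original error was complaining about.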

Then I hit this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument mat1 in method wrapper_addmm)

The issue is that the training and testing data don’t live on the GPU. We can fix that:

        X = X.to(device)
        y = y.to(device)
        pred = model(X)
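For context, here is a self-contained sketch of where those two lines sit in a training loop. It uses random stand-in data instead of MNIST and a single linear layer instead of the network above, both made up for illustration, and it falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for MNIST: 256 random "images" with random labels
fake_data = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
dataloader = DataLoader(fake_data, batch_size=64)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for X, y in dataloader:
    # The DataLoader yields CPU tensors; move each batch to the model's device
    X, y = X.to(device), y.to(device)
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(pred.device, loss.item())
```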

That did the trick. Let’s try training with a width of 5000:

self.linear_relu_stack = nn.Sequential(
    nn.Linear(28*28, 5000), # 28x28 is the MNIST image size
    nn.ReLU(), # element-wise operation
    nn.Linear(5000, 10),
    nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
)

This gives

Epoch 1
loss: 1.546615  [57600/60000]
 Accuracy: 94.1%, Avg loss: 1.523945 
Epoch 5
loss: 1.499782  [57600/60000]
 Accuracy: 97.4%, Avg loss: 1.489153 
Epoch 10
loss: 1.477566  [57600/60000]
 Accuracy: 98.0%, Avg loss: 1.483557

That’s better. If we put that on paperswithcode we’d rank between ProjectionNet and Tsetlin Machine.

Okay but what if the layers are really wide?

Now trying a width of 50,000. The accuracy in epoch 1 is bad (55%) but it improves fast after that, and reaches 98.1% by epoch 10. So not much improvement over the much narrower models.

Next, 5000 width, batch size of 256, and 20 epochs. Accuracy reaches 97.6%, gradually climbing with epoch.

Next, 50,000 width, batch size of 256, and 50 epochs. Accuracy reaches 98.2% after 27 epochs and sits there for the rest of training.

I think I’ll call it a day here.