Working Notes - Learning PyTorch - Day 2
Apr 12, 2022 17:17 · 2792 words · 14 minute read
Summary
- Adding convolutional layers is pretty straightforward, but there are some indexing subtleties to watch.
- Pooling layers can really improve performance.
- With two pooling layers, I found an example of a too-high learning rate making the model get stuck with poor performance.
- The more pooling layers I used, the faster each training step ran (because the dense linear layer was smaller).
- The more pooling layers I used, the more important the number of channels in the convolutional/pooling layers became.
- Working with text requires a way to turn that text into vectors in a continuous space, and one-hot vectors are one solution to that problem.
- RNNs let you study sequences, accumulating information about the sequence as they ‘sweep’ through it.
- Even though the processing that forms the hidden state is linear, there’s a notion of causality in an RNN: earlier entries get passed through the linear operator that forms the hidden state more times than later entries (and each letter only gets combined with the subsequent letters in determining its contribution to the hidden state).
Convolutional Neural Networks
I’d like to try MNIST with a CNN.
PyTorch has a 2D convolution layer (here):
nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding="same"),
This takes something of shape (N, in_channels, height, width) and returns something of shape (N, out_channels, height, width). (The input and output height and width match because we chose padding="same"; with other choices they can be different.)
The MNIST input has shape (N,1,28,28), so this’ll work just fine.
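As a quick shape check (a minimal sketch; the channel counts are just illustrative):

import torch
from torch import nn

conv = nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same")
x = torch.zeros(10, 1, 28, 28)  # (N, in_channels, height, width)
print(conv(x).shape)            # torch.Size([10, 4, 28, 28]) -- "same" padding preserves height/width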
The one change we need is to increase the size of the linear layer from 28*28 to out_channels*28*28, and we need to flatten the output of the Conv2d layer to provide an input to the linear layer (note that nn.Flatten() ignores the first dimension by default, assuming that’s the batch/sample dimension, which is exactly what we need here). The resulting spec is:
nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same"),
nn.Flatten(),
nn.Linear(4*28*28, 512), # 28x28 is the MNIST image size
nn.ReLU(),
nn.Linear(512, 10),
nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
Also, we need to remove the Flatten operation that used to come first, wrapping around all of this.
With that, the full spec is:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.stack = nn.Sequential(
            # Notice the dimension convention: input is on the left, output is on the right.
            # First is the number of input channels, 1 because the image is grayscale.
            # Second is the number of output channels; we pick 4 as a starting guess.
            # Then come the spatial width of the kernel, the stride, and the padding
            # ("same" makes the output shape match the input shape).
            nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same"),
            nn.Flatten(),
            nn.Linear(4*28*28, 512), # 28x28 is the MNIST image size
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
        )

    def forward(self, x):
        probabilities = self.stack(x)
        return probabilities
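As a quick sanity check on the shapes (a sketch; assumes the class above is in scope):

model = NeuralNetwork()
dummy = torch.zeros(8, 1, 28, 28)  # a fake batch of 8 MNIST-sized images
print(model(dummy).shape)          # torch.Size([8, 10]) -- one probability vector per image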
Subjectively this runs a bit slower than without the convolutional layer, but I haven’t timed it carefully to be sure (and on my laptop thermals could make a comparable difference…).
What about test/train accuracy/loss?
Epoch 1
loss: 1.523330 [57600/60000]
Test Accuracy: 90.7%, Avg loss: 1.553831
Epoch 2
loss: 1.511761 [57600/60000]
Test Accuracy: 94.8%, Avg loss: 1.513175
Epoch 3
loss: 1.523650 [57600/60000]
Test Accuracy: 94.2%, Avg loss: 1.518331
Epoch 10
loss: 1.538825 [57600/60000]
Test Accuracy: 91.2%, Avg loss: 1.548623
This is a lot worse than what we had without the conv layer!
Pooling
What about pooling (to get rid of unimportant small-scale details)? We can do that with
nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same"),
nn.MaxPool2d(2, stride=2, padding=0), # Reduces the spatial dimensions each by half
nn.Flatten(),
nn.Linear(4*14*14, 512), # The 28x28 MNIST image is halved to 14x14 by the pooling layer
The MaxPool2d layer returns the maximum of each $k\times k$ block. This is applied like a convolution, but if we set the stride to $k$ we can make the blocks non-overlapping, reducing the spatial scale by a factor of $k$.
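A quick shape check (a sketch, using the same sizes as above):

pool = nn.MaxPool2d(2, stride=2, padding=0)
x = torch.zeros(10, 4, 28, 28)  # the shape coming out of the Conv2d layer
print(pool(x).shape)            # torch.Size([10, 4, 14, 14]) -- each spatial dimension halved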
This runs faster, and seems to do better too!
Epoch 1
loss: 1.514428 [57600/60000]
Test Accuracy: 93.6%, Avg loss: 1.524622
Epoch 2
loss: 1.522758 [57600/60000]
Test Accuracy: 96.2%, Avg loss: 1.498265
Epoch 5
loss: 1.492536 [57600/60000]
Test Accuracy: 96.4%, Avg loss: 1.496530
Epoch 10
loss: 1.476787 [57600/60000]
Test Accuracy: 97.8%, Avg loss: 1.482338
I’m guessing this works by imposing a prior that some of the small details don’t matter, and possibly by reducing the number of parameters in the linear layer, but I’m not really sure.
Given that that worked, what if we do it again?
nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same"),
nn.MaxPool2d(2, stride=2, padding=0),
nn.Conv2d(4, 4, kernel_size=3, stride=1, padding="same"),
nn.MaxPool2d(2, stride=2, padding=0),
nn.Flatten(),
nn.Linear(4*7*7, 512), # The two pooling layers cut the 28x28 MNIST image down to 7x7
nn.ReLU(),
nn.Linear(512, 10),
nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
That gives:
Epoch 1
loss: 1.556487 [57600/60000]
Test Accuracy: 93.4%, Avg loss: 1.527201
Epoch 2
loss: 1.595740 [57600/60000]
Test Accuracy: 92.7%, Avg loss: 1.534411
Epoch 3
loss: 2.242401 [57600/60000]
Test Accuracy: 10.3%, Avg loss: 2.357847
That went off the rails, and doesn’t recover in later epochs. Could it be a learning rate issue? What happens if we lower the learning rate to 1e-1?
Epoch 10
loss: 1.507247 [57600/60000]
Test Accuracy: 97.6%, Avg loss: 1.486762
Much better! And the behaviour of the test accuracy was monotonic too, which is nice.
How about a rate of 3e-1?
Epoch 10
loss: 1.485193 [57600/60000]
Test Accuracy: 97.8%, Avg loss: 1.482973
Even better! I’ll stick with 3e-1 for now.
Even More Pooling
How far does pooling scale? Let’s add another conv2d layer and another maxpool layer. While we’re at it, let’s reorganize a little to make the structure clearer:
self.conv_stack = nn.Sequential(
    # Notice the dimension convention: input is on the left, output is on the right.
    # First is the number of input channels, 1 because the image is grayscale.
    # Second is the number of output channels; we pick 4 as a starting guess.
    # Then come the spatial width of the kernel, the stride, and the padding
    # ("same" makes the output shape match the input shape).
    nn.Conv2d(1, 4, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
    nn.Conv2d(4, 4, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
    nn.Conv2d(4, 4, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
)
self.relu_stack = nn.Sequential(
    nn.Flatten(),
    nn.Linear(4*3*3, 4*3*3), # Three rounds of 2x pooling take the 28x28 MNIST image down to 3x3
    nn.ReLU(),
    nn.Linear(4*3*3, 10),
    nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
)
Note that I made the linear layers much smaller… I don’t think there’s any point in having an internal linear layer whose output is wider than its input (but maybe I’m missing something?).
As an aside, I figured out 4x3x3 experimentally (just being lazy):
print(NeuralNetwork().conv_stack(torch.zeros((10,1,28,28))).shape)
This prints torch.Size([10, 4, 3, 3]). Here 10 stands in for the batch size, so 4x3x3 is the size of the input to the linear layer.
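An alternative to hard-coding the number would be a small helper along these lines (my own sketch; flattened_size isn’t a PyTorch function):

def flattened_size(module, input_shape=(1, 1, 28, 28)):
    # Push a dummy batch through the module and count the features per sample.
    with torch.no_grad():
        out = module(torch.zeros(input_shape))
    return out.numel() // out.shape[0]

# flattened_size(NeuralNetwork().conv_stack) == 4*3*3 == 36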
This model takes training steps noticeably faster than the previous one, which I’m guessing is because the linear layer is much smaller. Training with a learning rate of 3e-1 gives non-monotone behavior and looks like it gets stuck, so I lowered the learning rate to 1e-1. That gives a similar result, with accuracy in both cases stuck in the mid-80% range.
So something is being thrown out by this aggressive pooling.
More Channels
What if we keep the many pooling layers but widen them (with more channels)? Let’s try 8 channels:
self.conv_channels = 8
self.conv_stack = nn.Sequential(
    # Notice the dimension convention: input is on the left, output is on the right.
    # First is the number of input channels, 1 because the image is grayscale.
    # Second is the number of output channels, now self.conv_channels instead of the hard-coded 4.
    # Then come the spatial width of the kernel, the stride, and the padding
    # ("same" makes the output shape match the input shape).
    nn.Conv2d(1, self.conv_channels, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
    nn.Conv2d(self.conv_channels, self.conv_channels, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
    nn.Conv2d(self.conv_channels, self.conv_channels, kernel_size=3, stride=1, padding="same"),
    nn.MaxPool2d(2, stride=2, padding=0),
)
self.relu_stack = nn.Sequential(
    nn.Flatten(),
    nn.Linear(self.conv_channels*3*3, self.conv_channels*3*3), # Three rounds of 2x pooling take the 28x28 image down to 3x3
    nn.ReLU(),
    nn.Linear(self.conv_channels*3*3, 10),
    nn.Softmax(dim=1), # Softmax turns this from logits into probabilities.
)
With a learning rate of 3e-1, this gives much better performance, ending at 97% by Epoch 5.
If we go to 9 channels the performance is worse, at 89% after epoch 5. I’m worried this is just a learning rate issue generating noise in the output though. So I’m trying again with a learning rate of 1e-1:
| Channels | Learning Rate / End Epoch | Test Accuracy at End Epoch |
|---|---|---|
| 5 | 3e-1 / 5 | 85% |
| 6 | 3e-1 / 5 | 96% |
| 8 | 3e-1 / 5 | 97% |
| 9 | 3e-1 / 5 | 89% |
| 9 | 1e-1 / 10 | 89% |
| 16 | 3e-1 / 5 | 98% |
I’m confused as to what’s going on here with width 9. Why should its performance be worse than with 8? But I think I should probably move on.
Naively I thought 16 wouldn’t do better than 9. I figured there are only 9 inputs to the convolutional layers, so having more than 9 output channels wouldn’t help. But that’s not right: the first conv2d layer has a kernel mapping input (1,3,3) to output (9,) but the next one maps input (9,3,3) to output (9,), so there is some summarizing happening and that can be alleviated by adding more channels.
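For concreteness, here are the kernel shapes that reasoning refers to (a quick check with 9 channels):

c1 = nn.Conv2d(1, 9, kernel_size=3, padding="same")
c2 = nn.Conv2d(9, 9, kernel_size=3, padding="same")
print(c1.weight.shape)  # torch.Size([9, 1, 3, 3]) -- each output channel sees a (1, 3, 3) patch
print(c2.weight.shape)  # torch.Size([9, 9, 3, 3]) -- each output channel summarizes a (9, 3, 3) patch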
Recurrent Neural Networks
RNNs are used to analyze sequences. Just as CNNs get ‘swept’ over images, RNNs get ‘swept’ over a sequence. A key difference is that RNNs accumulate their results as they go: they have an internal state that gets modified at each step.
Example: PyTorch Tutorial on Name Classification
The basic idea is we’re going to treat names as sequences of characters. We’ll sweep an RNN over each name and interpret the final output as the probability distribution over languages-of-origin.
I’ll omit the data loading bits copied directly from the tutorial. Those are mostly about parsing the data into a dictionary, category_lines, keyed by the languages and containing arrays of names from those languages.
The way we’ll represent data is as a ‘one-hot vector’, which has length equal to the number of possible characters and a 1 in the spot corresponding to the actual character that appears (zeros everywhere else). So e.g. a -> (1,0,0,...), b -> (0,1,0,...), and so on.
Words/names are then shaped as (length, 1, n_possible_characters). The extra 1 is a batch dimension (the tutorial processes one name at a time, so the batch size is 1). We parse into this shape using:
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor
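Here n_letters and letterToIndex come from the data loading code I omitted; roughly, they’re built from a string of allowed characters, something like:

import string

all_letters = string.ascii_letters + " .,;'"  # roughly the alphabet the tutorial uses
n_letters = len(all_letters)

def letterToIndex(letter):
    return all_letters.find(letter)

print(lineToTensor('ab').shape)  # torch.Size([2, 1, n_letters])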
The RNN we’ll use (following the tutorial) is formed of a linear operator that mixes the input and hidden state into an output and a new hidden state. A softmax is then applied to the output (but not to the hidden state). This is important: the hidden state just accumulates linearly, which prevents gradients from decaying with ‘distance’ from the training loss (e.g. we don’t want the gradient with respect to the coefficients that touch the first letter to vanish, because then the loss can’t be made smaller by adjusting the response to that first letter).
It’s easier to code the linear layer as two separate linear operators, one producing the output and one updating the hidden state, so we write it as
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size) # Updates hidden
        self.i2o = nn.Linear(input_size + hidden_size, output_size) # Makes output
        self.softmax = nn.LogSoftmax(dim=1) # Only applied to output

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
(this is just following the tutorial)
By using LogSoftmax we’re making the output live in (-inf,0), which makes it natural to think of the output as the log-probability of each category.
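As a quick check (my own sketch, not from the tutorial), exponentiating a LogSoftmax output recovers a probability distribution:

log_probs = nn.LogSoftmax(dim=1)(torch.randn(1, 5))
print(log_probs)              # all entries are negative (log-probabilities)
print(log_probs.exp().sum())  # tensor(1.0000) -- the probabilities sum to 1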
The way we’d do the sweep then is:
input = lineToTensor('Name')
hidden = torch.zeros(1, n_hidden)

for c in input:
    output, next_hidden = rnn(c, hidden)
    hidden = next_hidden
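Once the sweep finishes, the last output is the prediction. Picking the most likely category looks like this (a small sketch; all_categories is the list of language names from the data loading code I omitted):

guess_index = torch.argmax(output).item()  # index of the largest log-probability
print(all_categories[guess_index])         # the predicted language for 'Name'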
The tutorial suggests using a loss called NLL, which stands for negative log-likelihood (which fits with our interpretation of the output as log-probabilities). If we train with this loss we’ll be making a model that outputs a log-probability for each class.
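As a tiny sanity check of what NLLLoss computes (my own sketch, not from the tutorial): it just picks out minus the log-probability that the model assigned to the true class.

log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))  # pretend model output for one sample
target = torch.tensor([0])                              # the true class is class 0
print(nn.NLLLoss()(log_probs, target))                  # tensor(0.3567), i.e. -log(0.7)

That’s fine, so our training code looks like: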
### Training
criterion = nn.NLLLoss()
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    rnn.zero_grad()
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()
    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)
    return output, loss.item()
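Now we can train. We’ll do this (per the tutorial) by picking random samples from the dataset and passing them to train. A sketch of the driver loop (randomTrainingExample() comes from the data loading code I omitted and returns (category, line, category_tensor, line_tensor)):

n_iters = 100000
for it in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    # Every few thousand iterations the tutorial prints the current loss and guess.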
That gives output like:
5000 5% (0m 3s) 2.4513 Richelieu / Irish ✗ (French)
10000 10% (0m 6s) 2.8062 Haanraats / Portuguese ✗ (Dutch)
15000 15% (0m 9s) 1.5918 Vasyukevich / Russian ✓
20000 20% (0m 12s) 1.0102 Tivoli / Italian ✓
25000 25% (0m 15s) 2.2062 Durand / Italian ✗ (French)
30000 30% (0m 18s) 1.7089 Silveira / Czech ✗ (Portuguese)
35000 35% (0m 21s) 1.9493 Taylor / Russian ✗ (Scottish)
40000 40% (0m 24s) 1.6953 Stoppelbein / Dutch ✗ (German)
45000 45% (0m 27s) 1.0181 Herten / Dutch ✓
50000 50% (0m 30s) 0.6284 Paszek / Polish ✓
55000 55% (0m 34s) 0.9961 Grozmanova / Czech ✓
60000 60% (0m 37s) 1.9475 Hofwegen / German ✗ (Dutch)
65000 65% (0m 40s) 0.7656 Johnstone / Scottish ✓
70000 70% (0m 43s) 2.3238 Pinho / Japanese ✗ (Portuguese)
75000 75% (0m 46s) 0.9659 Oomen / Dutch ✓
80000 80% (0m 49s) 0.0615 Khoury / Arabic ✓
85000 85% (0m 51s) 1.0259 Bertsimas / Greek ✓
90000 90% (0m 54s) 0.3386 Cloutier / French ✓
Incidentally, an important feature of this loss function (as opposed to, e.g., a binary score based on whether the classification was correct or not) is that it’s differentiable with non-zero derivatives, which allows backprop to gradually lower the loss. [This sounds obvious, and it is, I just wanted to note it…]
Test Accuracy
Let’s build a system for evaluating test accuracy.
We do this by splitting off a test set ahead of time:
test_set = set()
while len(test_set) < num_test:
    test_set.add(tuple(randomTrainingExample()))

for ex in test_set:
    category, line, _, _ = ex
    category_lines[category].remove(line)

def test_accuracy():
    score = 0
    for ex in test_set:
        category, line, category_tensor, line_tensor = ex
        hidden = rnn.initHidden()
        for i in range(line_tensor.size()[0]):
            output, hidden = rnn(line_tensor[i], hidden)
        ind = torch.argmax(output)
        if category_tensor == ind:
            score += 1
    return score
Adding the score to the output (right after the loss) we find that it does improve with training!
5000 5% (0m 3s) 2.4365 29 Svejda / Japanese ✗ (Czech)
10000 10% (0m 6s) 1.5297 26 Kozlowski / Polish ✓
15000 15% (0m 9s) 2.0299 33 Alfero / Portuguese ✗ (Italian)
20000 20% (0m 12s) 1.6695 41 Sienkiewicz / Czech ✗ (Polish)
25000 25% (0m 15s) 2.6818 37 Papageorge / Irish ✗ (Greek)
30000 30% (0m 18s) 1.4853 46 Porto / Portuguese ✗ (Italian)
The score rises from 29 with no training to 46 after some brief training.
Experimenting with state size
The best test accuracy I got was in the mid-50% range. Let’s vary the hidden state size a bit:
| Size | Score after 100k training steps |
|---|---|
| 32 | 55 |
| 64 | 53 |
| 128 | 49 |
| 256 | 53 |
Huh. No real trends here.
RNN with a non-linear step
What if we throw a nonlinear layer between the input and the hidden state? We need to keep the hidden state linear in the previous hidden state so the gradients don’t decay, which takes a bit of rewriting:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU()
        )
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(input) + self.h2h(hidden)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
Here there’s a layer i2h that takes the input and turns it into something that contributes to the hidden state (through a linear+ReLU stack), and there’s a layer h2h that takes the hidden state and linearly modifies it to form a contribution to the new hidden state. We add the outputs of these together to get the new hidden state.
This performs similarly:
| Size | Score after 100k training steps |
|---|---|
| 32 | 49 |
| 256 | 58 |
What about with more nonlinearity?
self.i2h = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU()
)
This gives similar performance, so I’m going to move on…