A comment on the structure of this blog post

In general, I have attempted to follow a consistent structure in writing this blog - that is write a summary of each chapter from the Fastai Book with my commentary and, ideally, include a code sample relating to the topics under discussion.

This time I’m doing something different. The coding exercise is taking considerable time to work out so I will discuss what I have come up with so far rather than including a fully complete project. In the (hopefully not too distant) future I will post a full solution.

Code for my project can be found here: link

If you want to try it out yourself download data from here: link

Concept overview - The Weighting Game

This chapter consolidates the first couple of chapters, in particular how to structure a programme to implement Machine Learning.

In Chapter 4, Howard and Guggen focus on the example of a programme designed to recognise handwritten digits (just threes and sevens).

The challenge in authoring a programme that can understand handwriting is a well understood problem. Well at least after the Newton was released…

The MNIST database, which was authored in the late 1990s, was created for exactly this purpose. Using MNIST as a basis Howard and Guggen demonstrate how features from NumPy, PyTorch and of course Fastai, can be combined to solve this problem.

The crucial point is the programme must be written in such a way that it can improve itself.

To make this point explicit, Howard and Guggen begin by intentionally writing a somewhat successful, but incomplete programme. Their first idea is to look at thousands of examples of handwritten threes and sevens, to build up an ‘ideal’ concept of a three or seven, and then compare this ideal to new samples. If the sample under review is closer to a seven it is reasonable to predict it is a seven and likewise with a three.

I am glossing over exactly how a programme can derive this ideal, but their main point is that this programme will not be as successful as one that can take advantage of Machine Learning. The problem is that there is no opportunity for the programme to become more accurate without changing the training or sample data. Assuming the data-set and test sample remain the same, the accuracy of any predictions made will not increase, it will always be the same.

As an alternative Howard and Guggen then show how we might restructure this as a Machine Learning problem. Essentially we want to introduce a way for the model to become progressively more accurate’ although not so accurate that it has been overfitted to the training set.

This is achieved by structuring the data so that a weight can be applied to it and also adjusted so that predictions can become more accurate over time.

Initially the weight is likely to lead to inaccurate results - in fact, it can be an entirely random value at first. But through calculus we can review the loss i.e. how inaccurate the prediction is, of a particular weight, adjust this weight, test again and carry on in this way until we believe the model yields the most accurate result it can, bearing in mind the problem of overfitting.

The main thing here is we now have a situation where a programme can adjust the weight, or ‘learn’, and make more accurate predictions.

This is the conceptual structure as I understood it. Once again I am glossing over the exact mechanics of how this model is encoded but I will discuss this in the next section where I will discuss my coding sample.

Organising the data or why I need a GPU…

Turning to my coding project, my goal is to extend Howard and Guggen’s example by creating a model that could make predictions for all numbers from 0 - 9.

Working through this for a few weeks I’ve been surprised to find that the most challenging aspect is not the actual Mathematics - as it happens, I have yet to perform any Math operations on my data - but rather sourcing and organising the data in such a format that I could implement the calculations.

To begin with, although the MNIST data is freely available online, being written more than two decades ago, it is in a data format that I could not possibly begin to decipher. Fortunately, I was able to find an alternative source on Kaggle that presents the same data in CSV format.

I don’t believe using this source goes against the spirit of the exercise as the point is to understand how Machine Learning works today and not the late 90s. However, if you want a real challenge the data is out there so give it a try.

The CSV I was able to obtain is quite different from the sources Howard and Guggen used. Their solution appears to contain jpeg images of 3s and 7s, accompanied by a CSV file labelling each image with its corresponding number label.

From what I understand they could use PyTorch to create tensors of ‘stacked’ threes or sevens (similar to the ideal notion mentioned above) and then apply a weight and bias to these tensors. The weights are then adjusted at regular increments until a sufficient level of accuracy has been achieved.

Though not 100% accurate, the model would be overfitted where that the case, they are nevertheless much more accurate than the ideal concept mentioned above.

Turning to my data, there are no images included with the CSV file rather it consists of a column of number labels (from 0-9) and the pixel values, written as one long column. Think of it as taking the 28x28 image and then rearranging it into one long list of pixel values.

CSV representation of Handwritten images

CSV representation of Handwritten images

Visualisation of number from data set

You can see my attempt to wrangle this data in this Google colab notebook. Essentially I have been able to write a function that can rearrange the data into dictionaries that contain the number labels as keys and the pixel values as values. I had hoped to go through the entire CSV file, separate the labels and pixel values in this way into a list, which I could then use to merge the pixel values for each number ultimately into stacked tensors and then perform the math operations on them.

However, the problem is that generating a list of dictionaries for 6000 entries is not performant.

At this point, I will have to go back and reevaluate my approach. I suspect if I can arrange the CSV file as an array or tensor earlier and manipulate them I may be able to get around this performance roadblock since array or tensor operations use a C API and can be done with GPUs.

I will update this post once I’ve attempted this.

Chapter Lecture