Comment on CH4 - Part 2 | Quick Learning

Intro

A follow up post to my review of Chapter 4 of Deep Learning for Coders. As a reminder, my goal was to create a model that could recognise handwritten digits (0-9) based on the MNIST data set. I feel that I have been partially successful in meeting my goal so an update seems apt.

I say partially as my initial idea was to, as far as possible, write a solution with my own code. The solutions I have here (there are two) are based on example using PyTorchmodules.

Nevertheless, I think this discussion will prove illuminating. First, as PyTorch is a well-supported Python module - you would do much better plugging it in your code than using any solution I wrote for my own learning. Second, in comparing the PyTorch based solutions with my initial attempt I feel I have learnt a great deal more about data structures. I believe any reader who wants to understand Deep Learning should also consider this.

With this update post then I will briefly comment on where my initial attempt went wrong, how the solutions I have work, and end with some thoughts on where I could go from here with this project.

Rather than go over the details of what the MNIST database is and how I used it, if you have not already, please review my first post and you can see my code for it here: link

Also this post is going to be far more jargon heavy than my typical write-ups. I would like to eventually write a glossary for the terms used here (and elsewhere in my blog for that matter). But for now I’ve included links to relevant external sites.

The initial attempt

In summary, my initial attempt failed because I could not structure the data in such a way for a model to perform the mathematical operations required to predict what number a handwritten digit represents.

While I could make arrays of the pixel values for the numbers and associate these arrays with their label in a dictionary - the label being the key and the pixels values the values. I could not find a way to parse the CSV file and merge all the arrays of pixel values per their digit label i.e. 0-9.

What PyTorch brings is several built-in techniques that can arrange the data such that said operations work.

The solutions

Currently I have two programmes that generate models capable of recognising handwritten digits. The first one uses the MNIST database that PyTorch includes natively. It is based on the tutorial they provide here. Initially, I had thought to just discuss this solution but then I came across this alternative take: link, which I then adapted. This version is based on the same MNIST CSV file I used in previous attempt. I think it is worth comparing these two as, somewhat surprisingly, the second programme yielded a more accurate model that requires few epochs of training.

I am puzzled by this as PyTorch’s own solution, which actually uses original image files, is less accurate than the second solution, which uses numbers representing the grey scale pixel hues of these images and then recombines into images. I’ll speculate on why this might be the case in the next section.

accuracy of solution 1 after 20 epochs training

accuracy of solution 2 after 3 epochs training

In any case, the real point here is to highlight why PyTorch works. Well, it works due to two features, namely that it can generate tensors and neural networks.

Tensors are a longstanding mathematical concept but in this context they are extremely similar to arrays just with some additional restrictions on their structure.

In the second solution, the programme works where my initial attempt didn’t as it stores training data in the form of a tensor rather than an array. Then, in both this and the first solution, the programme uses PyTorch’s DataLoader function to put tensors into batches that can be feed to the GPU for calculation. This is important as analysing individual tensors and labels would be extremely inefficient, although batch sizes cannot be too large either as this decreases accuracy.

With the Neural Networks, this enables the model to adjust the data by ‘weighting’ elements (for more information on this see my comments on CH1). By adjusting how the values are weighted, the model becomes either more or less accurate (obviously we want it to become more accurate).

Finding the most accurate weighting for the model is achieved through differentiation, by minimising the loss i.e. improving the accuracy by minimising the inaccuracy.

I’m glossing over a significant amount of the detail here. Again at some point I may write a glossary to explain what is going on. But for now I would encourage readers to checkout my code examples mentioned above and the various tutorials PyTorch have.

Final thoughts and next steps

One thing I do find odd is why the second model seems more accurate. Essentially it takes the original images, creates a CSV file of all the greyscale pixel hues of each image and the recombines these into images. Other than that the process is the same, yet this more convoluted approach appears to create a more accurate model.

My speculation here is that in converting the images into a CSV with discrete numbers, and the making new images from this file, a clearer overall image is made. Since the testing data is drawn from the same source, having images based on descrete numbers could create more precisie images for the training set.

It would be interesting then to see how these models perform in the real-world. I suspect the second model may become less accurate when tested on real life data since there is likely to be significant variation in handwriting, the writing implements use (ball pens, fountain pens, pencils) and image sensors photographing and processing the handwritten digits.

As for the next steps, I would encourage the reader to check out PyTorch’s material. Not only does their solution work but I suspect that even if I were to write one myself it would mostly like not be as user friendly.

However, I do feel that writing a solution myself would be an incredible learning exercise, possibly also requiring me to learn a more low level programing language.

So watch this space, I can’t see myself undertaking this exercise anytime soon but who knows what the future holds.