One of my quirks is building things from scratch. I guess that's my way of truly mastering a subject. So, I wrote a deep learning framework in NumPy, and in this series, I'll share what I learned.
In the first episode (Introduction to Computational Graphs), we learned

- what computational graphs are,
- and how they are constructed in practice.
Now, we continue the journey by using computational graphs to train neural networks. To follow along with the code, check out notebook one (the linear regression part) and notebook two (the neural network part)!
Training neural networks
Before diving deep into the details of computational graphs, we'll learn how to work with them. I know, you are probably eager to smash out your own backpropagation implementation, but it's best to hold your horses. Mastering technical concepts happens in three stages:
- understanding what you’re working with,
- learning how to use it,
- and finally, exploring how it works internally.
So, it's time to put computational graphs to work. Let's train some models!
Just like with classical machine learning models such as k-nearest neighbors or random forests, we want a convenient interface for our model.
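In essence, the class looks something like this. (A simplified sketch: I'm assuming the `Scalar` class from the previous episode supports the usual arithmetic operators and exposes a `value` attribute plus a `grad` attribute filled in during backpropagation.)

```python
import numpy as np


class Linear:
    def __init__(self):
        # the two parameters of the model h(x) = a*x + b, wrapped in Scalar
        # objects so that operations on them build a computational graph
        self.a = Scalar(np.random.rand())
        self.b = Scalar(np.random.rand())

    def forward(self, x):
        # the computational graph that defines the model
        return self.a * x + self.b

    def __call__(self, x):
        # convenience shortcut to forward
        return self.forward(x)

    def parameters(self):
        # collects the parameters into a dictionary (manually, for now)
        return {"a": self.a, "b": self.b}

    def gradient_update(self, lr):
        # a single gradient descent step on each parameter
        for param in self.parameters().values():
            param.value -= lr * param.grad
```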
Let's go through the methods one by one:
- `Linear.forward` describes the computational graph that defines the model,
- `Linear.__call__` is a shortcut to `Linear.forward`,
- `Linear.parameters` stores the model parameters in a dictionary, and
- `Linear.gradient_update` performs a step of gradient descent.
As you can see, there are several downsides to this implementation. For instance, `Linear.parameters` requires us to add every parameter manually, one by one. Trust me, we'll fix these early issues in due time.
For now, let's see `Linear` in action!
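Something along these lines (the actual numbers will differ, since the parameters are initialized randomly):

```python
model = Linear()

prediction = model(7)   # calls Linear.forward under the hood
print(prediction)       # a Scalar holding a*7 + b
```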
Let’s check the parameters of the model:
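With the sketch above, that's just:

```python
print(model.parameters())
# a dictionary of Scalar objects, e.g. {'a': Scalar(...), 'b': Scalar(...)}
```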
This is a good time to highlight that scalar operations work with vanilla number types, like integers, floats, whatever. It's just for convenience, saving us from typing `Scalar(...)` all the time.
To train this simple model, we'll generate a toy dataset from the target function h(x) = 0.8x - 1.2.
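The generating code is nothing fancy; roughly like this, with the sample size and noise level being arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

X = rng.uniform(-1, 1, size=100)                  # inputs
Y = 0.8 * X - 1.2 + 0.1 * rng.normal(size=100)    # noisy targets from h(x) = 0.8x - 1.2
```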
Here’s the plot of our data.
We'll also need a loss function. Let's go with the simplest one: the mean squared error, defined by the formula

$$\mathrm{MSE}(\mathbf{x}, \mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2,$$

where 𝐱 ∈ ℝᴺ is the vector of predictions, while 𝐲 ∈ ℝᴺ is the vector of ground truths. Here's the implementation.
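A version along these lines does the job; it only relies on subtraction, multiplication, and addition, so it works on plain numbers and `Scalar` objects alike:

```python
def mean_squared_error(predictions, ground_truths):
    # plain Python, but since Scalar overloads +, -, and *,
    # calling this on Scalar objects builds a computational graph of the loss
    errors = [
        (y_pred - y_true) * (y_pred - y_true)
        for y_pred, y_true in zip(predictions, ground_truths)
    ]
    return sum(errors) * (1 / len(errors))
```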
Again, notice that this is a vanilla Python function. Still, it can operate on our `Scalar` objects, turning computations into graphs.
We now have everything we need to train our model with gradient descent!
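The training loop can be as simple as the sketch below. I'm assuming that `Scalar` provides a `backward` method that populates each node's `grad` attribute, and that gradients are reset between steps (either inside `backward` or by a separate call); the details depend on your implementation.

```python
model = Linear()

for step in range(100):
    # forward pass: build the computational graph of the loss
    predictions = [model(x) for x in X]
    loss = mean_squared_error(predictions, Y)

    # backward pass: backpropagate gradients through the graph
    loss.backward()

    # gradient descent step with a fixed learning rate
    model.gradient_update(lr=0.1)
```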
Here's the result.
Looks good! The parameters seem close to those of the target function h(x) = 0.8x - 1.2. Here's the plot.
Saving and loading parameters
Let's go back to square one. What if we need to interrupt our training and continue it later? Or load a pre-trained model?
These are everyday situations in machine learning, especially with models that have billions of parameters. Because of that, we'll add a couple of methods to save and load weights.
(Note. For the sake of clarity, we implement classes iteratively, sometimes method by method. I wrote this post in Jupyter Notebook, and to avoid redundancy, I subclass instead of repeating the full class definition.)
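I won't reproduce my exact implementation here, but the idea is simple: expose the parameter dictionary and allow overwriting its values. A sketch, with the method names and the `Scalar.value` attribute as assumptions:

```python
class Linear(Linear):
    # subclassing the earlier Linear to add new methods, as explained in the note above

    def save_parameters(self):
        # returns the parameter dictionary; note that it holds
        # references to the live Scalar objects, not copies
        return self.parameters()

    def load_parameters(self, params):
        # overwrites the current parameter values in place
        for name, param in params.items():
            self.parameters()[name].value = param.value
```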
Let’s try them out right away!
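For example, saving is just grabbing the parameter dictionary:

```python
model = Linear()
saved = model.save_parameters()
print(saved)    # the current parameters, e.g. {'a': Scalar(...), 'b': Scalar(...)}
```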
Now, load some new parameters.
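For instance, we can plug in the parameters of the target function itself:

```python
model.load_parameters({"a": Scalar(0.8), "b": Scalar(-1.2)})
print(model.parameters())   # now reflects the loaded values
```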
Keep in mind that `Scalar` objects are mutable, and the parameter dictionary holds references to them; they don't behave like, say, a plain Python `float`. This means that if you save the parameter dictionary and then use gradient descent to tune the parameters, your saved parameter dictionary will change as well. Take a look.
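Sketched with the classes above, the surprise looks like this:

```python
model = Linear()
saved = model.save_parameters()     # holds references, not copies

# tune the parameters for a few steps
for _ in range(10):
    loss = mean_squared_error([model(x) for x in X], Y)
    loss.backward()
    model.gradient_update(lr=0.1)

print(saved)                # the "saved" parameters changed along with the model
print(model.parameters())   # identical to saved
```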
To avoid such issues, we can use the `Linear.copy_parameters` method.
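Its job is to return a snapshot that is decoupled from the live `Scalar` objects. A sketch:

```python
class Linear(Linear):
    def copy_parameters(self):
        # wrap the current values into fresh Scalar objects,
        # so later gradient updates don't affect the copy
        return {name: Scalar(param.value) for name, param in self.parameters().items()}
```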
Parameter saving and loading are useful if you want to visualize the model's state during training. Here's an example demonstrating how gradient descent fits a linear regression model.
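The trick is to record a decoupled snapshot after every gradient step, then plot the stored states afterwards; schematically:

```python
model = Linear()
history = [model.copy_parameters()]

for step in range(100):
    loss = mean_squared_error([model(x) for x in X], Y)
    loss.backward()
    model.gradient_update(lr=0.1)
    history.append(model.copy_parameters())   # snapshot of the current state
```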
Here’s a visualization of the learning process.
It's time to see an actual neural network!
Building a neural network
We start with a simple binary classification problem to demonstrate how neural networks work. Here's our (generated) dataset of two classes, encoded with 0 and 1.
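The generating code isn't important; think two noisy blobs, one per class, along these lines (the sizes and the spread are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# two Gaussian blobs in 2D, labeled 0 and 1
X0 = rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(50, 2))
X1 = rng.normal(loc=[1.0, 1.0], scale=0.7, size=(50, 2))

X_data = np.concatenate([X0, X1])
Y_data = np.concatenate([np.zeros(50), np.ones(50)])
```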
We'll start with the simplest neural network: zero hidden layers and sigmoid activation. This is also known as logistic regression. Here we go:
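A sketch of the model, following the same pattern as `Linear`. I'm assuming a `sigmoid` helper that operates on `Scalar` objects is available; if your implementation doesn't have one, treat it as a placeholder.

```python
import numpy as np


class LogisticRegression:
    def __init__(self):
        # one weight per input feature, plus a bias
        self.w = [Scalar(np.random.rand()), Scalar(np.random.rand())]
        self.b = Scalar(np.random.rand())

    def forward(self, x):
        # a single neuron: weighted sum of the inputs, squashed by a sigmoid
        z = self.w[0] * x[0] + self.w[1] * x[1] + self.b
        return sigmoid(z)   # sigmoid is assumed to work on Scalar objects

    def __call__(self, x):
        return self.forward(x)

    def parameters(self):
        return {"w0": self.w[0], "w1": self.w[1], "b": self.b}

    def gradient_update(self, lr):
        for param in self.parameters().values():
            param.value -= lr * param.grad
```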
Here are the decision boundaries of the untrained model, plotted on top of our dataset.
The initial model gets nothing right, so let's train it!
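Training works exactly like before. For simplicity, I'll reuse the mean squared error here (binary cross-entropy would be the more standard choice for classification); the step count and learning rate are knobs to play with:

```python
model = LogisticRegression()

for step in range(200):
    predictions = [model(x) for x in X_data]   # X_data, Y_data from the dataset sketch above
    loss = mean_squared_error(predictions, Y_data)
    loss.backward()
    model.gradient_update(lr=0.5)
```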
Here's how the model performs after training. Judging from the loss values, it's pretty good.
Solving a simple problem like that is no big deal. Can we handle more complex datasets?
A multi-layer network
Here's a spiral-like dataset with the two classes intertwined with each other. (Feel free to skip the generating code; it's irrelevant to our purposes. Hell, I'll even confess that it was generated by ChatGPT.)
A logistic regression model won't cut it this time. To solve this classification problem, we need a hidden layer. Here's a model with a hidden layer of eight neurons and a nonlinear activation function.
We can already see one of the glaring flaws of our `Scalar` implementation of computational graphs: the inability to write vectorized code. For instance, the expression

```python
fs = [sum([self.A[i][j] * x[i] for i in range(2)]) for j in range(4)]
```

is simply the matrix product of the input `x` and the 2 × 4 parameter matrix `A`.
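To see what I mean, here's the same computation with plain NumPy arrays: a single line, and dramatically faster, too.

```python
import numpy as np

x = np.random.rand(2)      # an input vector
A = np.random.rand(2, 4)   # the parameter matrix

fs = x @ A                 # the whole double loop collapses into one matrix product
```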
We'll deal with vectorization later with the `Tensor` class, but let's stick to the vanilla version for now. Here's our model, and here's how it looks untrained.
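I'll only sketch the model here; structurally, it's the same pattern as before, just with two layers of parameters and nested loops where matrix products should be. The hidden size and the `sigmoid` activation are placeholder choices.

```python
import numpy as np


class NeuralNetwork:
    def __init__(self, in_dim=2, hidden_dim=8):
        self.in_dim, self.hidden_dim = in_dim, hidden_dim
        # first layer: an in_dim x hidden_dim weight matrix plus hidden_dim biases
        self.A = [[Scalar(np.random.randn()) for _ in range(hidden_dim)] for _ in range(in_dim)]
        self.a = [Scalar(np.random.randn()) for _ in range(hidden_dim)]
        # second layer: hidden_dim weights plus a single bias
        self.B = [Scalar(np.random.randn()) for _ in range(hidden_dim)]
        self.b = Scalar(np.random.randn())

    def forward(self, x):
        # hidden layer: weighted sums followed by the activation
        hidden = [
            sigmoid(sum(self.A[i][j] * x[i] for i in range(self.in_dim)) + self.a[j])
            for j in range(self.hidden_dim)
        ]
        # output layer: another weighted sum, squashed into (0, 1)
        return sigmoid(sum(self.B[j] * hidden[j] for j in range(self.hidden_dim)) + self.b)

    def __call__(self, x):
        return self.forward(x)

    def parameters(self):
        params = {f"A_{i}{j}": self.A[i][j] for i in range(self.in_dim) for j in range(self.hidden_dim)}
        params.update({f"a_{j}": self.a[j] for j in range(self.hidden_dim)})
        params.update({f"B_{j}": self.B[j] for j in range(self.hidden_dim)})
        params["b"] = self.b
        return params

    def gradient_update(self, lr):
        for param in self.parameters().values():
            param.value -= lr * param.grad
```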
Let's train it! We'll need quite a few more steps this time. To spice things up, we'll also use a simple learning rate schedule: `lr=1` for the first hundred gradient descent steps, `lr=0.5` for the second hundred, and `lr=0.1` after that.
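In code, the schedule is just a small helper that picks the learning rate based on the step count; the total number of steps below is an arbitrary choice.

```python
def learning_rate(step):
    # piecewise schedule: 1 for the first 100 steps, 0.5 for the next 100, 0.1 after that
    if step < 100:
        return 1.0
    elif step < 200:
        return 0.5
    return 0.1


model = NeuralNetwork()

for step in range(300):
    # X_spiral and Y_spiral stand for the spiral inputs and labels generated above
    predictions = [model(x) for x in X_spiral]
    loss = mean_squared_error(predictions, Y_spiral)
    loss.backward()
    model.gradient_update(lr=learning_rate(step))
```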
This training took a while to run on my Lenovo ThinkPad; again, a consequence of the non-vectorized code. Is the model any good? Let's see.
Far from perfect, but we can already see that the decision boundary is starting to conform to the data. We need a more expressive model and more training iterations, which foreshadows the need for more efficient code; we'll get there with vectorization.
But let's not get ahead of ourselves! Next time, we'll see how `Scalar` works on the inside. Trust me on this: understanding plain scalar-valued computational graphs is paramount to building hyper-fast vectorized ones. We dial up the difficulty one notch at a time, and right now, the next step is digging deep into the forward pass.