In computer science, there is a stark contrast between how we think about mathematical structures and how we represent them inside a computer. In mathematics, our goal is to develop a framework that enables us to reason about the structure of data and its transformations effectively.
We want a language that is
- expressive,
- easy to speak,
- and as compact as possible.
With this under our belt, we can turn expressions such as ax + b into the building blocks of state-of-the-art neural networks.
However, our goals change when we do computations instead of pure logical reasoning. We want implementations that are
- easy to work with,
- memory-efficient,
- and fast to access, manipulate, and transform.
These are often contradictory requirements, and particular situations might prefer one over the other. For instance, if we have plenty of memory but want to perform heavy computations, we can sacrifice size for speed. Because of all the potential use cases, there are multiple formats to represent the same mathematical concepts.
These are called data structures, and in machine learning, the single most important one is the array, a ubiquitous concept that covers vectors, matrices, and even tensors.
First, to understand what an array is, let’s look at the problems that motivate its existence.
Meet the tuples!
(Note: there’s a theoretical counterpart of this post. It’s not necessary to read that, though; both are self-contained. However, I do recommend giving it a read to enhance your understanding of vectorization.)
Tuples
Different programming languages implement arrays differently. Because Python is ubiquitous in data science and machine learning, it'll be our language of choice. There will be a ton of code, which you can follow/play around with in a Jupyter Notebook.
In standard Python, two built-in data structures can be used to represent vectors: tuples and lists. Let’s start with tuples! They can be defined simply by enumerating their elements between parentheses, separated by commas.
A single tuple can hold elements of various types. Even though we’ll exclusively deal with floats in computational linear algebra, this property is extremely useful for general purpose programming.
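For instance (the variable names here are just illustrative):

```python
v_tuple = (1.0, 2.0, 3.0)    # a tuple of floats
mixed = (42, "hello", 3.14)  # a tuple holding mixed types
```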
We can access the elements of a tuple by indexing. Just like in (almost) all other programming languages, indexing starts from zero. This is in contrast with mathematics, where we often start indexing from one. (Don’t tell this to anybody else, but it used to drive me crazy. I am a mathematician first.)
The number of elements in a tuple can be accessed by calling the built-in len function.
Besides indexing, we can also access multiple elements by slicing.
Slicing works by specifying the first element (inclusive) and the last (exclusive), with an optional step size, using the syntax object[first:last:step].
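For example:

```python
v_tuple = (1.0, 2.0, 3.0, 4.0, 5.0)

v_tuple[0]     # 1.0, the first element (indexing starts from zero)
len(v_tuple)   # 5
v_tuple[1:4]   # (2.0, 3.0, 4.0), indices 1 up to (but not including) 4
v_tuple[::2]   # (1.0, 3.0, 5.0), every second element
```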
Tuples are rather inflexible, as you cannot change their components. Attempting to do so results in a TypeError, Python’s standard way of telling you that the object does not support the operation you are trying to perform (here, item assignment).
Besides that, extending the tuple with additional elements is also unsupported. As we cannot change the state of a tuple object in any way after it has been instantiated, tuples are immutable. Depending on the use case, immutability can be an advantage as well as a disadvantage. Immutable objects eliminate accidental changes, but each operation requires the creation of a new object, resulting in computational overhead. Thus, tuples are not going to be optimal for representing large amounts of data in complex computations.
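For instance:

```python
v_tuple = (1.0, 2.0, 3.0)

v_tuple[0] = 10.0
# TypeError: 'tuple' object does not support item assignment

v_tuple.append(4.0)
# AttributeError: 'tuple' object has no attribute 'append'
```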
This issue is resolved by lists. Let’s take a look at them, and the new problems they introduce!
Lists
Lists are the workhorses of Python. In contrast with tuples, lists are extremely flexible and easy to use, although this comes at the cost of runtime performance. Similarly to tuples, a list object can be created by enumerating its elements between square brackets, separated by commas.
Just like with tuples, accessing the elements of a list is done by indexing or slicing.
We can do all kinds of operations on a list: overwrite its elements, append items, or even remove others.
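For instance:

```python
v_list = [1.0, 2.0, 3.0]

v_list[0] = -1.0           # overwrite an element
v_list.append("a string")  # append an item, even of a different type
v_list.remove(2.0)         # remove an element by value

v_list                     # [-1.0, 3.0, 'a string']
```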
This example illustrates that lists can hold elements of various types as well. Adding and removing elements can be done with methods like append, insert, pop, and remove.
Before trying that, let’s quickly take note of the memory address of our example list, which can be accessed by calling the id function.
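Like this (the address below is from my run; yours will differ):

```python
v_list = [1.0, 2.0, 3.0]
id(v_list)   # e.g. 140039183034240
```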
This number simply refers to an address in my computer’s memory where the v_list object is located. (Quite literally: the snippets were made on my personal computer.)
Now, we are going to perform a few simple operations on our list and show that the memory address doesn’t change. Thus, no new object is created.
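A sketch of such a check:

```python
address = id(v_list)

v_list[0] = -1.0     # item assignment
v_list.append(4.0)   # extension
v_list.pop()         # removal

id(v_list) == address   # True: it is still the very same object
```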
Unfortunately, adding lists together yields a result that is completely different from what we expect. Instead of adding the corresponding elements together, as we would want vectors to behave, the lists are concatenated. This feature is handy when writing general-purpose applications, but not that well-suited for scientific computation. “Scalar multiplication” also has strange results, as we are about to see.
Multiplying a list with an integer repeats the list by the specified number of times.
Given the behavior of the + operator on lists, this seems logical, as multiplication is repeated addition:
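```python
[1.0, 2.0] + [3.0, 4.0]   # [1.0, 2.0, 3.0, 4.0], not [4.0, 6.0]
[1.0, 2.0] * 3            # [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]

[1.0, 2.0] * 3 == [1.0, 2.0] + [1.0, 2.0] + [1.0, 2.0]   # True
```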
Overall, lists can do much more than we need to represent vectors. Although we potentially want to change elements of our vectors, we don’t need to add or remove elements from them, and we also don’t need to store objects other than floats. Can we sacrifice these extra features and obtain an implementation that is suitable for our purposes yet has lightning-fast computational performance? Yes. Enter NumPy arrays.
NumPy arrays
Even though Python’s built-in data structures are amazing, they are optimized for ease of use, not for scientific computation. This problem was realized early in the language’s development and was addressed by the fantastic NumPy library.
One of the main selling points of Python is how fast and straightforward it is to write code, even for complex tasks. This comes at the price of speed. However, in machine learning, speed is crucial for us. When training a neural network, a small set of operations is repeated millions of times. Even a small percentage of improvement in performance can save hours, days, or even weeks in the case of extremely large models.
Regarding speed, the C language is at the other end of the spectrum. C code is hard to write but executes blazing fast when done correctly. As Python’s reference implementation, CPython, is written in C, a tried and true method for achieving fast performance is to call functions written in C from Python. In a nutshell, this is what NumPy provides: C arrays and operations, all in Python.
To get a glimpse into the deep underlying issues with Python’s built-in data structures, we should put numbers and arrays under our magnifying glass. Inside a computer’s memory, objects are represented as fixed-length 0-1 sequences. Each component is called a bit. Bits are usually grouped into chunks of 8, 16, 32, 64, or even 128. Depending on what we want to represent, identical sequences can mean different things. For instance, the 8-bit sequence 00100110 can represent the integer 38 or the ASCII character “&”.
By specifying the data type, we can decode binary objects. 32-bit integers are called int32 types, 64-bit floats are float64, and so on.
Since a single bit contains very little information, the memory is addressed by dividing it into 32- or 64-bit chunks and numbering them consecutively. Such an address is conventionally written as a hexadecimal number, starting from 0. (For simplicity, let’s assume that the memory is addressed in 64-bit chunks, as is customary in modern computers.)
A natural way to store a sequence of related objects (with matching data type) is to place them next to each other in the memory. This data structure is called an array.
By storing the memory address of the first object, say 0x23A0, we can instantly retrieve the k-th element by accessing the memory at 0x23A0 + k.
We call this the static array or often the C array because this is how it is done in the magnificent C language. Although this implementation of arrays is lightning fast, it is relatively inflexible. First, you can only store objects of a single type. Second, you have to know the size of your array in advance, as you cannot use memory addresses that overextend the pre-allocated part. Thus, before you start working with your array, you have to allocate memory for it. (That is, reserve space so that other programs won’t overwrite it.)
However, in Python, you can store arbitrarily large and different objects in the same list, with the option of removing and adding elements to it.
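For instance, a list like this is perfectly valid (a sketch; the original notebook’s exact list may differ):

```python
def f(x):
    return x + 1

# an integer needing 201 bits, a string, a float, and a function, all in one list
l = [2**200, "a string", 3.14, f]
```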
In the example above, l[0] is an integer so large that it doesn’t even fit into 128 bits. Also, there are all kinds of objects in our list, including a function. How is this possible?
Python’s list provides a flexible data structure by
- overallocating the memory,
- and keeping the memory addresses of the objects in the list instead of the objects themselves.

(At least in the most widespread CPython implementation.)
By checking the memory addresses of each object in our list l, we can see that they are indeed all over the memory.
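A quick way to check (the addresses below are made up for illustration; yours will differ):

```python
for x in l:
    print(hex(id(x)))

# 0x7f2a3c1d5e10
# 0x7f2a3c0f4b70
# 0x7f2a3c0f4c30
# 0x7f2a3c1a9d40
```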
Due to the overallocation, deletion or insertion can always be done simply by shifting the remaining elements. Since the list stores the memory address of its elements, all types of objects can be stored within a single structure.
However, this comes at a cost. Because the objects are not contiguous in memory, we lose locality of reference, meaning that when we frequently access distant locations of the memory, our reads are much slower. Thus, looping over a Python list is not efficient, making it unsuitable for scientific computation.
So, NumPy arrays are essentially the good old C arrays in Python, with the user-friendly interface of Python lists. (If you have ever worked with C, you know how big of a blessing this is.) Let’s see how to work with them!
First, we import the numpy library. (To save on characters, it is customary to import it as np.)
The main data structure is the np.ndarray, short for n-dimensional array. We can use the np.array function to create NumPy arrays from standard Python containers or to initialize them from scratch. (Yes, I know, this is confusing, but you’ll get used to it. Just take a mental note that np.ndarray is the class, and np.array is the function you use to create NumPy arrays from Python objects.)
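Like this:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])   # an ndarray, created from a Python list
type(v)                         # numpy.ndarray
```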
We can even initialize NumPy arrays with random numbers. Most importantly, when we have a given array, we can initialize another one with the same dimensions using the np.zeros_like, np.ones_like, and np.empty_like functions.
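For example:

```python
x = np.random.random(5)   # five uniform random numbers from [0, 1)

np.zeros_like(x)   # an array of zeros with the same shape and dtype as x
np.ones_like(x)    # the same, filled with ones
np.empty_like(x)   # the same shape, but uninitialized (its contents are arbitrary)
```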
Just like Python lists, NumPy arrays support item assignments and slicing.
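For instance:

```python
v = np.array([1.0, 2.0, 3.0, 4.0])

v[0] = -1.0   # item assignment works
v[1:3]        # array([2., 3.])
v[::2]        # array([-1., 3.])
```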
However, as expected, you can only store a single data type within each ndarray. When trying to assign a string as the first element, we get an error message.
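Something like this (the exact message depends on your NumPy version):

```python
v = np.array([1.0, 2.0, 3.0])

v[0] = "a string"
# ValueError: could not convert string to float: 'a string'
```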
As you might have guessed, every ndarray has a data type attribute that can be accessed at ndarray.dtype. If a conversion can be made between the value to be assigned and the array’s data type, it is performed automatically, making the item assignment successful.
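```python
v = np.array([1.0, 2.0, 3.0])
v.dtype    # dtype('float64')

v[0] = 42  # the integer 42 is automatically converted to the float 42.0
v[0]       # 42.0
```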
NumPy arrays are iterable, just like other container types in Python.
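```python
for x in np.array([1.0, 2.0, 3.0]):
    print(x)
```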
Are these suitable to represent vectors? Yes. Let’s see why!
NumPy arrays as vectors
Let’s talk about vectors once more. From now on, we are going to use ndarrays to model vectors.
The addition and scalar multiplication operations are supported by default and perform as expected.
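```python
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

u + v     # array([5., 7., 9.]), elementwise, just like vector addition
2.0 * u   # array([2., 4., 6.]), scalar multiplication
```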
Because of the dynamic typing of Python, we can (often) plug in NumPy arrays into functions intended for scalars.
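For instance, a function written with scalars in mind works on arrays elementwise:

```python
def f(x):
    return 3 * x**2 + 1

f(2.0)                    # 13.0
f(np.array([1.0, 2.0]))   # array([ 4., 13.])
```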
So far, NumPy arrays satisfy almost everything we require to represent vectors. There is only one box left to be ticked: performance. To investigate this, we measure the execution time with Python’s built-in timeit tool.
In its first argument, timeit takes the code to be executed and timed: either a function object or executable statements as a string. Since function calls have a significant overhead in Python, we pass code as strings rather than function objects to keep the time measurements precise. (We also want to exclude object instantiation from the measurements, so the objects are created during the setup.)
Below, we compare adding together two NumPy arrays vs. Python lists containing a thousand zeros.
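A sketch of the measurement (the repeat count is my choice, and the exact timings depend on your machine):

```python
from timeit import timeit

setup = """
import numpy as np
arr_1, arr_2 = np.zeros(1000), np.zeros(1000)
list_1, list_2 = [0.0] * 1000, [0.0] * 1000
"""

timeit("arr_1 + arr_2", setup=setup, number=10_000)
timeit("[a + b for a, b in zip(list_1, list_2)]", setup=setup, number=10_000)
```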
NumPy arrays are much, much faster. This is because they are
- contiguous in memory,
- homogeneous in type,
- and their operations are implemented in C.
This is just the tip of the iceberg: NumPy provides much more than a fast data structure.
Is NumPy really faster than Python?
NumPy is designed to be faster than vanilla Python. Is this really the case? Not all the time. If you use it wrong, it might even hurt performance! To know when it is beneficial to use NumPy, we will look at why exactly it is faster in practice.
To simplify the investigation, our toy problem will be random number generation. Suppose that we need just a single random number. Should we use NumPy? Let’s test it! We are going to compare it with the built-in random number generator by running both ten million times, measuring the execution time.
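A sketch of the comparison (timings are machine-dependent):

```python
from timeit import timeit

n_runs = 10_000_000

timeit("random.random()", setup="import random", number=n_runs)
timeit("np.random.random()", setup="import numpy as np", number=n_runs)
```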
For generating a single random number, NumPy is significantly slower. Why is this the case? What if we need an array instead of a single number? Will this also be slower?
This time, let’s generate a list/array of a thousand elements.
(Again, I don’t want to wrap the timed expressions in lambdas, since function calls have an overhead in Python. I want to be as precise as possible, so I pass them as strings to the timeit function.)
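Something like this:

```python
n_runs = 10_000

timeit("[random.random() for _ in range(1000)]",
       setup="import random", number=n_runs)
timeit("np.random.random(1000)",
       setup="import numpy as np", number=n_runs)
```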
Things are looking much different now. When generating an array of random numbers, NumPy wins hands down.
There are some curious things about this result as well. First, we generated a single random number 10 000 000 times. Second, we generated an array of 1000 random numbers 10 000 times. In both cases, we end up with 10 000 000 random numbers. Using the built-in method, putting the numbers in lists took roughly twice as long as generating them one by one. With NumPy, however, working with arrays gave a ~40x speedup compared to generating single numbers!
(Again, on my computer; the exact numbers will be different on yours, but the trend will stay the same.)
How is this possible?
Dissecting the code: profiling with cProfile
To see what happens behind the scenes, we are going to profile the code using cProfile. With this, we’ll see exactly how many times each function was called and how much time we spent inside it.
(To make profiling work from Jupyter Notebooks, we need to do some Python magic first. That’s what you see in the first part of the snippet.)
Let’s take a look at the built-in function first. In the following function, we create 10 000 000 random numbers, just as before.
In Jupyter Notebooks, where the code was written, cProfile can be called with the magic command %prun.
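A sketch of what this looks like (the function name is my own choice, not from the original notebook):

```python
import random

def generate_built_in(n_runs=10_000_000):
    for _ in range(n_runs):
        random.random()

# in a Jupyter cell:
%prun generate_built_in()
```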
There are two important columns here for our purposes: ncalls shows how many times a function was called, while tottime is the total time spent in a function, excluding time spent in subfunctions.
The built-in function random.random() was called 10 000 000 times as expected, and the total time spent in that function was 0.367 seconds. (If you are running the corresponding notebook locally, this number is going to be different.)
What about the NumPy version? The results are surprising.
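The NumPy counterpart of the sketch above:

```python
import numpy as np

def generate_numpy(n_runs=10_000_000):
    for _ in range(n_runs):
        np.random.random()

%prun generate_numpy()
```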
As before, the numpy.random.random() function was indeed called 10 000 000 times. Yet, the script spent significantly more time in this function than it previously did in the built-in random. Thus, it is more costly per call.
When we start working with large arrays and lists, things change dramatically. Next, we generate 10 000 lists/arrays of 1000 random numbers each, while measuring the execution time.
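A sketch of the two versions (again, the function names are mine):

```python
def generate_lists(n_runs=10_000, size=1000):
    for _ in range(n_runs):
        [random.random() for _ in range(size)]

def generate_arrays(n_runs=10_000, size=1000):
    for _ in range(n_runs):
        np.random.random(size)

%prun generate_lists()
%prun generate_arrays()
```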
As we see, about 60% of the time was spent on the list comprehensions: 10 000 calls, 0.579 s in total. (Note that tottime doesn’t count subfunction calls, like the calls to random.random() here.)
Now we are ready to see why NumPy is faster when used right.
With each of the 10 000 function calls, we get a numpy.ndarray of 1000 random numbers. The reason why NumPy is fast when used right is that once instantiated, its arrays are extremely efficient to work with. They are like C arrays instead of Python lists. As we have seen, there are two significant differences between them:
- Python lists are dynamic, so, for instance, you can append and remove elements. NumPy arrays have fixed lengths, so you cannot add or delete elements without creating a new array.
- Python lists can hold several data types simultaneously, while a NumPy array can only contain one.
So, NumPy arrays are less flexible but significantly more performant. When this additional flexibility is not needed, NumPy vastly outperforms Python.
Where is the break-even point?
To see precisely at which size NumPy overtakes Python in random number generation, we can compare the two by measuring the execution times for several sizes.
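A sketch of the experiment (the sizes and repeat counts are my choices; timings will vary):

```python
from timeit import timeit

for size in [1, 2, 5, 10, 20, 50, 100, 1000]:
    t_python = timeit(f"[random.random() for _ in range({size})]",
                      setup="import random", number=10_000)
    t_numpy = timeit(f"np.random.random({size})",
                     setup="import numpy as np", number=10_000)
    print(f"size = {size:>4}:  Python {t_python:.4f} s,  NumPy {t_numpy:.4f} s")
```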
Here are the numbers.
Around an array size of 10, NumPy starts to beat Python in performance. Of course, this number might be different for other operations like calculating the sine or adding numbers together, but the tendency will be the same. Python will slightly outperform NumPy for small input sizes, but NumPy wins by a large margin as the size grows.
Conclusion
In the mathematical counterpart of this post, we went from linear regression to the linear layer; from basic to state-of-the-art.
This time, we did the opposite: we went from cozy and comfortable Python lists to (C-style) arrays. You know the saying: less is sometimes more; if not more, certainly faster.
However, thanks to Python and NumPy, arrays are better than the original version. You’ll (probably) never see a dreaded segmentation fault when working with NumPy arrays, and you’ll also have access to a plethora of methods and functions to manipulate them.
In practice, we supercharge the performance even more with GPUs. Deep learning frameworks such as PyTorch and TensorFlow already do this. Without them, training deep neural networks wouldn’t be possible.
But that’s a topic for another day.