Quick update before we start:
The Palindrome lecture series kicks off on September 26th.
Join the annual plan to get access to the upcoming live courses on Neural Networks from Scratch and Mathematics of Machine Learning.
In our last article, we established that to properly analyze an algorithm, we need to answer three questions:
Is it correct?
How much does it cost?
And can we do better?
We proved that Detective Turing’s linear search was correct and optimal. In the process, we made some bold claims, casually stating that an algorithm that takes 3N steps is, for all intents and purposes, the same as one that takes N steps.
This might feel like cheating. How can we just ignore numbers in a mathematical discipline? We can’t—not without a very good reason and a solid framework. This article is about building that framework.
Let’s start with a story of two coders.
The Two Coders
The Mayor of Algolimpia has launched a public challenge. For centuries, the city has kept meticulous records of its daily Abacus Duels with the neighboring city of Graphtopia. The Mayor wants to know: what is the largest cumulative point lead Algolimpia has ever held over a continuous period of days? For this prestigious task, two brilliant Algolimpians present their solutions.
On one hand, we have Alice, an expert systems programmer. She gets straight to work, devising the most straightforward solution: she’ll check every possible start day, and for each of those, check every possible end day, summing up the duel results for each period. She spends the whole week implementing this in highly optimized C++, using all the tricks of the trade to make her program start almost instantly and perform every single operation blazingly fast.
On the other hand, we have Bob, a thoughtful algorithm designer known for spending more time at the whiteboard than at the keyboard. He spends the whole week thinking about this problem and trying different alternatives. In the end, he realizes he only needs to make a single, clever pass through the historical data. As he scans the records, he keeps a running total of the current period. His key insight is that if this running total ever drops below zero, it’s a “lost cause”—it can’t possibly contribute to a higher total in the future.
So, whenever he finds a running total smaller than zero, he simply discards it and starts a new period from the very next day. This simple rule allows him to find the overall maximum without ever looking back. After a week of chalkboarding, he vibe codes this idea in a couple of minutes in the laziest, most unoptimized Python you’ve ever seen. His program has a massive startup overhead, and every operation is about 100 times slower than in Alice’s. It is, by all conventional measures, a crappy piece of code.
When the time to test both algorithms comes, Alice’s algorithm starts to shine. It is blazingly fast. It beats Bob for all the small test runs. However, when they try the whole dataset—tens of thousands of data points—Alice’s program grinds to a halt, while Bob’s algorithm starts to look better and better the bigger the dataset gets. The judges are baffled. How can an unoptimized, lazy solution best a masterfully crafted one? The answer lies beyond the code itself, in the fundamental rate of growth of their chosen methods.
Analyzing the Algorithms
To better understand what we’re dealing with, let’s take a look at both algorithms in some detail. Let’s assume Alice coded an algorithm that looks like the following:
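Roughly speaking, and sketched here in Python for readability (in the story her real implementation is hand-tuned C++, and the function and variable names below are just illustrative):

```python
# A sketch of Alice's brute-force idea: try every (start, end) period and
# keep the best cumulative lead seen so far.
def max_lead_brute_force(duels):
    best = 0  # assume an empty period counts as a lead of 0
    n = len(duels)
    for start in range(n):           # every possible start day
        running = 0
        for end in range(start, n):  # every possible end day
            running += duels[end]    # cumulative lead over duels[start..end]
            best = max(best, running)
    return best
```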
Her algorithm is written in C++ and compiled with all known compiler optimizations, and then some she invented along the way. Let’s just say it’s damn fast. But the core of the algorithm is that nested loop.
In contrast, Bob’s algorithm looks something like this:
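Again as a rough Python sketch of the single-pass idea described above (the names are illustrative, not Bob’s):

```python
# A sketch of Bob's single pass: keep a running total of the current period
# and discard it as a "lost cause" the moment it drops below zero.
def max_lead_single_pass(duels):
    best = 0      # as before, an empty period counts as a lead of 0
    running = 0
    for result in duels:
        running += result
        best = max(best, running)
        if running < 0:   # a negative prefix can never help a later period
            running = 0   # start a new period from the next day
    return best
```

For instance, both sketches return 7 on the input [1, -2, 3, 4].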
It is not immediately obvious that Bob’s algorithm works. We will not analyze correctness in this article, because here we are interested in performance. However, this is a simple case of a dynamic programming algorithm—specifically Kadane’s algorithm, if you want to Google it—a technique we will explore in future entries, so rest assured you will learn how to write this kind of algorithm.
In the meantime, if you are unsure whether Bob’s algorithm really works, we suggest running it by hand on a few inputs with both positive and negative numbers, so you can build some intuition for why discarding the running count when it drops below zero makes sense.
Now let’s analyze the runtime cost.
I’ll claim without proof that Alice’s algorithm performs something close to (1/2) ⋅ N² iterations of the inner loop. To keep it simple, we will greatly overestimate the runtime of each instruction. Let’s assume each iteration of that inner loop in C++ takes something like a millisecond—imagine it’s the 70s or something.
In contrast, Bob’s algorithm clearly performs N iterations, and let’s assume each one of them takes around 100 milliseconds. Additionally, let’s make it even worse and assume Bob’s program takes something like 1 second just to get started because firing up the Python interpreter has some non-negligible cost.
This means we can estimate Alice’s runtime in terms of the size of the data as 0.5N² ms, and Bob’s as 100N + 1000 ms. As we will see shortly, the exact numbers don’t matter too much, but I want to illustrate the idea that C++ is a significantly faster programming language.
Now, let’s plug in some numbers for a dataset of size N:
For N = 10 days:
Alice: 0.5 ⋅ 10² = 50 ms = 0.05 seconds.
Bob: 100 ⋅ 10 + 1000 = 2,000 ms = 2 seconds.
Alice wins, it’s not even close.
For N = 200 days:
Alice: 0.5 ⋅ 200² = 20,000 ms = 20 seconds.
Bob: 100 ⋅ 200 + 1000 = 21,000 ms = 21 seconds.
Alice still wins, but just barely. The race is getting tight.
For N = 1,000 days:
Alice: 0.5 ⋅ 1000² = 500,000 ms = around 8.3 minutes.
Bob: 100 ⋅ 1000 + 1000 = 101,000 ms = around 1.7 minutes.
Bob wins decisively. From this point on, his lead will only grow.
For small input sizes, the N and N² terms are small enough that the runtime is dominated by the constants: the things that depend on the programming language, compiler, and hardware. But for sufficiently large inputs, the constants start to matter less and less, and the N² term begins to grow catastrophically faster. So much so that even if Alice were to further optimize her code down to a millionth of a second per operation (or any constant at all), there would still be a point—a crossover input size—where Bob’s algorithm would become faster.
We can visualize what’s going on with an informal illustration like the following:
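If you want to draw a picture like this yourself, a minimal sketch along the following lines produces a similar plot. It assumes the rough cost models above and that numpy and matplotlib are available:

```python
# Plot the two estimated cost models and watch the curves cross.
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(1, 401)
alice = 0.5 * n**2       # estimated runtime in ms: 0.5 * N^2
bob = 100 * n + 1000     # estimated runtime in ms: 100 * N + 1000

plt.plot(n, alice, label="Alice: 0.5·N²")
plt.plot(n, bob, label="Bob: 100·N + 1000")
plt.xlabel("N (days of records)")
plt.ylabel("estimated runtime (ms)")
plt.legend()
plt.show()
```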
This is the intuition we want to formalize.
Proving Bob Eventually Wins
To move beyond intuition, we need a formal language to compare these algorithms. Remember, what we want to say is that Bob’s linear approach eventually wins, no matter how optimized Alice’s quadratic algorithm is.
There are two ways we could go about showing this.
Method 1: The Direct Comparison
The straightforward way is to compare Bob and Alice head-to-head and prove that Bob’s algorithm grows strictly slower than Alice’s.
For this, we’ll use a tool from calculus called Little o notation.
Formally, we say a function f(n) ∈ o(g(n))—read as, and bear with me, “f(n) is in little-o of g(n)”—if for any positive constant c, there exists a point n₀ such that for all n ≥ n₀ we have f(n) < c ⋅ g(n).
The crucial word here is “any”. It means that no matter how much Alice optimizes her code—making her constant factor c a millionth or a billionth—there is always a point n₀ after which Bob’s algorithm will be faster. It’s the ultimate proof of dominance.
Let’s prove Bob is o(Alice) then. This means proving that f(n) = 100n + 1000 is in o(0.5n²).
We need to show that for any c > 0, we can find an n₀ such that 100n + 1000 < c ⋅ (0.5n²) for all n ≥ n₀. We can find the exact point where the two curves intersect by solving the following quadratic equation:

0.5c ⋅ n² − 100n − 1000 = 0

As you may remember, this equation can have at most two real solutions. Writing it as a ⋅ n² + b ⋅ n + d = 0, the largest one will be given by the formula:

n = (−b + √(b² − 4ad)) / (2a)

Substituting a = 0.5c, b = −100, and d = −1000, and grouping, we have:

n₀ = (100 + √(10000 + 2000c)) / c
For example, for c = 1, we have a crossover point at approximately 210. This means that just after 210 data points, Bob’s algorithm will already be faster than Alice’s. However, even if you make c much smaller, like 10⁻⁶, we still find that after approximately 200 million data points, Bob’s algorithm will become better.¹
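If you want to double-check these numbers, here is a quick numeric sanity check of the crossover formula (a minimal Python sketch; the function name is just for illustration):

```python
# Sanity check of the crossover point n0 = (100 + sqrt(10000 + 2000c)) / c,
# the largest root of 0.5c * n^2 - 100n - 1000 = 0.
import math

def crossover(c):
    return (100 + math.sqrt(10000 + 2000 * c)) / c

print(crossover(1))     # ~209.5  -> Bob is faster after roughly 210 data points
print(crossover(1e-6))  # ~2.0e8  -> even then, Bob wins after ~200 million points
```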
So, we have proven once and for all that Bob’s linear approach will always beat Alice’s double loop algorithm, regardless of how optimized it becomes.
This is a powerful way to compare two specific algorithms, but it’s cumbersome. We don’t want to have to solve all these equations for every single pair of algorithms we encounter.
There’s a much better and faster way.
Method 2: Complexity Classes
Instead of comparing algorithms one by one, we can sort them into broad complexity classes. That is, we want to formalize the claims “Bob’s algorithm is linear” and “Alice’s algorithm is quadratic,” because we already know that any linear algorithm will eventually beat any quadratic one.
The most important tool for defining these classes is the Big O notation, which defines an upper bound on an algorithm’s growth. Its definition is very similar to Little o, but a bit more relaxed.
Formally, we say a function f(n) ∈ O(g(n)) if there exist positive constants c and n₀ such that for all n ≥ n₀ we have f(n) ≤ c ⋅ g(n).
Now, instead of any constant c, we can just pick one specific constant value and compare the two functions after one specific crossover point n₀. This should be a bit easier than before.
Why is this useful? Because now we can say Bob is in O(n) while Alice is in O(n²), and instead of comparing them directly, we compare the complexity classes themselves, and we already know that anything in O(n) grows more slowly than anything in O(n²). We use the complexity classes as proxies to avoid having to do a full proof for every constant c any time we need to compare two specific algorithms.
We can easily prove that Bob’s runtime, 100n + 1000, is in O(n). We need to find constants c and n₀ such that 100n + 1000 ≤ cn for all n ≥ n₀. If we pick c = 101 and n₀ = 1000, the inequality holds. This places Bob in the linear complexity class.
Likewise, we can prove that Alice’s runtime, 0.5n², is in O(n²). This one is even simpler: if we just pick c = 1 and n₀ = 1, the inequality holds. This places Alice in the quadratic complexity class.
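These choices of constants are easy to spot-check numerically; the snippet below is only a sanity check over a sampled range, not a proof:

```python
# Spot-check (not a proof) of the constants chosen above, on a sampled range:
# 100n + 1000 <= 101n holds for all n >= 1000, and 0.5n^2 <= 1 * n^2 for n >= 1.
assert all(100 * n + 1000 <= 101 * n for n in range(1000, 100_000))
assert all(0.5 * n**2 <= 1 * n**2 for n in range(1, 100_000))
print("both inequalities hold on the sampled range")
```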
By classifying them first into complexity classes, we can forget the messy constants and make a simple, powerful statement: Bob’s linear algorithm is fundamentally more efficient than Alice’s quadratic one.
Case solved? Not exactly. There was a sleight of hand here that you may have missed. The thing is, Big O is an upper bound, but not necessarily a tight one. You see, Bob is also O(n²): Bob is O(n), and n is itself O(n²), and belonging to Big O, which involves a less-or-equal comparison, is a transitive property. This means that if f(n) ∈ O(g(n)) and g(n) ∈ O(h(n)), then f(n) ∈ O(h(n)).²
So, in that same vein, it could be the case that Alice is also O(n). And if that were so, our comparison would be moot. We need to show not only that Alice is bounded by quadratic growth, but also that this bound is tight: there is no more slowly growing function that is also an upper bound on Alice’s algorithm.
Tightening the Bounds
To get a tighter bound, we will first flip the definition of Big O to create a lower bound. This is called Big Omega (Ω). The definition is the same, but with greater-or-equal: a function f(n) ∈ Ω(g(n)) if there exist two positive constants c and n₀ such that for every n ≥ n₀ we have f(n) ≥ c ⋅ g(n).
I’ll leave it to you as another homework exercise to prove that Bob is indeed Ω(n) and Alice is Ω(n²). It’s not that hard. Once that’s done, we can introduce a new definition, called Big Theta (Θ), which is exactly what we want: a tight bound.
Formally, we say a function f(n) ∈ Θ(g(n)) if f(n) ∈ O(g(n)) and f(n) ∈ Ω(g(n)). That’s it. We have the same complexity class as upper and lower bound, so our algorithm is exactly placed in that complexity class.³
Incidentally, we can also prove that this particular problem (finding the maximum sum subarray) has a minimum complexity of Ω(n), which means Bob has the asymptotically optimal algorithm! To prove this, we could use an adversary-style argument like in the previous article. Intuitively, if you don’t consider all the numbers, I can always place a very large or very small number in the unchecked spot to falsify your result.
This is an important distinction to draw: saying that a given algorithm has an O(g(n)) or Ω(g(n)) bound is not the same as saying that a given problem has such bounds. Specific algorithms for the same problem may vary widely in performance, as we’ve just seen with Bob and Alice: one is linear and one is quadratic.
The existence of a given algorithm that is O(g(n)) tells us that the underlying problem is at most O(g(n)), because we already found an upper bound. But to claim a given problem is Ω(g(n)), it is not sufficient to find an algorithm with that lower bound, or even a tight bound at Θ(g(n)). There could exist other algorithms that are even faster. To prove a lower bound for a problem we need some sort of formal argument like the adversary-style proof we did last time, or another formal technique. We will explore more ways to do this in future articles.
Rules of Thumb for Asymptotic Analysis
Now that we have a fairly complete theory of algorithmic analysis, the next pain point to sort out is how to determine the upper bound of any given algorithm without resorting to cumbersome algebraic proofs. To do so, we will first introduce some properties of Big O that will help us simplify its computation. I’ll give you these without proof because the article is already getting quite long, but they are not especially hard.
Property 1: If f(n) ∈ O(c ⋅ g(n)) then f(n) ∈ O(g(n)). This means we can drop all constants and just keep the polynomial (or other functional) terms.
Property 2: If f(n) ∈ O(g(n) + h(n)) then f(n) ∈ O(max(g(n), h(n))). This means you can drop all but the biggest term.
Given these properties and some intuitive notions, we can devise the following rules of thumb to quickly calculate the upper bound of most iterative algorithms:
Rule 1 (Multiply Loops): A single loop over N items is O(N). Nested loops are multiplied; a loop inside a loop is O(N²), three nested loops are O(N³), and so on.
Rule 2 (Add Sequential Steps): When an algorithm has multiple sequential steps, their runtimes are added. For example, an O(N²) step followed by an O(N) step is O(N² + N).
Rule 3 (Drop Constants & Keep the Dominant Term): When you have a combined runtime like O(N² + N), or a messy one like Bob’s (100N + 1000), you simplify by dropping all constant factors and keeping only the fastest-growing term. O(100N + 1000) becomes O(N), and O(N² + N) becomes O(N²).
These rules let us look at most iterative algorithms and immediately know which complexity class they belong to. Alice’s algorithm has two nested loops, so it’s clearly O(N²), and Bob’s has a single loop, so it’s clearly O(N). All the rest are constants we can drop.
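As a quick exercise, here is a small made-up function annotated with these rules (the function and its task are hypothetical, chosen just to practice on):

```python
# A hypothetical function to practice the rules of thumb on.
def example(data):
    total = 0
    for x in data:              # Rule 1: a single loop over N items -> O(N)
        total += x

    zero_sum_pairs = 0
    for x in data:              # Rule 1: two nested loops -> O(N^2)
        for y in data:
            if x + y == 0:
                zero_sum_pairs += 1

    # Rule 2: the two sequential steps add up -> O(N + N^2)
    # Rule 3: drop constants, keep the dominant term -> O(N^2)
    return total, zero_sum_pairs
```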
With enough practice, you can quickly learn to identify which parts of an iterative algorithm correspond to which typical growth functions (usually polynomials) and compute the upper bound just by looking at the code. The lower bound is sometimes harder to compute because your algorithm could return quickly in some cases, but if you perform a worst-case analysis, which is often the case, you can assume all loops are completed and thus the lower bound is usually the same as the upper bound (at least in simple iterative algorithms).
The Paradox of Super-Linear Growth
“Okay,” you might argue, “but what if our computers get faster? Way faster?” With technology roughly doubling in power every couple of years, surely Alice’s super-optimized algorithm will just get better and better, eventually erasing Bob’s advantage, right?
Wrong. It’s actually the opposite. Alice’s algorithm gets worse relative to our ambitions the better our computers get. Here’s why.
The goal of progress is that when we buy a computer that’s twice as fast and has twice as much memory, we want to solve a problem that’s twice as big in the same amount of time as we previously could.
With Bob’s O(N) algorithm, this works perfectly. A problem twice as big (2N) on a computer twice as fast takes (2N)/2 = N time. We solve a bigger problem with a faster computer in the same time.
With Alice’s O(N²) algorithm, the math is devastating. A problem twice as big (2N) on a computer twice as fast takes (2N)²/2 = 4N²/2 = 2N² time. Even with a machine that is twice as powerful, it takes us twice as long to solve the scaled-up problem.
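Another way to see the same effect is to ask how much more data fits into the same time budget when the machine gets k times faster. A small sketch, ignoring the constant factors (which is an assumption):

```python
# If runtime grows like N**p, a machine k times faster handles inputs that are
# only k**(1/p) times larger in the same amount of time (constants ignored).
def input_growth(k, p):
    return k ** (1 / p)

print(input_growth(2, 1))  # Bob, linear:      2.0   -> twice the data
print(input_growth(2, 2))  # Alice, quadratic: ~1.41 -> only about 41% more data
```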
Any algorithm with super-linear complexity (O(N²), O(N³), etc.) creates a scenario where our tools get better, but our ability to solve the next generation of problems gets worse. We fall further and further behind. This is why the search for efficient algorithms is not just an academic exercise but a fundamental requirement for continued scientific and technological progress.
Conclusion
The story of Alice and Bob wasn’t just a fun example. It reveals a fundamental truth: focusing on implementation details before understanding the underlying complexity is like polishing the brass on the Titanic.
Asymptotic analysis is the lens that allows us to see past the misleading, short-term details of implementation and focus on the timeless, structural truth of an algorithm. It allows us to make profound, universal statements. We can say that Bob’s approach is fundamentally and forever better than Alice’s for large inputs, regardless of the hardware we will invent in the future.
We’ve developed a language to compare the infinite, not by their specific values, but by their rate of growth. This mathematical lens gives us a glimpse into the inherent, timeless complexity of the problems we seek to solve.
It’s a language pioneered by computer scientist Donald Knuth, who imported the notation from number theory and applied it to the analysis of algorithms. His foundational work, particularly in his book series The Art of Computer Programming, established the field and earned him a Turing Award in 1974. For this, he is rightly considered the father of modern algorithmic analysis.
But this kind of analysis is almost trivial whenever we have simple iterative algorithms. When we cross into the realm of recursive algorithms—algorithms that call themselves—our simple rules of thumb for computing asymptotic complexity don’t work. The complexity of these algorithms is described by a recurrence relation, which can be much harder to solve analytically. But for this, we also have shortcuts. Stay in the loop for the next entry in the series if you want to know more.
1. If this seems like a very large number, consider that current machine learning models that run on a consumer computer can have something like 10 billion parameters, meaning they perform roughly that many operations for each word they generate.
2. In fact, it is easy to prove that Big O establishes a partial ordering among functions, but we’ll leave this proof for homework.
3. As a side note, Big Θ defines an equivalence relation, which means you can partition all functions into complexity classes. The partition has infinitely many classes, though.