The Computer Scientist's Guide to Graph Theory, ep. 01
Running into the woods: trees, forests, and spanning subgraphs
Hi there! This post is a collaboration with Alejandro, taken directly from our upcoming book, The Hitchhiker's Guide to Graphs. This project grew out of our series in graph theory here, aiming to focus on the algorithmic aspects as well. We'll even implement our custom graph library from scratch.
The book is already available in early access, with instant updates to the content as we write it.
Alejandro is one of my favorite writers and one of my original inspirations for starting to talk about mathematics online. We complement each other in knowledge and style: He is a master of computer science and algorithms, and I make the necessary theoretical background as clear as possible.
If you would like to see what's the book like, check out our introductory series on graph theory:
Episode 00: What do your brain, your friends, and the city you live in have in common
Episode 01: Running into the woods: trees, forests, and spanning subgraphs
In our previous entry, we saw that graphs are everywhere. Whenever you want to model relationships between objects, you have a graph. Today, we will explore one special type of graph: trees.
Intuitively, you can picture a tree as a hierarchical structure, with an initial top-level object and several levels of branching children. Examples of trees can be found in all categorizations and taxonomies in most scientific disciplines.
Biological organisms are classified into viruses, bacteria, archaea, and eucaryotas, the later branching into fungi, plants, animals, each of these branching into families and species. An organization is often hierarchical, branching from the CEO or director, down to departments, offices, and individual workers. The file system in your computer is a tree, branching from a main folder (/ in Linux or C: in Windows) into files and folders, which in turn contain other files and folders, and so on. A project can be analyzed as a tree, with top-level tasks branching into sub-tasks, over and over, until reaching the bottom-level, atomic tasks. Even geographical elements are often organized in a tree-like structure, from continents to countries to provinces or states to districts to neighborhoods to individual blocks and houses.
All of these are examples of rooted trees, in which we explicitly mark one special object as the root and from there on, there is a strict notion of a parent-child relationship. (Each object, except the root, having exactly one parent.) This is the most common, and often most practical way to think about trees in computational terms.
However, in graph theory, we work with the more general concept of the unrooted tree. This is a kind of undirected graph that generalizes the notion of objects connected in a hierarchical way, without committing to a particular root. Thus, we can apply all regular graph theory to trees —like paths and distances and connected components— and develop new theory that addresses questions specifically about the nature of trees.
Basic properties of trees
Let’s start by getting the definitions out of the way.
Definition 1. A tree is a connected, acyclic, and undirected graph.
(Acyclic means that it has no cycles.) The following are all examples of trees.
Notice how these trees have no intrinsic hierarchical structure, as there is no explicit root. However, intuitively, we can turn any of these into a rooted tree by selecting a root and then following paths outwards, counting the distance –in number of edges– from the root to every node, thus defining the levels of the hierarchy.
After rooting a tree, we can explicitly visualize the hierarchical structure. (Although there is no intrinsically correct drawing.)
(By the way, a collection of trees is aptly called a forest —because, of course it is, right?)
Since trees are, by definition, acyclic and connected, we can formulate an alternative equivalent definition.
Theorem 1. A graph is a tree if and only if there exists a unique path between every pair of vertices.
To prove this theorem, we need to show two things.
If G is a tree, every pair of vertices have exactly one path between them.
If in an arbitrary graph G, every pair of vertices have exactly one path between them, then G is a tree.
(Remember, by definition, trees are connected and acyclic. Thus, to prove the previous two claims, we will show that violating one of these claims would make the graph disconnected, or it would create a cycle.)
First, we will prove a helpful intermediate result. (These are called lemmas).
Lemma 1. In any graph, two different paths between any pair of vertices imply there is a cycle.
Proof. This lemma seems intuitively true, because if we have two different paths between, say, vertices u and v, we could go from u to v via one path and return via the other. It both paths are disjoint, we’re done. However, these paths could share some common part —actually, more than one— and we still need to show exactly where the cycle would occur.
To prove the general case, we just need to make it explicit which vertices correspond to the cycle. Thus, we can say, let x be the first vertex after which both paths diverge — x could be u, but it can’t be v, for otherwise the two paths would be exactly the same. Now let y be the first vertex after x that is common to both paths — again, it could be v, but not exactly x because then the paths would be equal. Thus, there are two fully disjoint trails between x and y, each at least one edge long.
And, as we shown in the previous episode, when there is a trail, there is a path. Joining these two disjoint paths give our cycle. ◼
Note that the opposite is also true. If we have a cycle, then we have two different paths between every pair of vertices in that cycle — correspondingly, going clockwise from u to v, and going counterclockwise from u to v, as in the trivial case of the previous image.
With this handy lemma in our toolbox, we can easily prove Theorem 1.
Proof of Theorem 1.
Let’s start with the first part: if G is a tree, every two pair of vertices have exactly one path between them. Since G is a tree, it is by definition connected. Thus, there is at least one path between every two pairs of vertices (this is just the definition of a connected graph, from the previous lesson.) Now, suppose there are two distinct paths between an arbitrary pair of vertices u and v, then by virtue of our lemma, G would contain a cycle, which contradicts the claim that G is a tree. ◼
Now, we show that if every pair of vertices have exactly one path between them in G, then G is a tree.
If every two pair of vertices have one path, then the graph G must be connected. Furthermore, if there is exactly one path, then G cannot have cycles, for if there where a cycle, it would imply the existence of at least two disjoint paths between some pair of vertices u and v. Thus, G is connected and acyclic, which means it is a tree. ◼
The size of a tree
Now that we have our alternative definition securely proven, we can explore other properties of trees. The most important one is that the number of edges in a tree is always determined by the number of vertices.
Theorem 2. Every tree G of order n (number of vertices) has size m = n − 1 (number of edges).
Proof. There are several ways to prove this theorem that involve finding contradictions with the definition of the tree. But the most straightforward is just using induction on the number of vertices n.
(If you are not familiar with proof by induction, check out this article.)
For n = 1, the only possible tree is an isolated vertex, which has exactly m = n − 1 = 0 edges. Now, suppose we have a tree of order n > 1 for some arbitrary value of n. Let’s pick a random edge uv and remove it. Since a tree is acyclic, this edge must be a cut edge, thus we now have two connected components, which are themselves trees (both are connected by definition, and we can’t create new cycles by removing edges).
Let n₁ and n₂ be the number of vertices in each component, where n₁ + n₂ = n, because we didn’t remove any vertex. By the inductive hypothesis, we have m₁ = n₁ − 1 and m₂ = n₂ − 1. But since we removed one edge, we have m₁ + m₂ + 1 = m.
Thus, n₁ − 1 + n₂ − 1 + 1 = m. Reordering terms, we have n₁ + n₂ − 2 + 1 = m which implies m = n − 1. ◼
You may have noticed that our induction was a bit weird, in the sense that instead of assuming it works for n - 1 and working our way towards n , we assumed it worked for anything below n and worked our way back to n.
It’s because we are using strong induction, which is a slightly stronger version of induction. In classical induction — also called full induction, you always make exactly one step at a time, jumping from n - 1 to n + 1. In strong induction, you are allowed to jump to n not necessarily just from n − 1, but from any proven step smaller than n. If you accept full induction as true, then you must accept strong induction as well, because they are equivalent.
An alternative definition of trees
Combining the previous results, we can then formulate several equivalent definitions for a tree, all of which are necessary and sufficient. We will give them without any further proof.
Theorem 3. A graph G is a tree if and only if any of the following conditions hold:
G is acyclic and connected.
Between every pair of vertices u, v ∈ V(G) there is exactly one path.
G is connected and has n − 1 edges.
G is acyclic and has n − 1 edges.
We already saw conditions 1. and 2., and also saw how they imply 3. and 4. All that remains would be to prove that 3. and 4. also imply 1. or 2.
Ultimately, what remains is to show that:
If a graph is connected and has n − 1 edges, then it has no cycles.
If a graph is acyclic and has n − 1 edges, then it is connected.
We will pull a classic professor move here and leave the proofs as exercises for the reader :) Leave us a comment with your idea for a proof!
Spanning trees
Moving forward, we will now explore one fundamental application of trees in the general graph theory.
Imagine you are a city planner tasked with connecting all buildings to the electrical network. There is one central power plant, and all wires must go over some road. Of course, we want all buildings to be connected to the power plant without any redundant connections — that is, every building must be connected to the power plant through a unique line, thus the power lines graph should be a tree.
(Example graph of a city with the vertex that represents the power plant in red, blocks as vertices, and roads as edges)
The kind of tree we’re looking for in the previous examples is called a spanning tree, which, intuitively, is a tree that covers all vertices of a given graph. Let’s see the definitions first.
Definition 2. Given a graph G, a subgraph H of G (denoted by H ⊆ G) is any graph constructed by taking a subset of the vertices of G (possibly all), and then a subset of the corresponding edges.
A subgraph is just what the names indicate, a piece of a graph that is also a graph in itself. Now, there are two interesting types of subgraphs: induced and spanning subgraphs.
Definition 3. Given a graph G and a subgraph H ⊆ G, we say that
H is an induced subgraph, if V(H) ⊆ V(G) and all of the edges touching V(H) are in E(H),
and H is a spanning subgraph, if V(H) = V(G).
Here is an illustration that makes the difference between arbitrary, induced, and spanning subgraphs clear.
We will take more about induced subgraphs in later lectures, but now we will focus on spanning trees.
Definition 4. A spanning tree T of a connected graph G is a spanning subgraph of G that is also a tree.
Finding spanning trees is one of the most important applications of tree theory in computer science, since a spanning tree is the most efficient way to connect a graph, in terms of using the least amount of edges.
The first result we will prove is that, given the right circumstances, we can always find this spanning tree.
Theorem 4. Every connected graph G has at least one spanning tree.
Proof. To prove this theorem we can use a similar argument to our first theorem in our previous lesson. Since the graph is connected, it already contains all the vertices, so the only reason why it may not already be a spanning is because it has extra edges — that is, it has some loops.
Thus, we can always find a non-cut edge and remove it. If there is none, then the graph must be a tree. Otherwise, removing a cut edge, by definition, won’t disconnect the graph.
Since the number of edges is finite, we will always reach a point were no more edges can be removed. ◼
(An alternative but equivalent, although perhaps more formal proof is to show that among all connected spanning subgraphs, the smallest ones — in terms of number of edges — have to be trees. Otherwise, we could remove one edge and get a smaller, connected spanning subgraph.)
Optimal spanning trees
Theorem 4 is good news for our city planning task because it tells us we can always find a way to connect all buildings to the power plant without unnecessary redundant wires. However, there could be more than one spanning tree, not necessarily with the same characteristics.
For example, if we think of the edges (i.e, roads in our previous example) as having some associated cost (e.g., the length of the road), it makes sense to ask for the shortest spanning tree — that is, the one that has the minimum total wire length.
In computer science terms, this is called a minimum cost spanning tree, which assumes all edges have some associated cost. Certainly, there are a couple of clever algorithms to compute it efficiently, but since this series is about graph theory, we won’t get into programming issues at the moment. We will return to minimum spanning trees in a future series on graph algorithms.
There is, however, another interesting notion of “optimal” spanning tree that doesn’t require edge costs. Imagine, for example, that we want all our houses connected to the power plant using the minimum number of posts — maybe because that minimizes the energy lost in power transformers. I don’t know, I’m no physicist.
In this case, what we want is distance-conserving spanning tree, that is, a tree in which the distance from every vertex u to some predefined vertex v is the same as in the original graph. Before seeing how we find that tree, let’s settle a couple definitions.
Definition 5. In a graph G, the distance between two vertices u and v (not necessarily distinct) is denoted as dist(u, v) and defined as the number of edges in the shortest path between u and v. If there is no such path, then dist(u, v) = ∞.
Thus, a distance-conserving spanning tree is defined with respect to an origin vertex v — the power plant in our example — and constructed in such a way that the unique path between v and every other vertex u is precisely the shortest path (in terms of number of edges).
As an example, here are two spanning trees, one of which conserves distances with respect to the central vertex, while the other doesn’t.
(Notice that all spanning trees have exactly the same total number of edges, of course. But the distance-conserving is that one in which, on average, the path from the origin to every other node is shorter.)
We will finish this lecture by showing that we can always find a distance-conserving spanning tree, and at the same time give an algorithm to build it. Don’t worry, no coding is required to understand this algorithm.
Theorem 4: Let G be a connected graph. For every vertex v, there is at least one spanning tree that conserves distances with respect to v.
Proof. To prove this theorem, we will simply show how to build such a tree. We will perform what is technically called a breadth first search (BFS) algorithm on the graph. Intuitively, this just means starting at our origin vertex v, and iteratively expanding outwards in “circles” of one edge at a time, thus first finding all vertices at distance 1, then 2, and so on.
A bit more formally, the BFS algorithm works as follows:
Mark the origin vertex v as having dist(v, v) = 0.
Find all unmarked vertices w adjacent to some vertex u marked with a finite
dist(v, u) = k, and mark then as having dist(v, w) = k + 1.Repeat Step 2. until no more unmarked vertices can be found.
(Technically, this only really works in connected graphs. But if there are unreachable vertices from v, once finished, we can just mark them as having dist(u, v) = ∞ as per the definition of distance for unconnected vertices.) ◼
The following image shows this process iteratively, highlighting which vertices are discovered after each iteration. When finished, we obtain a spanning tree that conserves distances from the origin vertex.
The proof that BFS actually preserves distances is not difficult but unnecessarily technical for our purposes. Let’s give a sketch instead!
The essence lies in realizing that at iteration i we reach all vertices at distance i from the origin. This can be proven via induction on the number of iterations, starting at zero — the only vertex with distance 0 to v is v itself. Then, if a w vertex is at distance i from v, it must be the case that the shortest path from v to w contains some vertex u adjacent to w that is at distance i − 1 from v. By the induction hypothesis, then u must have been discovered in the previous iteration, and thus w will be discovered next.
Trees are one of the most fascinating and well-studied topics in graph theory and computer science in general. We will come back to trees over and over in the next few lessons, as many proofs for other topics in graph theory involve finding some tree-like structure and exploiting its properties.
There is much more to be said, especially with respect to algorithms for solving many interesting problems in trees. In our future series on graph algorithms, we will dive much deeper into applications of trees and discover some of the cleverest tricks for designing efficient algorithms.