Greedy Algorithms

(Algorithms chapter 5)

Greedy strategies take a very natural approach to solving a problem: given the choices made so far, make the choice that provides the most benefit going forward. We can think of a greedy algorithm as making the choice that looks locally optimal. However, our goal is generally a global optimum. When do these locally optimal moves produce a global optimum? We start by seeing some examples.

Def 1. (Matching) Given a graph G=(V, E), a subset of edges M \subseteq E is a matching if every two edges in M have disjoint endpoints.

Maximum Matching Problem

Given a graph G=(V, E), find a matching of maximum cardinality.

A greedy approach:  Pick any available edge that extends the current matching.

Does the above algorithm always output a maximum matching of the graph? No. Consider a path with three edges. If the middle (green) edge is chosen as the first edge by the algorithm, no other edge can be added and the algorithm stops with a matching of size 1. The optimal solution, however, is the matching of size 2 formed by the two end (red) edges.

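The failure can be made concrete with a short sketch. The helper below (the names and the edge orderings are illustrative assumptions, not part of the notes) applies the greedy rule to the path a–b–c–d in two different edge orders: examining the middle edge first yields a matching of size 1, while examining the end edges first yields size 2.

```python
def greedy_matching(edges):
    """Add each edge, in the given order, if both its endpoints are still unmatched."""
    used, matching = set(), []
    for u, v in edges:
        if u not in used and v not in used:
            matching.append((u, v))
            used.update({u, v})
    return matching

# Unlucky order: the middle edge comes first and blocks both end edges.
bad = greedy_matching([("b", "c"), ("a", "b"), ("c", "d")])   # size 1
# Lucky order: the two end edges are picked.
good = greedy_matching([("a", "b"), ("c", "d"), ("b", "c")])  # size 2
```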

Shortest Path Problem

Given a graph G=(V, E) with weights on the edges,  and vertices (s,t), find a path between s, t with minimum total weight.

A greedy approach: Start at s and at each step move to the neighbour whose distance to t is smallest (ignoring the cost of the edge taken to reach it).

Does the above algorithm always find a shortest path? No. At the first step, the algorithm chooses the neighbour c, because the distance from c to t is 6 versus the distance from a to t, which is 8. The greedy algorithm therefore outputs the path (s, c, d, t). However, the shortest path is (s, a, b, t).


Minimum Spanning Tree (MST) Problem

The next example is a problem (computing the minimum spanning tree of a graph) where a greedy algorithm finds the optimal solution. Before getting into the problem, we need to introduce some definitions of a graph G=(V,E).

Def 2. (Connected Component) A connected component is a maximal subset of vertices M \subseteq V such that there is a path between any two vertices in M.

Def 3. (Spanning Subgraph) A spanning subgraph is a subgraph that includes every vertex in V.

Def 4. (Tree) A tree is a connected graph with no cycles.

Claim 5: A minimal spanning connected subgraph F \subseteq E is a spanning tree. That is,

  1. F has no cycles
  2. F has |V|-1=n-1 edges; i.e. |F|=n-1.

Proof: (1) Suppose F had a cycle. Then we could remove any edge on the cycle while maintaining connectivity between every pair of vertices, contradicting minimality. Therefore F has no cycles.

(2) We prove by induction that |F|=n-1. It’s clearly true for the base cases n=1,2. Now, for n>2, remove any edge e in F. Then F \backslash \{e\} has exactly 2 connected components (otherwise F was not minimal), with n_1, n_2 < n vertices respectively, where n_1 + n_2 = n. By induction,

|F| = (n_1-1) + (n_2-1) + 1 = n-1.

Minimum Spanning Tree (MST) Problem. Given a graph G=(V, E) and weights on the edges, find a spanning tree of G with minimum total weight.

A greedy approach: One possible approach is to simply select the lightest weight edge. Locally, this seems like the best edge to select. However, there is a potential problem: what if the lightest weight edge creates a cycle? Well, then we simply ignore that edge and proceed to the next lightest edge. We have essentially described Kruskal’s Algorithm:

Kruskal’s Algorithm for solving MST:


  • Sort edges in increasing order of weight; so have an ordering e_1, \ldots, e_m such that w(e_i) \le w(e_{i+1}).
  • Set T = \{\}
  • for i=1 to m
    • Add e_i to T if it doesn’t create a cycle.
  • Return T.


Theorem 6 (Kruskal): Kruskal’s algorithm produces a minimum weight spanning tree of G=(V,E).

Before proving this theorem, we need to first introduce some definitions and lemmas.

Def 7. (Cut) A cut (S,\bar S) is the set of edges between a subset of vertices S \subseteq V and its complement \bar S = V \backslash S. That is,

(S,\bar S) = \{(u_i, v_j) \in E | u_i \in S, v_j \in \bar S\}.

Lemma 8. Fix any edge e whose weight is the minimum in some cut (S,\bar S). Then, there exists an MST T containing e.

Proof: If e belongs to some MST we are done, so suppose it does not. Take any MST T and add e to it. Then e induces a cycle C (since T \cup \{e\} contains two paths between the endpoints of e).

Consider the cut (S,\bar S) for which e is a minimum weight edge. Then C must cross (S,\bar S) in some other edge f with weight(e) \le weight(f).

Set T' = T \cup \{e\} \backslash \{f\}. Note that weight(T') \le weight(T), which means T' is an MST (since it connects every pair of vertices and is minimum weight).

Lemma 9 (cut property). Let X be a subset of edges of an MST of a graph G. For any edge e such that e is a minimum-weight edge of a cut (S, \bar{S}) and X \cap (S, \bar{S}) = \emptyset, there is an MST T s.t. X \cup \{e \} \subseteq T.

Proof. Take some MST T that contains X. Let (S, \bar{S}) be a cut in which e is a minimum weight edge and assume that e \notin T. Add e to T; this creates a cycle. At least one other edge e' of the cycle must cross (S, \bar S). Since e is a minimum weight edge of the cut, w(e) \le w(e'), so T \cup \{e\}\setminus \{e'\} is a spanning tree of weight at most that of T, i.e. an MST, and it contains X \cup \{e\}.

Proof (of Kruskal Theorem): Using Lemma 9, we show by induction that the set X of edges added so far is always contained in some MST. When the next edge e is added, let S be the set of vertices in the connected component of the forest X containing one endpoint of e. Then e crosses the cut (S, \bar S), no edge of X crosses it, and e is a minimum weight edge of this cut: every lighter edge was considered earlier and either added (so it lies within a component and does not cross the cut) or rejected because it created a cycle. By Lemma 9, X \cup \{e\} is contained in some MST. Repeating this argument, the final output is a spanning tree contained in an MST, hence an MST itself.

Running time of Kruskal’s algorithm: O(m \log n).

Sorting the edges once up front takes O(m \log n).

However, we also have to keep track of whether an added edge creates a cycle, i.e. when an edge connects two vertices in the same connected component. We can do this via a simple data structure.

For each vertex, introduce a component number, which is initially just the index of that vertex. When Kruskal’s algorithm adds an edge which connects two vertices with the same component number, we skip that edge because it creates a cycle. If the component numbers of the two vertices are different, then we relabel the vertices in the smaller component to the label of the larger component. Since we relabel the smaller component, the size of a vertex’s component at least doubles each time it is relabeled. Therefore a vertex’s label changes at most O( \log n) times. Thus the total cost of this data structure is O(n \log n).
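The relabeling scheme above can be sketched in Python as follows (the representation of the graph as vertices 0..n-1 and edge triples (w, u, v) is an assumption of this sketch, not something fixed by the notes):

```python
def kruskal(n, edges):
    """MST of a graph on vertices 0..n-1; edges is a list of (weight, u, v)."""
    comp = list(range(n))               # component label of each vertex
    members = [[v] for v in range(n)]   # vertices carrying each label
    tree, total = [], 0
    for w, u, v in sorted(edges):       # increasing order of weight
        if comp[u] == comp[v]:
            continue                    # same component: edge would create a cycle
        a, b = comp[u], comp[v]
        if len(members[a]) < len(members[b]):
            a, b = b, a                 # relabel the smaller component...
        for x in members[b]:
            comp[x] = a                 # ...to the label of the larger one
        members[a].extend(members[b])
        members[b] = []
        tree.append((u, v))
        total += w
    return total, tree
```

Each vertex is relabeled O(log n) times, matching the analysis above; production implementations usually use union-find with path compression instead.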

Prim’s Algorithm:

Another algorithm to find the MST of a graph is Prim’s algorithm.

  • Start with any vertex, and mark it as visited.
  • Add the lightest weight edge from some visited vertex to an unvisited vertex; repeat until all vertices are visited.

As with Kruskal’s algorithm, its correctness follows from the cut property given above.

Running time of Prim’s algorithm: O(m \log n).
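A minimal sketch of Prim’s algorithm with a binary heap of candidate edges (the adjacency-dict representation is an assumption of this sketch; each edge enters the heap at most twice, giving the O(m \log n) bound):

```python
import heapq

def prim(adj):
    """MST weight of a connected graph; adj maps vertex -> list of (weight, neighbour)."""
    start = next(iter(adj))
    visited = {start}
    heap = list(adj[start])
    heapq.heapify(heap)
    total = 0
    while heap and len(visited) < len(adj):
        w, v = heapq.heappop(heap)   # lightest edge leaving the visited set
        if v in visited:
            continue                 # stale entry: v was visited in the meantime
        visited.add(v)
        total += w
        for edge in adj[v]:
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return total
```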


Can we define a more general class of problems that could be solved by the above greedy algorithm?

Def 10. (Matroid) A matroid consists of a universe of elements U and a collection I of independent subsets of U with the following properties

  • (Hereditary) A \subseteq B, B\in I \Rightarrow A \in I
  • (Exchange) |A|<|B|, A,B \in I \Rightarrow \exists e \in B \setminus A s.t. A\cup \{e\} \in I

A Greedy approach. Sort elements of U in decreasing order of weight. Add the next element in the list whenever adding it preserves independence.

Theorem 11. The above greedy algorithm finds the maximum weight independent set.

Proof. We prove by contradiction. Let x_1, x_2, x_3, \ldots be the elements output by the greedy algorithm, in decreasing weight order, and let y_1, y_2, \ldots be the elements of a maximum weight independent set, in decreasing weight order. Since the total weight of the optimal set is higher than the total weight of the greedy solution, there is a first index k with w(x_k) < w(y_k). Then |\{x_1, \ldots, x_{k-1}\}| < |\{y_1, \ldots, y_k\}| and both sets are independent. By the second matroid property, \exists y_j \in \{ y_1, \ldots , y_k \} such that \{ x_1, \ldots, x_{k-1} \} \cup \{y_j\} is an independent set. But w(y_j) \geq w(y_k) > w(x_k), and this contradicts the fact that the greedy algorithm chose x_k rather than y_j at step k.

Observation 12. Forests form a matroid. Define U = E and let I be the collection of subsets of edges with no cycles (forests). It is simple to check that this satisfies the two matroid properties. Given this, Theorem 11 states that the greedy algorithm above finds a maximum weight spanning tree. This can be used to solve the MST problem by defining new weights M_{ij} = W - w_{ij}, for a constant W larger than every edge weight: finding an MST on a graph with weights w is equivalent to finding a maximum weight spanning tree on the same graph with weights M.
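The matroid greedy algorithm can be written generically against an independence oracle. The sketch below uses a partition matroid (at most one element per group) as its example; the element names and weights are illustrative assumptions.

```python
def matroid_greedy(universe, weight, independent):
    """Maximum weight independent set, given an independence oracle `independent`."""
    chosen = set()
    for x in sorted(universe, key=weight, reverse=True):  # decreasing weight
        if independent(chosen | {x}):
            chosen.add(x)
    return chosen

# Example: a partition matroid -- a set is independent iff it
# contains at most one element from each group.
groups = {"a1": "a", "a2": "a", "b1": "b"}
weights = {"a1": 3, "a2": 2, "b1": 1}

def indep(s):
    labels = [groups[x] for x in s]
    return len(labels) == len(set(labels))

best = matroid_greedy(groups, weights.get, indep)  # {"a1", "b1"}
```

With `universe` the edges of a graph and `independent` an acyclicity test, this specializes exactly to Kruskal’s algorithm on the weights M above.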

Dynamic Programming

Divide & Conquer works when a problem can be solved by solving smaller problems of the same type.

In Dynamic Programming (DP), we typically have to define and solve a more general problem.

Shortest Path

Problem: Given a graph G = (V, E) with edge lengths l(e), find the shortest path from vertex s to t.

  • If all edges had unit length, a breadth-first search could solve the problem in O(|V|+|E|).
  • The previous solution can be extended to the weighted version by breaking edges into smaller unit edges, though the complexity of such a solution would be exponential in the size of the input (an edge of length L can be represented in O(\log L) bits).

Observation: A shortest s-t path P is also a shortest s-u path, for all vertices u on P.

More General Problem: Find shortest s-u paths, \forall u.

Algorithm (Dijkstra):

Idea: Maintain a tentative distance T(u) to each vertex u.

  • T(s) = 0
  • T(u) = \infty, \forall u \ne s
  • S = \{ \}
  • while S \ne V:
    • Let j be the vertex not in S with smallest T(j)
    • S \leftarrow S \cup \{ j \}
    • Update T(u) \leftarrow \min \{ T(u), T(j) + l(j,u) \} for every neighbour u of j not in S
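The procedure above can be sketched with a heap of tentative distances; the sketch also records predecessors so that the path itself, not only its length, can be recovered (the adjacency-dict representation is an assumption):

```python
import heapq

def dijkstra(adj, s):
    """adj maps vertex -> list of (neighbour, length); returns (dist, pred)."""
    dist, pred = {s: 0}, {s: None}
    heap = [(0, s)]
    done = set()  # the set S of finalized vertices
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue          # stale heap entry for an already-finalized vertex
        done.add(u)           # u joins S; T(u) = d is now final
        for v, l in adj.get(u, []):
            if d + l < dist.get(v, float("inf")):
                dist[v] = d + l
                pred[v] = u   # remember the previous vertex on the best path
                heapq.heappush(heap, (d + l, v))
    return dist, pred
```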

Theorem: Dijkstra’s algorithm correctly finds shortest path from s to all vertices.

Either of the following lemmas immediately implies the theorem.

Lemma: At every step, for every vertex u \in S, T(u) is the length of the shortest s-u path, and the members of S are the |S| vertices nearest to s.

Proof: We prove this by induction on |S|. Let S_k denote the set after k iterations. The base case |S_1| = 1 holds trivially. Induction step (k to k+1): let u be the (k+1)-st nearest vertex to s. The second-to-last vertex v on a shortest s-u path is nearer to the source, so by the induction hypothesis v \in S_k and T(v) is the distance of v from the source. When v was added to S, the algorithm updated its neighbour u, so T(u) \leq T(v) + l(v,u). On the other hand, at any point of the algorithm and for every vertex x, T(x) is at least the distance of x from the source, because the sequence of updates exhibits a path of that length from the source to x. Hence we must have T(u) = T(v) + l(v,u), equal to the distance of u from the source. Since for every other vertex x \in V \setminus S_k \setminus \{u\} we have T(x) \geq d(s,x) \geq d(s,u) \geq T(u), u is the vertex added to S at this step; T(u) is its distance from the source and u is the (k+1)-st nearest vertex to s, as required for S_{k+1}.

Alternatively we can prove the theorem using the following lemma.

Lemma: At every step, for each vertex u \in S, T(u) is the shortest path distance from s to u, and there is an s-u path of length T(u) that lies entirely inside S.

Proof: Again, by a simple induction on |S|. The base case S_1 = \{s\} holds. Induction step: assuming the lemma for k, let u be the nearest vertex to the source in V \setminus S_k. A shortest path from s to u starts in S_k and ends outside it; by the choice of u, all vertices before u on this path must lie in S_k (otherwise an earlier vertex of the path outside S_k would be nearer to s, contradicting the choice of u). Let v be the vertex preceding u on that path. When v was added to S, the algorithm updated its neighbour u, so T(u) \leq T(v) + l(v,u). On the other hand, at any point of the algorithm and for every vertex x, T(x) is at least the distance of x from the source, because the sequence of updates exhibits a path of that length from the source to x. Hence we must have T(u) = T(v) + l(v,u), equal to the distance of u from the source. Since for every other vertex x \in V \setminus S_k \setminus \{u\} we have T(x) \geq d(s,x) \geq d(s,u) \geq T(u), u is the vertex added to S at this step; T(u) is its distance from the source and there exists a shortest path to u using only vertices in S_{k+1}, completing the induction.

Note: Using a data structure such as a heap, the runtime of the above procedure is O( (|V|+|E|) \log |V|).

Note: We found the length of shortest path to each vertex. To retrieve the shortest path itself, we could iteratively check all neighbors of destination for being a previous node in a shortest path, or we could store this information while updating its T().

Longest Increasing Subsequence

Problem: Given a sequence of n numbers a_1, \ldots, a_n, find (the length of) its longest increasing subsequence (LIS).

Example: LIS for 1, 4, -1, 0, 6, 2, 4, 5 is -1, 0, 2, 4, 5 that is of length 5.

Note that the Divide & Conquer technique does not work directly for this problem.

Let’s solve a more general problem: finding the (length of the) LIS ending at element a_i, for every index i; call it s(i).

Recursion: s(i) = 1 + \max\big( \{ s(j) : j < i,\ a_j < a_i \} \cup \{0\} \big).

Using this recursion we can compute all s(i), and hence solve the LIS problem, in time O(n^2).

Note: This problem has an O(n \log n) solution, using monotonicity of smallest tail of increasing subsequences of fixed lengths.
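Both approaches can be sketched in a few lines: the first function implements the quadratic recursion directly, and the second is one way to realize the O(n \log n) idea in the note, maintaining for each length the smallest possible tail of an increasing subsequence of that length (both return only the length, as in the problem statement).

```python
import bisect

def lis_quadratic(a):
    """s[i] = length of the LIS ending at a[i]; O(n^2) overall."""
    s = []
    for i in range(len(a)):
        s.append(1 + max((s[j] for j in range(i) if a[j] < a[i]), default=0))
    return max(s, default=0)

def lis_nlogn(a):
    """tails[k] = smallest possible tail of an increasing subsequence of length k+1."""
    tails = []
    for x in a:
        pos = bisect.bisect_left(tails, x)  # first tail >= x (strict increase)
        if pos == len(tails):
            tails.append(x)                 # x extends the longest subsequence
        else:
            tails[pos] = x                  # x is a smaller tail for length pos+1
    return len(tails)
```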

Matrix Chain Multiplication

Problem: Given k matrices M_1, \dots, M_k, want to compute M_1 \times \dots \times M_k most efficiently by choosing the order in which we perform multiplications.

For simplicity say for each multiplication we use the naive method, taking time O(m n p) to multiply an m \times n and an n \times p matrix, and we only want to find the minimum cost of computing the final matrix.

Let the i‘th matrix be of size m_i \times m_{i+1}.

Example: If the sizes of the matrices are (2 \times 6), (6 \times 10), (10 \times 6), multiplying the left pair first results in a total cost of 2 \cdot 6 \cdot 10 + 2 \cdot 10 \cdot 6 = 240, while the other scenario (right pair first) costs 6 \cdot 10 \cdot 6 + 2 \cdot 6 \cdot 6 = 432.

More general problem: For every i \leq j, what is the (minimum) cost of multiplications from the i‘th matrix to the j‘th one, i.e., M_i, \dots, M_j, name this value C[i,j].


  • C[i, i] = 0 and C[i, i+1] = m_i m_{i+1} m_{i+2}
  • C[i, j] = \min_{i \le k < j} \big( C[i, k] + C[k+1, j] + m_i m_{k+1} m_{j+1} \big)

Analysis: Storing C in a two dimensional array, we need to compute the O(n^2) entries above the diagonal, each of which takes O(n) time, summing up to O(n^3) for our total computation.

Note: In DP we usually use smaller instances of the general problem to solve larger ones. While we compute the answer for an instance using others, it is important that the referenced instances be previously solved and known. In the last example, an acceptable order for computing the values C[i,j] is increasing order of j-i, i.e., the diagonals of the 2-d array C, moving up from the main diagonal.
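Filling the table along diagonals as just described gives the following sketch (the dimension list m, with matrix i of size m[i] x m[i+1], follows the notation above):

```python
def matrix_chain_cost(m):
    """Minimum multiplication cost for matrices of sizes m[i] x m[i+1], i = 0..k-1."""
    k = len(m) - 1                      # number of matrices
    C = [[0] * k for _ in range(k)]     # C[i][i] = 0 by definition
    for diag in range(1, k):            # entries with j - i = diag, moving up
        for i in range(k - diag):
            j = i + diag
            # best split point s: multiply M_i..M_s, then M_{s+1}..M_j
            C[i][j] = min(C[i][s] + C[s + 1][j] + m[i] * m[s + 1] * m[j + 1]
                          for s in range(i, j))
    return C[0][k - 1]
```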

Time & Space Hierarchies

Complexity of an algorithm measures how much of a certain resource the algorithm uses, such as time, space, or number of random bits.

How much power is granted to a machine by additional resources? Perhaps we get a hierarchy of languages, where each level of the hierarchy is the set of languages that are recognizable by a TM with a certain amount of space.

First, let us attempt to define some sets of languages that will help us reason about time and space hierarchies.

SPACE(s(n)) = languages accepted by TM’s that use at most s(n) space.

TIME(t(n)) = languages accepted by TM’s that run for at most t(n) steps.

Note that TIME(f(n)) \subseteq SPACE(f(n)): if a language L is accepted by a Turing machine M using at most f(n) steps, then M also uses at most f(n) space, hence L \in SPACE(f(n)).

Theorem SPACE(f(n)) \subseteq TIME(2^{O(f(n))})


Proof: Consider the current “configuration” of a TM, specified by (state q, head position, tape contents). With space f(n), we can bound the number of configurations as:

\begin{aligned}  \text{\# possible configurations} &\le |Q| \cdot |\Gamma|^{f(n)} \cdot (f(n) + n)\\  &= c_1 \cdot c_2^{f(n)} \cdot 2^{\log_{2}(f(n) + n)}\\  &\le 2^{c f(n) + \log_{2}(f(n) + n)} \\  &= 2^{O(f(n))}\end{aligned}

Since the TM must terminate, it never repeats a configuration (repeating one would mean looping forever); hence the total number of steps is at most the number of possible configurations, which is 2^{O(f(n))}.

Now we define formally the notion of the space requirement of computing a function.

Definition. A function s: \mathbb{N} \rightarrow \mathbb{N}, where s(n) \ge \log n, is space-constructible if there exists a TM M which on input 1^n outputs s(n) in binary and uses O(s(n)) space.

Theorem (Space Hierarchy). For any space-constructible function s(n), there exists a language L such that

L \in SPACE(O(s(n)))     L \notin SPACE(o(s(n)))

where L can be decided by a TM using O(s(n)) space and there does not exist a TM that decides L using o(s(n)) space.

Recall the difference between “little-o” and “big-O” for bounding a function. f=O(g) means that f grows at most the rate of g, up to constants; f=o(g) means that f grows strictly slower than g asymptotically. More precisely,

f(n) = O(g(n)) means \exists n_0, c>0 : \forall n \ge n_0, f(n) \le c \cdot g(n)

f(n) = o(g(n)) means \forall c>0, \exists n_0 : \forall n \ge n_0, f(n) \le c\cdot g(n)

For example, the Space Hierarchy theorem implies that a language recognizable by a TM using n^2 space exists where no TM recognizes that language using n^{1.99} space.

We can now prove the Space Hierarchy theorem.

Proof: (of Space Hierarchy theorem)

Let L be the language decided by the following Turing machine D_L. On input x:

  1. Mark s(n) space on the tape.
  2. Keep counter for number of steps taken by D_L
  3. If x is not of the form \langle M \rangle, 1^{n}, reject
  4. If M is not a valid Turing machine, reject
  5. Run M on x
    • If space used exceeds s(n), reject
    • If time used exceeds 2^{s(n)}, reject
  6. Else
    1. If M accepts, reject.
    2. If M rejects, accept.

By definition, D_L always terminates, and decides L using at most O(s(n)) space, and so L \in SPACE(O(s(n)))

Claim 1: No TM using o(s(n)) space can decide L.

Proof: Suppose there exists such a TM M, where M decides L using o(s(n)) space. Run D_L on (\langle M\rangle, 1^n) for n sufficiently large: since M uses o(s(n)) space, for large enough n the simulation (including its constant-factor overhead) fits within the marked s(n) space and terminates. But D_L accepts if and only if M rejects. As such, M does not decide L, which yields a contradiction.

The following theorem gives a similar result as the Space Hierarchy Theorem for time, but with a \log factor.

Definition: A function t: \mathbb{N} \rightarrow \mathbb{N}, where t(n) \ge n \log n, is time-constructible if there exists a TM M which on input 1^n outputs t(n) in binary and uses O(t(n)) time.

Theorem (Time Hierarchy Theorem): For any time-constructible function t(n), there exists a language L such that

L \in TIME(O(t(n)))     L \notin TIME(o(\frac{t(n)}{\log(t(n))}))

L can be decided by a TM using O(t(n)) time and cannot be decided by any TM using o(\frac{t(n)}{\log(t(n))}) time.


Proof (of Time Hierarchy Theorem): Consider the following TM B that, in time O(t(n)), decides a language L which we will show is not decidable in o(\frac{t(n)}{\log t(n)}) time. On input w:


  1. Let n be the length of w.
  2. Compute t(n) using time constructibility, and store the value \lceil t(n)/\log t(n)\rceil in a binary counter. Decrement this counter before each step used in stages 3, 4, and 5. If the counter ever hits 0, REJECT.
  3. If w is not of the form \langle M \rangle 10^* for some TM M, REJECT.
  4. Simulate M on w.
  5. If M accepts, then REJECT. If M rejects, then ACCEPT.

We first need to show that B runs in O(t(n)) steps. Stages 1-3 can be completed in time O(t(n)) by the definition of time-constructibility.

For stages 4 and 5, we need to simulate the operation of M. To simulate M, we store the current state of the machine, the current tape, and the transition function. These objects must be constantly updated throughout the simulation, and the updates must be performed efficiently. We separate B‘s tape into tracks (e.g. even and odd positions), where on one track we store M‘s tape, and on another track we store the current state of M along with a copy of M‘s transition function.

To keep the simulation efficient, we maintain that the description of M remains near the head of the TM. Whenever the position of M‘s head moves, we move the entire description of M along the second track. Because we consider the size of M to be a constant (not dependent on the size of the input provided to M), this only adds a constant overhead. Thus, we can simulate the operation of M with only constant overhead: if M runs in g(n) time, then B can simulate it in O(g(n)) time.

When B simulates the operation of M, it must also keep track of the counter at each step. We use a third track to store and update the counter. The length of this counter will be O(\log\big[t(n)/\log t(n)\big]) = O(\log t(n)). Thus updating this counter will require a O(\log t(n)) multiplicative overhead to the simulation of M, which means that the language L we constructed is decidable in O(t(n)) time.

Finally, we need to show that no TM can decide L in time o(t(n)/\log t(n)). Suppose for contradiction there existed a TM T that decided L in time o(t(n)/\log t(n)). Since both T and B decide the same language, they should agree on all inputs. Run the TM B on input \langle T \rangle 10^k for k sufficiently large. Including the O(\log t(n)) counter overhead, B‘s simulation of T on this input takes o(t(n)) steps, so it completes within B‘s step budget, and by construction B disagrees with T on this input. Thus the TM T cannot exist.

Nondeterminism & Savitch’s Theorem


We define a nondeterministic TM (NTM) as a set of states Q, input alphabet \Sigma, tape alphabet \Gamma, and a transition function \delta where

\delta: Q\times \Gamma \rightarrow \mathcal{P}(Q \times \Gamma \times \{L,R\})

So \delta(q,\gamma) = \{(q_1, \gamma_1, L), (q_2, \gamma_2, L), (q_3, \gamma_3, R), \ldots \}. Instead of a single transition state, we now have a set of transition states.

We then say a NTM accepts input x if there exists ANY possible path to an accept state. There could be many paths to reject states, but as long as there is one path to an accept state, the NTM accepts.

If we consider the tree of possible transitions of an NTM, nondeterministic time corresponds to the depth of the tree, while a deterministic simulation may need time proportional to the size of the tree. Formalizing this, we obtain the following theorem.

Theorem NTIME(t(n)) \subseteq DTIME(C^{t(n)}).

Proof: Do a breadth-first search on the tree of computations, and accept when we reach an accept state. Take the constant C to be the maximum number of nondeterministic choices, C = \max_{q,\gamma} |\delta(q,\gamma)|. The tree has depth t(n) and branching factor at most C, hence O(C^{t(n)}) nodes, so the total time is C^{t(n)} = 2^{O(t(n))}.

Note that this implies that nondeterminism does not grant the power to solve any additional problems that deterministic algorithms cannot. However, nondeterminism may decrease the amount of time or space needed.


What about space? If a NTM takes space s(n), what will a DTM take? C^{s(n)}? This is as much time as it could take. (Why?)

A specific problem: is there a path in G from s to t? If |V|=n, then an NTM needs only O(\log n) space, by guessing the next vertex at each step. What about a DTM?

Theorem: Graph connectivity \in DSPACE(O(\log^2n)).

Proof: Construct the following TM:

  • PATH(a, b, k):
    • if k=0: if a=b then accept, else reject
    • if k=1, if (a,b)\in E or a=b then accept, else reject
    • else:
      • for each c\neq a,b in V:
        • if PATH(a,c, \lfloor \frac{k}{2}\rfloor) accepts, AND PATH(c,b, \lceil \frac{k}{2}\rceil) accepts, then accept
      • reject

PATH(a,b,k) accepts if \exists a path of length \le k between a and b in graph G=(V,E), and rejects otherwise.

It needs O(\log n) space per level of recursion to store the vertex names a, b, c and the counter k. There are \log_2 k\le \log_2 n levels of recursion. So the total space needed is O(\log^2 n).
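The recursion can be sketched directly (the adjacency dict, whose keys are assumed to list all vertices, is a representation choice of this sketch; note that Python’s call stack stores more than the O(\log n) names per level, so this illustrates the recursion pattern rather than achieving the space bound):

```python
def path(adj, a, b, k):
    """Is there a path of length <= k from a to b? adj maps vertex -> set of neighbours."""
    if k == 0:
        return a == b
    if k == 1:
        return a == b or b in adj[a]
    # Guess a midpoint c and solve two half-length subproblems; a real TM
    # reuses the same work space for both recursive calls.
    return any(path(adj, a, c, k // 2) and path(adj, c, b, k - k // 2)
               for c in adj)
```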

Theorem [Savitch]: NSPACE(s(n))\subseteq DSPACE(O(s(n)^2)).

Proof: Consider the graph of all possible configurations. The number of vertices is |V|=|Q|\cdot (s(n) + n) \cdot |\Gamma|^{s(n)}. If a path exists from the starting configuration to the accepting configuration, it has length \le |V|-1. The space needed by a DTM to check whether such a path exists is O(\log^2|V|)=O((s(n)+\log n)^2)=O(s(n)^2), using s(n) \ge \log n.