Models for Chains and Trees
Exercise 11.1
Notice that only the costs for \(\{ V_{n, 2} \}_{n = 1}^N\) have changed
compared to the example in the book. This exercise is rather tedious for
anyone who already has experience with dynamic programming.
Exercise 11.2
The Viterbi algorithm takes \(\mathcal{O}\left( N K^2 \right)\) time for a model
with \(N\) variables, each of which takes on \(K\) possible values.
Reformulating this as a graph \(G\), there are \(|V| = NK\) nodes and
\(|E| = (N - 1) K^2\) directed edges.
Dijkstra’s algorithm finds the shortest paths from a single source node to
every other node in a graph. As illustrated in Figure 11.23, \(G\) needs to be
augmented with two additional nodes (a source and a sink) and \(2K\) edges
i.e. \(G' = (V', E')\) where
\(\left\vert V' \right\vert = NK + 2\) and
\(\left\vert E' \right\vert = (N - 1) K^2 + 2K\).
Fredman and Tarjan have shown that Dijkstra’s algorithm with a Fibonacci heap
achieves a worst-case running time of
\[\mathcal{O}\left(
\left\vert E' \right\vert +
\left\vert V' \right\vert \log \left\vert V' \right\vert
\right) =
\mathcal{O}\left( (N - 1) K^2 + 2K + (NK + 2) \log(NK + 2) \right).\]
For DAGs, performing a topological sort and then relaxing each edge in that
order takes
\[\mathcal{O}\left(
\left\vert E' \right\vert + \left\vert V' \right\vert
\right) =
\mathcal{O}\left( (N - 1) K^2 + 2K + (NK + 2) \right).\]
Single-source shortest-path (SSSP) search can outperform dynamic programming
when \(K \gg N\).
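For concreteness, here is a minimal sketch of the DAG variant: the trellis
nodes are processed column by column, which is a topological order, so each of
the \((N - 1) K^2 + 2K\) edges is relaxed exactly once. The array names and
shapes (`unary` for the \(U_n\) costs, `pairwise` for the \(P_n\) costs) are
assumptions of this sketch, not notation from the book.

```python
import numpy as np

def trellis_shortest_path(unary, pairwise):
    """Shortest source-to-sink path through the trellis G'.

    unary[n, k]       -- cost U_n(k) of variable n taking value k, shape (N, K)
    pairwise[n, i, j] -- cost P_n(i, j) of the edge from value i at variable n
                         to value j at variable n + 1, shape (N - 1, K, K)
    """
    N, K = unary.shape
    dist = np.full((N, K), np.inf)
    dist[0] = unary[0]                    # the K edges leaving the source node
    for n in range(N - 1):                # relax the K^2 edges between columns n and n + 1
        dist[n + 1] = unary[n + 1] + np.min(dist[n][:, None] + pairwise[n], axis=0)
    return dist[-1].min()                 # the K zero-cost edges into the sink node
```

Relaxing the trellis in topological order recovers exactly the min-sum
(Viterbi) recursion, so the DAG formulation matches dynamic programming up to
constant factors; any advantage of Dijkstra’s algorithm comes from being able
to stop as soon as the sink node is settled.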
Exercise 11.3
The overall joint probability is
\[\begin{split}Pr(\mathbf{x}_{1 \ldots 6}, w_{1 \ldots 6})
&= Pr(\mathbf{x}_{1 \ldots 6} \mid w_{1 \ldots 6})
Pr(w_{1 \ldots 6})\\
&= \prod_{i = 1}^6 Pr(\mathbf{x}_i \mid w_i) \cdot
Pr(w_6) Pr(w_5 \mid w_6) Pr(w_2 \mid w_5) Pr(w_4 \mid w_5)
Pr(w_1 \mid w_2) Pr(w_3 \mid w_4).\end{split}\]
The cost function for the MAP estimation is
\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max}\\
\DeclareMathOperator*{\argmin}{arg\,min}\\
\argmax_{\mathbf{w}}
Pr(w_{1 \ldots 6} \mid \mathbf{x}_{1 \ldots 6})
&= \argmin_{\mathbf{w}}
-\log Pr(\mathbf{x}_{1 \ldots 6}, w_{1 \ldots 6})
& \quad & \text{(11.7)}\\
&= \argmin_{\mathbf{w}}
\left( \sum_{n = 1}^6 U_n(w_n) \right) -
\log Pr(w_6) + P_5(w_5, w_6) + P_2(w_2, w_5) + P_4(w_4, w_5) +
P_1(w_1, w_2) + P_3(w_3, w_4)
& \quad & \text{(11.10).}\end{split}\]
Notice how reversing the graphical model removed the three-wise cost
\(T_5\) i.e. the largest number of incoming connections is reduced to one.
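For concreteness, the cost in (11.10) can be evaluated directly from lookup
tables. A minimal sketch, where the table names (`U`, `P`, `prior6`) and their
indexing conventions are assumptions; `U[n]` and `prior6` are 1-D cost arrays
and `P[c]` is a 2-D cost array:

```python
def tree_cost(w, U, P, prior6):
    """Evaluate -log Pr(x_{1..6}, w_{1..6}) as in (11.10) for a labelling w.

    w[n]       -- candidate label for variable n, n = 1..6
    U[n][k]    -- unary cost -log Pr(x_n | w_n = k)
    P[c][i, j] -- pairwise cost -log Pr(w_c = i | parent(w_c) = j)
    prior6[k]  -- -log Pr(w_6 = k)
    """
    parents = {5: 6, 2: 5, 4: 5, 1: 2, 3: 4}   # child -> parent in the reversed tree
    cost = sum(U[n][w[n]] for n in range(1, 7)) + prior6[w[6]]
    cost += sum(P[c][w[c], w[p]] for c, p in parents.items())
    return cost
```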
Exercise 11.4
This exercise is rather tedious for anyone who already has experience with
dynamic programming. The only novel item is (11.19), but that’s just a 3D
lookup table.
Exercise 11.5
\[\begin{split}\newcommand\qqquad{\hskip3em}
\begin{align}
\hat{w}_N
&= \argmax_{w_N} \log Pr(\mathbf{x}_N \mid w_N) +
\max_{w_1} \log Pr(\mathbf{x}_1 \mid w_1) +
\max_{w_2} \log Pr(\mathbf{x}_2 \mid w_2) +
\log Pr(w_2 \mid w_1) +\\
&\qquad \max_{w_3} \log Pr(\mathbf{x}_3 \mid w_3) +
\log Pr(w_3 \mid w_2) + \ldots\\
&\qqquad \max_{w_{N - 1}} \log Pr(\mathbf{x}_{N - 1} \mid w_{N - 1}) +
\log Pr(w_{N - 1} \mid w_{N - 2}) +
\log Pr(w_N \mid w_{N - 1})\\
&= \argmax_{w_N} \log Pr(\mathbf{x}_N \mid w_N) +
\max_{w_{N - 1}} \log Pr(w_N \mid w_{N - 1}) +
\log Pr(\mathbf{x}_{N - 1} \mid w_{N - 1}) +\\
&\qquad \max_{w_{N - 2}} \log Pr(w_{N - 1} \mid w_{N - 2}) +
\log Pr(\mathbf{x}_{N - 2} \mid w_{N - 2}) + \ldots\\
&\qqquad \max_{w_2} \log Pr(w_3 \mid w_2) +
\log Pr(\mathbf{x}_2 \mid w_2) +
\max_{w_1} \log Pr(w_2 \mid w_1) +
\log Pr(\mathbf{x}_1 \mid w_1)\\
&= \argmax_{w_N} f_N[w_N]
\end{align}\end{split}\]
where
\[f_n[w_n] =
\log Pr(\mathbf{x}_n \mid w_n) +
\max_{w_{n - 1}} \log Pr(w_n \mid w_{n - 1}) + f_{n - 1}[w_{n - 1}]\]
and
\[f_1[w_1] = \log Pr(\mathbf{x}_1 \mid w_1).\]
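The recursion above maps directly onto code. A minimal sketch, assuming the
log-likelihoods and log-transition probabilities are supplied as arrays (the
names `log_lik` and `log_trans` are hypothetical):

```python
import numpy as np

def map_last_state(log_lik, log_trans):
    """MAP estimate of w_N via the forward recursion f_n.

    log_lik[n, k]   -- log Pr(x_{n+1} | w_{n+1} = k), zero-based n = 0..N-1
    log_trans[i, j] -- log Pr(w_n = j | w_{n-1} = i)
    """
    N, K = log_lik.shape
    f = log_lik[0]                                # f_1[w_1] = log Pr(x_1 | w_1)
    for n in range(1, N):
        # f_n[w_n] = log Pr(x_n | w_n) + max_{w_{n-1}} [log Pr(w_n | w_{n-1}) + f_{n-1}[w_{n-1}]]
        f = log_lik[n] + np.max(log_trans + f[:, None], axis=0)
    return int(np.argmax(f))                      # hat{w}_N = argmax_{w_N} f_N[w_N]
```

Keeping the per-step argmax values as back-pointers recovers the full MAP
sequence rather than just \(\hat{w}_N\).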
Exercise 11.6
Given the joint probability
\[Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N}) =
\prod_{n = 1}^N Pr(\mathbf{x}_n \mid w_n) \cdot
Pr(w_1) \prod_{n = 2}^N Pr(w_n \mid w_{n - 1}),\]
the marginal distribution for any arbitrary variable \(w_n\) is
\[\begin{split}\begin{align}
Pr(w_n \mid \mathbf{x}_{1 \ldots N})
&= \frac{
Pr(w_n, \mathbf{x}_{1 \ldots N})
}{
Pr(\mathbf{x}_{1 \ldots N})
}\\
&\propto Pr(w_n, \mathbf{x}_{1 \ldots N})\\
&= \sum_{w_1} \cdots \sum_{w_{n - 1}} \sum_{w_{n + 1}} \cdots \sum_{w_N}
Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N})\\
&= Pr(\mathbf{x}_n \mid w_n)
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots\\
&\qquad \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots\\
&\qqquad \sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1})\\
&= Pr(\mathbf{x}_n \mid w_n) \alpha_{n - 1}[w_n] \beta_{n + 1}[w_n]
& \quad & \text{(a) and (b)}
\end{align}\end{split}\]
(a)
\[\begin{split}& \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\\
&= \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1})
\sum_{w_{n - 2}} Pr(\mathbf{x}_{n - 2} \mid w_{n - 2})
Pr(w_{n - 1} \mid w_{n - 2}) \cdots
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1)\\
&= \alpha_{n - 1}[w_n]\end{split}\]
where
\[\alpha_{n - 1}[w_n] =
\sum_{w_{n - 1}}
Pr(\mathbf{x}_{n - 1} \mid w_{n - 1}) Pr(w_n \mid w_{n - 1})
\alpha_{n - 2}[w_{n - 1}]\]
and
\[\alpha_1[w_2] =
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1).\]
(b)
\[\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1}) Pr(w_{n + 1} \mid w_n)
\sum_{w_{n + 2}} Pr(\mathbf{x}_{n + 2} \mid w_{n + 2})
Pr(w_{n + 2} \mid w_{n + 1}) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}) =
\beta_{n + 1}[w_n]\]
where
\[\beta_{n + 1}[w_n] =
\sum_{w_{n + 1}}
Pr(\mathbf{x}_{n + 1} \mid w_{n + 1}) Pr(w_{n + 1} \mid w_n)
\beta_{n + 2}[w_{n + 1}]\]
and
\[\beta_N[w_{N - 1}] =
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}).\]
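A minimal numpy sketch of these recursions, assuming the prior, transition
table, and per-variable likelihoods are supplied as arrays (the names `prior`,
`A`, and `lik` are hypothetical; with zero-based indexing, `alpha[n]` holds
\(\alpha_n[\cdot]\) and `beta[n]` holds \(\beta_{n + 2}[\cdot]\), and the base
case \(\alpha_0[w_1] = Pr(w_1)\) from Exercise 11.7 covers the \(n = 1\)
marginal):

```python
import numpy as np

def chain_marginals(prior, A, lik):
    """Posterior marginals Pr(w_n | x_{1..N}) for the chain model.

    prior[k]  -- Pr(w_1 = k)
    A[i, j]   -- Pr(w_n = j | w_{n-1} = i)
    lik[n, k] -- Pr(x_{n+1} | w_{n+1} = k), zero-based n = 0..N-1
    """
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = prior                              # alpha_0[w_1] = Pr(w_1)
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] * lik[n - 1]) @ A
    beta = np.ones((N, K))                        # beta_{N+1}[w_N] = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (lik[n + 1] * beta[n + 1])
    post = lik * alpha * beta                     # Pr(x_n | w_n) alpha_{n-1}[w_n] beta_{n+1}[w_n]
    return post / post.sum(axis=1, keepdims=True)
```

Both passes visit each variable once and sum over \(K\) states per entry, so
the overall cost is \(\mathcal{O}\left( N K^2 \right)\), as opposed to the
\(\mathcal{O}\left( K^N \right)\) of direct marginalization.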
Exercise 11.7
Given the joint probability
\[Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N}) =
\prod_{n = 1}^N Pr(\mathbf{x}_n \mid w_n) \cdot
Pr(w_1) \prod_{n = 2}^N Pr(w_n \mid w_{n - 1}),\]
the marginal distribution for any arbitrary variable pairs \(w_m\) and
\(w_n\) where \(1 \leq m < n \leq N\) is
\[\begin{split}Pr(w_m, w_n \mid \mathbf{x}_{1 \ldots N})
&= \frac{
Pr(w_m, w_n, \mathbf{x}_{1 \ldots N})
}{
Pr(\mathbf{x}_{1 \ldots N})
}\\
&\propto Pr(w_m, w_n, \mathbf{x}_{1 \ldots N})\\
&= \sum_{w_1} \cdots \sum_{w_{m - 1}} \sum_{w_{m + 1}} \cdots
\sum_{w_{n - 1}} \sum_{w_{n + 1}} \cdots
\sum_{w_N} Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N})\\
&= Pr(\mathbf{x}_m \mid w_m) Pr(\mathbf{x}_n \mid w_n)
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots\\
&\qquad \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_{m - 1} \mid w_{m - 2}) Pr(w_m \mid w_{m - 1})
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots\\
&\qqquad \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N)
Pr(w_N \mid w_{N - 1})\\
&= Pr(\mathbf{x}_m \mid w_m) Pr(\mathbf{x}_n \mid w_n)
\alpha_{m - 1}[w_m] \gamma[w_m, w_n] \beta_{n + 1}[w_n]
& \quad & \text{(a), (b), and (c)}\end{split}\]
(a)
\[\begin{split}& \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots
\sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_{m - 1} \mid w_{m - 2}) Pr(w_m \mid w_{m - 1})\\
&= \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_m \mid w_{m - 1})
\sum_{w_{m - 2}} Pr(\mathbf{x}_{m - 2} \mid w_{m - 2})
Pr(w_{m - 1} \mid w_{m - 2}) \cdots
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1)\\
&= \alpha_{m - 1}[w_m]\end{split}\]
where
\[\begin{split}\alpha_{m - 1}[w_m]
&= \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_m \mid w_{m - 1}) \alpha_{m - 2}[w_{m - 1}],\\\\\\
\alpha_1[w_2]
&= \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1),\\\\\\
\alpha_0[w_1] &= Pr(w_1).\end{split}\]
(b)
\[\beta_{n + 1}[w_n] =
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1})\]
where
\[\begin{split}\beta_{n + 1}[w_n]
&= \sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \beta_{n + 2}[w_{n + 1}],\\\\\\
\beta_N[w_{N - 1}]
&= \sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}),\\\\\\
\beta_{N + 1}[w_N] &= 1.\end{split}\]
(c)
\[\gamma[w_m, w_n] =
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\]
Notice that
\[\begin{split}Pr(w_m, w_n \mid \mathbf{x}_{1 \ldots N})
&= Pr(w_m \mid \mathbf{x}_{1 \ldots N})
Pr(w_n \mid w_m, \mathbf{x}_{1 \ldots N})\\
&= Pr(w_m \mid \mathbf{x}_{1 \ldots N})
Pr(w_n \mid w_m, \mathbf{x}_{(m + 1) \ldots N})\end{split}\]
is reduced to \(Pr(w_m \mid \mathbf{x}_{1 \ldots N})\) when
\(m = n\) i.e. \(\gamma = 1\).
When \(n = m + 1\), the expression reduces to
\(\gamma = Pr(w_n \mid w_m)\) because \(w_n\) and \(w_m\) are not
included in the marginalization.
The derivation of the last case when \(n > m + 1\) is similar to (a):
\[\begin{split}& \sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\\
&= \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1})
\sum_{w_{n - 2}} Pr(\mathbf{x}_{n - 2} \mid w_{n - 2})
Pr(w_{n - 1} \mid w_{n - 2}) \cdots
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 2} \mid w_{m + 1}) Pr(w_{m + 1} \mid w_m)\\
&= \gamma_{n - 1}[w_m, w_n]\end{split}\]
where
\[\gamma_{n - 1}[w_m, w_n] =
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1}) \gamma_{n - 2}[w_m, w_{n - 1}]\]
and
\[\gamma_{m + 1}[w_m, w_{m + 2}] =
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 2} \mid w_{m + 1}) Pr(w_{m + 1} \mid w_m),\]
so the quantity in the main expression is
\(\gamma[w_m, w_n] = \gamma_{n - 1}[w_m, w_n]\).
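Building on the (hypothetical) array conventions from the Exercise 11.6
sketch, the pairwise posterior is assembled from \(\alpha\), \(\gamma\), and
\(\beta\); `gamma` starts as the transition table (the \(n = m + 1\) case) and
absorbs one emission and one transition per intermediate variable:

```python
import numpy as np

def pairwise_marginal(prior, A, lik, m, n):
    """Posterior Pr(w_m, w_n | x_{1..N}) as a K x K table, with 1 <= m < n <= N."""
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = prior
    for t in range(1, N):                 # forward pass, as in Exercise 11.6
        alpha[t] = (alpha[t - 1] * lik[t - 1]) @ A
    beta = np.ones((N, K))
    for t in range(N - 2, -1, -1):        # backward pass, as in Exercise 11.6
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    gamma = A.copy()                      # gamma = Pr(w_{m+1} | w_m), the n = m + 1 case
    for t in range(m, n - 1):             # absorb x_{m+1}, ..., x_{n-1} and their transitions
        gamma = (gamma * lik[t]) @ A
    post = np.outer(lik[m - 1] * alpha[m - 1], lik[n - 1] * beta[n - 1]) * gamma
    return post / post.sum()
```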
Exercise 11.8
One can conclude that factor graphs reveal which cliques of an undirected
graphical model the potential functions are defined over: both factorizations
below induce the same undirected graph, but their factor graphs differ.
(1)
\[Pr(x_1, x_2, x_3) =
\frac{1}{Z_1} \phi_{12}[x_1, x_2] \phi_{23}[x_2, x_3] \phi_{31}[x_3, x_1]\]
(1.a)
\[\begin{split}(x_1) \leftrightarrow (x_2)\\
(x_2) \leftrightarrow (x_3)\\
(x_3) \leftrightarrow (x_1)\end{split}\]
(1.b)
\[\begin{split}(x_1) -[A]- (x_2)\\
(x_2) -[B]- (x_3)\\
(x_3) -[C]- (x_1)\end{split}\]
(2)
\[Pr(x_1, x_2, x_3) =
\frac{1}{Z_2} \phi_{123}[x_1, x_2, x_3]\]
(2.a)
\[(x_1) \leftrightarrow (x_2) \leftrightarrow (x_3)\]
(2.b)
\[\begin{split}(x_1) -[A]-\\
(x_2) -[A]-\\
(x_3) -[A]-\end{split}\]
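A small sketch that makes this concrete: both factorizations induce the same
undirected graph (the triangle), yet their factor graphs differ, so only the
factor graph records whether pairwise or three-way potentials are used. The
dictionary representation below is an assumption of this sketch.

```python
from itertools import combinations

# the two factorizations above, written as factor -> variables it touches
pairwise = {"A": ("x1", "x2"), "B": ("x2", "x3"), "C": ("x3", "x1")}
triple = {"A": ("x1", "x2", "x3")}

def undirected_edges(factors):
    """Edges of the undirected graphical model implied by a factorization."""
    return {frozenset(pair) for scope in factors.values()
            for pair in combinations(scope, 2)}

# identical undirected graphs, different factor graphs
assert undirected_edges(pairwise) == undirected_edges(triple)
assert pairwise != triple
```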
Exercise 11.9
[Bro] was very useful in answering how to define the factors of a
PDF for undirected graphical models. In short, anything goes as long as the set
of potential functions \(\phi\) for a particular factorization accounts for
the structure of the graph (e.g. edges being shared between cliques).
According to [KFL01], factors that depend on a single variable
are allowed. This makes sense because the independence of the nodes would
otherwise be lost.
(a) and (d) are the only ones that do not have a cycle i.e. their factor
graphs are trees.
(a)
\[Pr(w_{1 \ldots 6}) =
Pr(w_2) Pr(w_6) Pr(w_1 | w_2) Pr(w_5 | w_2, w_6) Pr(w_4 | w_5) Pr(w_3 | w_4)\]
\[\begin{split}(w_2) -[A]\\
(w_6) -[B]\\
(w_1) -[C]- (w_2)\\
(w_5) -[D]-\\
(w_2) -[D]-\\
(w_6) -[D]-\\
(w_4) -[E]- (w_5)\\
(w_3) -[F]- (w_4)\end{split}\]
(b)
\[Pr(w_{1 \ldots 6}) =
\frac{1}{Z} \phi_1[w_1, w_2, w_4] \phi_2[w_1, w_3, w_4]
\phi_3[w_2, w_4, w_5] \phi_4[w_5, w_6]\]
\[\begin{split}(w_1) -[A]-\\
(w_2) -[A]-\\
(w_4) -[A]-\\
(w_1) -[B]-\\
(w_3) -[B]-\\
(w_4) -[B]-\\
(w_2) -[C]-\\
(w_4) -[C]-\\
(w_5) -[C]-\\
(w_5) -[D]- (w_6)\end{split}\]
(c)
\[Pr(w_{1 \ldots 6}) =
Pr(w_2) Pr(w_6) Pr(w_1 | w_2) Pr(w_5 | w_2, w_6)
Pr(w_4 | w_1, w_5) Pr(w_3 | w_4)\]
\[\begin{split}(w_2) -[A]-\\
(w_6) -[B]-\\
(w_1) -[C]- (w_2)\\
(w_5) -[D]-\\
(w_2) -[D]-\\
(w_6) -[D]-\\
(w_4) -[E]-\\
(w_1) -[E]-\\
(w_5) -[E]-\\
(w_3) -[F]- (w_4)\end{split}\]
(d)
\[Pr(w_{1 \ldots 6}) =
\frac{1}{Z} \phi_1[w_1, w_3, w_4] \phi_2[w_2, w_4, w_5] \phi_3[w_5, w_6]\]
\[\begin{split}(w_1) -[A]-\\
(w_3) -[A]-\\
(w_4) -[A]-\\
(w_2) -[B]-\\
(w_4) -[B]-\\
(w_5) -[B]-\\
(w_5) -[C]- (w_6)\end{split}\]
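The cycle claims can be checked mechanically: a factor graph is a tree exactly
when it is connected and has one fewer edge than it has nodes (variables plus
factors). A small sketch, with the factorizations written as plain
dictionaries (an assumption of this sketch):

```python
from collections import deque

def is_tree(factors):
    """True if the bipartite factor graph is connected and acyclic.

    factors -- dict mapping a factor name to the tuple of variables it touches
    """
    adj = {f: set(scope) for f, scope in factors.items()}
    for f, scope in factors.items():
        for v in scope:
            adj.setdefault(v, set()).add(f)
    n_nodes = len(adj)
    n_edges = sum(len(scope) for scope in factors.values())
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:                                  # breadth-first search for connectivity
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == n_nodes and n_edges == n_nodes - 1

fg_a = {"A": ("w2",), "B": ("w6",), "C": ("w1", "w2"),
        "D": ("w5", "w2", "w6"), "E": ("w4", "w5"), "F": ("w3", "w4")}
fg_c = dict(fg_a, E=("w4", "w1", "w5"))           # (c) adds w_1 to the factor for w_4
print(is_tree(fg_a), is_tree(fg_c))               # True False
```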
Exercise 11.10
Notice that each variable \(w_i\) is connected to \(w_{i + 1}\) and
\(w_{i + 2}\). To reduce this to a chain model in which each variable depends
on a single predecessor, define \(w'_j = \{ w_{2j - 1}, w_{2j} \}\) for
\(j = 1, 2, \ldots, N / 2\) (assuming \(N\) is even).
According to Exercise 11.2, an HMM with
\(N\) variables, each taking \(K\) values, has a time complexity of
\(\mathcal{O}\left( N K^2 \right)\) using the Viterbi algorithm.
The combined chain model has \(N / 2\) variables, each taking
\(K^2\) values, hence the overall time complexity is
\(\mathcal{O}\left( (N / 2) (K^2)^2 \right) = \mathcal{O}\left( N K^4 \right)\).
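A minimal sketch of the merging step, assuming \(N\) is even and that the
second-order transition probabilities are stored in a \(K \times K \times K\)
array (the name `trans` and its index convention are assumptions of this
sketch):

```python
import numpy as np

def merge_pairs(trans):
    """Build the transition table over super-variables w'_j = (w_{2j-1}, w_{2j}).

    trans[a, b, c] -- Pr(w_n = c | w_{n-2} = a, w_{n-1} = b)
    returns a (K^2, K^2) table indexed by (previous pair, next pair)
    """
    K = trans.shape[0]
    merged = np.zeros((K * K, K * K))
    for a in range(K):                  # w_{2j-3}
        for b in range(K):              # w_{2j-2}
            for c in range(K):          # w_{2j-1}
                for d in range(K):      # w_{2j}
                    merged[a * K + b, c * K + d] = trans[a, b, c] * trans[b, c, d]
    return merged
```

Running the standard Viterbi recursion on this \(K^2\)-state chain costs
\(\mathcal{O}\left( (N / 2) (K^2)^2 \right) = \mathcal{O}\left( N K^4 \right)\),
matching the estimate above.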
Exercise 11.11
Instead of organizing each scanline into a chain model, an \(n \times n\)
grid model could be imposed over the pixels of each image.
Exercise 11.12
Run belief propagation on the factor graph [Wei00].
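For reference, a minimal sketch of sum-product loopy belief propagation for a
model with unary and pairwise potentials (the data structures are assumptions
of this sketch; on a graph with loops the messages need not converge, which is
exactly the issue analysed in [Wei00]):

```python
import numpy as np

def loopy_bp(unary, pairwise, iters=50):
    """Sum-product belief propagation with synchronous message updates.

    unary    -- dict {i: (K_i,) nonnegative potential over variable i}
    pairwise -- dict {(i, j): (K_i, K_j) potential}, one entry per edge, i < j
    """
    msgs = {}
    for i, j in pairwise:                         # one message per edge direction
        msgs[(i, j)] = np.ones(len(unary[j]))
        msgs[(j, i)] = np.ones(len(unary[i]))
    for _ in range(iters):
        new = {}
        for s, t in msgs:
            # product of the unary potential at s and all incoming messages except t's
            prod = unary[s].copy()
            for u, v in msgs:
                if v == s and u != t:
                    prod = prod * msgs[(u, v)]
            phi = pairwise[(s, t)] if (s, t) in pairwise else pairwise[(t, s)].T
            m = prod @ phi                        # marginalise out the sender s
            new[(s, t)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in unary:                               # belief = unary times all incoming messages
        b = unary[i].copy()
        for u, v in msgs:
            if v == i:
                b = b * msgs[(u, v)]
        beliefs[i] = b / b.sum()
    return beliefs
```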
References
- Bro
Matthew Brown. Probabilistic graphical models - undirected graphical models. http://webdocs.cs.ualberta.ca/~greiner/C-651/SLIDES/MB01_UndirectedModels1.pdf. Accessed on 2017-07-16.
- KFL01
Frank R. Kschischang, Brendan J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
- Wei00
Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.