Models for Chains and Trees
Exercise 11.1
Notice that only the costs for \(\{ V_{n, 2} \}_{n = 1}^N\) have changed
compared to the example in the book. This exercise is rather tedious for
anyone who already has experience with dynamic programming.
Exercise 11.2
The Viterbi algorithm takes \(\mathcal{O}\left( N K^2 \right)\) time for a model
with \(N\) variables, each of which takes on \(K\) possible values.
Reformulating this as a graph \(G\), there are \(|V| = NK\) nodes and
\(|E| = (N - 1) K^2\) directed edges.
Dijkstra’s algorithm finds the shortest paths from a single source node to
every other node in a graph. As illustrated in Figure 11.23, \(G\) needs to be
augmented with two additional nodes (a source and a sink) and \(2K\) edges
i.e. \(G' = (V', E')\) where
\(\left\vert V' \right\vert = NK + 2\) and
\(\left\vert E' \right\vert = (N - 1) K^2 + 2K\).
Fredman and Tarjan have shown that Dijkstra’s algorithm with a Fibonacci heap
achieves a worst-case running time of
\[\mathcal{O}\left(
\left\vert E' \right\vert +
\left\vert V' \right\vert \log \left\vert V' \right\vert
\right) =
\mathcal{O}\left( (N - 1) K^2 + 2K + (NK + 2) \log(NK + 2) \right).\]
For DAGs, performing a topological sort and then relaxing each edge in that
order takes
\[\mathcal{O}\left(
\left\vert E' \right\vert + \left\vert V' \right\vert
\right) =
\mathcal{O}\left( (N - 1) K^2 + 2K + (NK + 2) \right).\]
Single-source shortest-path (SSSP) search can outperform dynamic programming
when \(K \gg N\).
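For concreteness, here is a minimal sketch of the DAG variant: the trellis
nodes are processed column by column, which is a topological order, so each of
the \((N - 1) K^2 + 2K\) edges is relaxed exactly once. The array names and
shapes (`unary` for the \(U_n\) costs, `pairwise` for the \(P_n\) costs) are
assumptions of this sketch, not notation from the book.

```python
import numpy as np

def trellis_shortest_path(unary, pairwise):
    """Shortest source-to-sink path through the trellis G'.

    unary[n, k]       -- cost U_n(k) of variable n taking value k, shape (N, K)
    pairwise[n, i, j] -- cost P_n(i, j) of the edge from value i at variable n
                         to value j at variable n + 1, shape (N - 1, K, K)
    """
    N, K = unary.shape
    dist = np.full((N, K), np.inf)
    dist[0] = unary[0]                    # the K edges leaving the source node
    for n in range(N - 1):                # relax the K^2 edges between columns n and n + 1
        dist[n + 1] = unary[n + 1] + np.min(dist[n][:, None] + pairwise[n], axis=0)
    return dist[-1].min()                 # the K zero-cost edges into the sink node
```

Relaxing the trellis in topological order recovers exactly the min-sum
(Viterbi) recursion, so the DAG formulation matches dynamic programming up to
constant factors; any advantage of Dijkstra’s algorithm comes from being able
to stop as soon as the sink node is settled.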
Exercise 11.3
The overall joint probability is
\[\begin{split}Pr(\mathbf{x}_{1 \ldots 6}, w_{1 \ldots 6})
&= Pr(\mathbf{x}_{1 \ldots 6} \mid w_{1 \ldots 6})
Pr(w_{1 \ldots 6})\\
&= \prod_{i = 1}^6 Pr(\mathbf{x}_i \mid w_i) \cdot
Pr(w_6) Pr(w_5 \mid w_6) Pr(w_2 \mid w_5) Pr(w_4 \mid w_5)
Pr(w_1 \mid w_2) Pr(w_3 \mid w_4).\end{split}\]
The cost function for the MAP estimation is
\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max}\\
\DeclareMathOperator*{\argmin}{arg\,min}\\
\argmax_{\mathbf{w}}
Pr(w_{1 \ldots 6} \mid \mathbf{x}_{1 \ldots 6})
&= \argmin_{\mathbf{w}}
-\log Pr(\mathbf{x}_{1 \ldots 6}, w_{1 \ldots 6})
& \quad & \text{(11.7)}\\
&= \argmin_{\mathbf{w}}
\left( \sum_{n = 1}^6 U_n(w_n) \right) -
\log Pr(w_6) + P_5(w_5, w_6) + P_2(w_2, w_5) + P_4(w_4, w_5) +
P_1(w_1, w_2) + P_3(w_3, w_4)
& \quad & \text{(11.10).}\end{split}\]
Notice how reversing the graphical model removed the three-wise cost
\(T_5\) i.e. the largest number of incoming connections is reduced to one.
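For concreteness, the cost in (11.10) can be evaluated directly from lookup
tables. A minimal sketch, where the table names (`U`, `P`, `prior6`) and their
indexing conventions are assumptions; `U[n]` and `prior6` are 1-D cost arrays
and `P[c]` is a 2-D cost array:

```python
def tree_cost(w, U, P, prior6):
    """Evaluate -log Pr(x_{1..6}, w_{1..6}) as in (11.10) for a labelling w.

    w[n]       -- candidate label for variable n, n = 1..6
    U[n][k]    -- unary cost -log Pr(x_n | w_n = k)
    P[c][i, j] -- pairwise cost -log Pr(w_c = i | parent(w_c) = j)
    prior6[k]  -- -log Pr(w_6 = k)
    """
    parents = {5: 6, 2: 5, 4: 5, 1: 2, 3: 4}   # child -> parent in the reversed tree
    cost = sum(U[n][w[n]] for n in range(1, 7)) + prior6[w[6]]
    cost += sum(P[c][w[c], w[p]] for c, p in parents.items())
    return cost
```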
Exercise 11.4
This exercise is rather tedious for anyone who already has experience with
dynamic programming. The only novel item is (11.19), but that’s just a 3D
lookup table.
Exercise 11.5
\[\begin{split}\newcommand\qqquad{\hskip3em}
\begin{align}
\hat{w}_N
&= \argmax_{w_N} \log Pr(\mathbf{x}_N \mid w_N) +
\max_{w_1} \log Pr(\mathbf{x}_1 \mid w_1) +
\max_{w_2} \log Pr(\mathbf{x}_2 \mid w_2) +
\log Pr(w_2 \mid w_1) +\\
&\qquad \max_{w_3} \log Pr(\mathbf{x}_3 \mid w_3) +
\log Pr(w_3 \mid w_2) + \ldots\\
&\qqquad \max_{w_{N - 1}} \log Pr(\mathbf{x}_{N - 1} \mid w_{N - 1}) +
\log Pr(w_{N - 1} \mid w_{N - 2}) +
\log Pr(w_N \mid w_{N - 1})\\
&= \argmax_{w_N} \log Pr(\mathbf{x}_N \mid w_N) +
\max_{w_{N - 1}} \log Pr(w_N \mid w_{N - 1}) +
\log Pr(\mathbf{x}_{N - 1} \mid w_{N - 1}) +\\
&\qquad \max_{w_{N - 2}} \log Pr(w_{N - 1} \mid w_{N - 2}) +
\log Pr(\mathbf{x}_{N - 2} \mid w_{N - 2}) + \ldots\\
&\qqquad \max_{w_2} \log Pr(w_3 \mid w_2) +
\log Pr(\mathbf{x}_2 \mid w_2) +
\max_{w_1} \log Pr(w_2 \mid w_1) +
\log Pr(\mathbf{x}_1 \mid w_1)\\
&= \argmax_{w_N} f_N[w_N]
\end{align}\end{split}\]
where
\[f_n[w_n] =
\log Pr(\mathbf{x}_n \mid w_n) +
\max_{w_{n - 1}} \log Pr(w_n \mid w_{n - 1}) + f_{n - 1}[w_{n - 1}]\]
and
\[f_1[w_1] = \log Pr(\mathbf{x}_1 \mid w_1).\]
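The recursion above maps directly onto code. A minimal sketch, assuming the
log-likelihoods and log-transition probabilities are supplied as arrays (the
names `log_lik` and `log_trans` are hypothetical):

```python
import numpy as np

def map_last_state(log_lik, log_trans):
    """MAP estimate of w_N via the forward recursion f_n.

    log_lik[n, k]   -- log Pr(x_{n+1} | w_{n+1} = k), zero-based n = 0..N-1
    log_trans[i, j] -- log Pr(w_n = j | w_{n-1} = i)
    """
    N, K = log_lik.shape
    f = log_lik[0]                                # f_1[w_1] = log Pr(x_1 | w_1)
    for n in range(1, N):
        # f_n[w_n] = log Pr(x_n | w_n) + max_{w_{n-1}} [log Pr(w_n | w_{n-1}) + f_{n-1}[w_{n-1}]]
        f = log_lik[n] + np.max(log_trans + f[:, None], axis=0)
    return int(np.argmax(f))                      # hat{w}_N = argmax_{w_N} f_N[w_N]
```

Keeping the per-step argmax values as back-pointers recovers the full MAP
sequence rather than just \(\hat{w}_N\).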
Exercise 11.6
Given the joint probability
\[Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N}) =
\prod_{n = 1}^N Pr(\mathbf{x}_n \mid w_n) \cdot
Pr(w_1) \prod_{n = 2}^N Pr(w_n \mid w_{n - 1}),\]
the marginal distribution for any arbitrary variable \(w_n\) is
\[\begin{split}\begin{align}
Pr(w_n \mid \mathbf{x}_{1 \ldots N})
&= \frac{
Pr(w_n, \mathbf{x}_{1 \ldots N})
}{
Pr(\mathbf{x}_{1 \ldots N})
}\\
&\propto Pr(w_n, \mathbf{x}_{1 \ldots N})\\
&= \sum_{w_1} \cdots \sum_{w_{n - 1}} \sum_{w_{n + 1}} \cdots \sum_{w_N}
Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N})\\
&= Pr(\mathbf{x}_n \mid w_n)
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots\\
&\qquad \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots\\
&\qqquad \sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1})\\
&= Pr(\mathbf{x}_n \mid w_n) \alpha_{n - 1}[w_n] \beta_{n + 1}[w_n]
& \quad & \text{(a) and (b)}
\end{align}\end{split}\]
(a)
\[\begin{split}& \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\\
&= \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1})
\sum_{w_{n - 2}} Pr(\mathbf{x}_{n - 2} \mid w_{n - 2})
Pr(w_{n - 1} \mid w_{n - 2}) \cdots
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1)\\
&= \alpha_{n - 1}[w_n]\end{split}\]
where
\[\alpha_{n - 1}[w_n] =
\sum_{w_{n - 1}}
Pr(\mathbf{x}_{n - 1} \mid w_{n - 1}) Pr(w_n \mid w_{n - 1})
\alpha_{n - 2}[w_{n - 1}]\]
and
\[\alpha_1[w_2] =
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1).\]
(b)
\[\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1}) Pr(w_{n + 1} \mid w_n)
\sum_{w_{n + 2}} Pr(\mathbf{x}_{n + 2} \mid w_{n + 2})
Pr(w_{n + 2} \mid w_{n + 1}) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}) =
\beta_{n + 1}[w_n]\]
where
\[\beta_{n + 1}[w_n] =
\sum_{w_{n + 1}}
Pr(\mathbf{x}_{n + 1} \mid w_{n + 1}) Pr(w_{n + 1} \mid w_n)
\beta_{n + 2}[w_{n + 1}]\]
and
\[\beta_N[w_{N - 1}] =
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}).\]
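A minimal numpy sketch of these recursions, assuming the prior, transition
table, and per-variable likelihoods are supplied as arrays (the names `prior`,
`A`, and `lik` are hypothetical; with zero-based indexing, `alpha[n]` holds
\(\alpha_n[\cdot]\) and `beta[n]` holds \(\beta_{n + 2}[\cdot]\), and the base
case \(\alpha_0[w_1] = Pr(w_1)\) from Exercise 11.7 covers the \(n = 1\)
marginal):

```python
import numpy as np

def chain_marginals(prior, A, lik):
    """Posterior marginals Pr(w_n | x_{1..N}) for the chain model.

    prior[k]  -- Pr(w_1 = k)
    A[i, j]   -- Pr(w_n = j | w_{n-1} = i)
    lik[n, k] -- Pr(x_{n+1} | w_{n+1} = k), zero-based n = 0..N-1
    """
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = prior                              # alpha_0[w_1] = Pr(w_1)
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] * lik[n - 1]) @ A
    beta = np.ones((N, K))                        # beta_{N+1}[w_N] = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (lik[n + 1] * beta[n + 1])
    post = lik * alpha * beta                     # Pr(x_n | w_n) alpha_{n-1}[w_n] beta_{n+1}[w_n]
    return post / post.sum(axis=1, keepdims=True)
```

Both passes visit each variable once and sum over \(K\) states per entry, so
the overall cost is \(\mathcal{O}\left( N K^2 \right)\), as opposed to the
\(\mathcal{O}\left( K^N \right)\) of direct marginalization.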
Exercise 11.7
Given the joint probability
\[Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N}) =
\prod_{n = 1}^N Pr(\mathbf{x}_n \mid w_n) \cdot
Pr(w_1) \prod_{n = 2}^N Pr(w_n \mid w_{n - 1}),\]
the marginal distribution for any arbitrary variable pairs \(w_m\) and
\(w_n\) where \(1 \leq m < n \leq N\) is
\[\begin{split}Pr(w_m, w_n \mid \mathbf{x}_{1 \ldots N})
&= \frac{
Pr(w_m, w_n, \mathbf{x}_{1 \ldots N})
}{
Pr(\mathbf{x}_{1 \ldots N})
}\\
&\propto Pr(w_m, w_n, \mathbf{x}_{1 \ldots N})\\
&= \sum_{w_1} \cdots \sum_{w_{m - 1}} \sum_{w_{m + 1}} \cdots
\sum_{w_{n - 1}} \sum_{w_{n + 1}} \cdots
\sum_{w_N} Pr(\mathbf{x}_{1 \ldots N}, w_{1 \ldots N})\\
&= Pr(\mathbf{x}_m \mid w_m) Pr(\mathbf{x}_n \mid w_n)
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots\\
&\qquad \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_{m - 1} \mid w_{m - 2}) Pr(w_m \mid w_{m - 1})
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots\\
&\qqquad \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N)
Pr(w_N \mid w_{N - 1})\\
&= Pr(\mathbf{x}_m \mid w_m) Pr(\mathbf{x}_n \mid w_n)
\alpha_{m - 1}[w_m] \gamma[w_m, w_n] \beta_{n + 1}[w_n]
& \quad & \text{(a), (b), and (c)}\end{split}\]
(a)
\[\begin{split}& \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_1)
\sum_{w_2} Pr(\mathbf{x}_2 \mid w_2) Pr(w_2 \mid w_1) \cdots
\sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_{m - 1} \mid w_{m - 2}) Pr(w_m \mid w_{m - 1})\\
&= \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_m \mid w_{m - 1})
\sum_{w_{m - 2}} Pr(\mathbf{x}_{m - 2} \mid w_{m - 2})
Pr(w_{m - 1} \mid w_{m - 2}) \cdots
\sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1)\\
&= \alpha_{m - 1}[w_m]\end{split}\]
where
\[\begin{split}\alpha_{m - 1}[w_m]
&= \sum_{w_{m - 1}} Pr(\mathbf{x}_{m - 1} \mid w_{m - 1})
Pr(w_m \mid w_{m - 1}) \alpha_{m - 2}[w_{m - 1}],\\\\\\
\alpha_1[w_2]
&= \sum_{w_1} Pr(\mathbf{x}_1 \mid w_1) Pr(w_2 \mid w_1) Pr(w_1),\\\\\\
\alpha_0[w_1] &= Pr(w_1).\end{split}\]
(b)
\[\beta_{n + 1}[w_n] =
\sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \cdots
\sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1})\]
where
\[\begin{split}\beta_{n + 1}[w_n]
&= \sum_{w_{n + 1}} Pr(\mathbf{x}_{n + 1} \mid w_{n + 1})
Pr(w_{n + 1} \mid w_n) \beta_{n + 2}[w_{n + 1}],\\\\\\
\beta_N[w_{N - 1}]
&= \sum_{w_N} Pr(\mathbf{x}_N \mid w_N) Pr(w_N \mid w_{N - 1}),\\\\\\
\beta_{N + 1}[w_N] &= 1.\end{split}\]
(c)
\[\gamma[w_m, w_n] =
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\]
Notice that
\[\begin{split}Pr(w_m, w_n \mid \mathbf{x}_{1 \ldots N})
&= Pr(w_m \mid \mathbf{x}_{1 \ldots N})
Pr(w_n \mid w_m, \mathbf{x}_{1 \ldots N})\\
&= Pr(w_m \mid \mathbf{x}_{1 \ldots N})
Pr(w_n \mid w_m, \mathbf{x}_{(m + 1) \ldots N})\end{split}\]
is reduced to \(Pr(w_m \mid \mathbf{x}_{1 \ldots N})\) when
\(m = n\) i.e. \(\gamma = 1\).
When \(n = m + 1\), the expression reduces to
\(\gamma = Pr(w_n \mid w_m)\) because \(w_n\) and \(w_m\) are not
included in the marginalization.
The derivation of the last case when \(n > m + 1\) is similar to (a):
\[\begin{split}& \sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 1} \mid w_m) \cdots
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_{n - 1} \mid w_{n - 2}) Pr(w_n \mid w_{n - 1})\\
&= \sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1})
\sum_{w_{n - 2}} Pr(\mathbf{x}_{n - 2} \mid w_{n - 2})
Pr(w_{n - 1} \mid w_{n - 2}) \cdots
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 2} \mid w_{m + 1}) Pr(w_{m + 1} \mid w_m)\\
&= \gamma_{n - 1}[w_m, w_n]\end{split}\]
where
\[\gamma_{n - 1}[w_m, w_n] =
\sum_{w_{n - 1}} Pr(\mathbf{x}_{n - 1} \mid w_{n - 1})
Pr(w_n \mid w_{n - 1}) \gamma_{n - 2}[w_m, w_{n - 1}]\]
and
\[\gamma_{m + 1}[w_m, w_{m + 2}] =
\sum_{w_{m + 1}} Pr(\mathbf{x}_{m + 1} \mid w_{m + 1})
Pr(w_{m + 2} \mid w_{m + 1}) Pr(w_{m + 1} \mid w_m),\]
so the quantity in the main expression is
\(\gamma[w_m, w_n] = \gamma_{n - 1}[w_m, w_n]\).
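Building on the (hypothetical) array conventions from the Exercise 11.6
sketch, the pairwise posterior is assembled from \(\alpha\), \(\gamma\), and
\(\beta\); `gamma` starts as the transition table (the \(n = m + 1\) case) and
absorbs one emission and one transition per intermediate variable:

```python
import numpy as np

def pairwise_marginal(prior, A, lik, m, n):
    """Posterior Pr(w_m, w_n | x_{1..N}) as a K x K table, with 1 <= m < n <= N."""
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = prior
    for t in range(1, N):                 # forward pass, as in Exercise 11.6
        alpha[t] = (alpha[t - 1] * lik[t - 1]) @ A
    beta = np.ones((N, K))
    for t in range(N - 2, -1, -1):        # backward pass, as in Exercise 11.6
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    gamma = A.copy()                      # gamma = Pr(w_{m+1} | w_m), the n = m + 1 case
    for t in range(m, n - 1):             # absorb x_{m+1}, ..., x_{n-1} and their transitions
        gamma = (gamma * lik[t]) @ A
    post = np.outer(lik[m - 1] * alpha[m - 1], lik[n - 1] * beta[n - 1]) * gamma
    return post / post.sum()
```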
Exercise 11.8
One can conclude that factor graphs reveal which cliques of an undirected
graphical model the potential functions are defined over: both factorizations
below induce the same undirected graph, but their factor graphs differ.
(1)
\[Pr(x_1, x_2, x_3) =
\frac{1}{Z_1} \phi_{12}[x_1, x_2] \phi_{23}[x_2, x_3] \phi_{31}[x_3, x_1]\]
(1.a)
\[\begin{split}(x_1) \leftrightarrow (x_2)\\
(x_2) \leftrightarrow (x_3)\\
(x_3) \leftrightarrow (x_1)\end{split}\]
(1.b)
\[\begin{split}(x_1) -[A]- (x_2)\\
(x_2) -[B]- (x_3)\\
(x_3) -[C]- (x_1)\end{split}\]
(2)
\[Pr(x_1, x_2, x_3) =
\frac{1}{Z_2} \phi_{123}[x_1, x_2, x_3]\]
(2.a)
\[(x_1) \leftrightarrow (x_2) \leftrightarrow (x_3)\]
(2.b)
\[\begin{split}(x_1) -[A]-\\
(x_2) -[A]-\\
(x_3) -[A]-\end{split}\]
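A small sketch that makes this concrete: both factorizations induce the same
undirected graph (the triangle), yet their factor graphs differ, so only the
factor graph records whether pairwise or three-way potentials are used. The
dictionary representation below is an assumption of this sketch.

```python
from itertools import combinations

# the two factorizations above, written as factor -> variables it touches
pairwise = {"A": ("x1", "x2"), "B": ("x2", "x3"), "C": ("x3", "x1")}
triple = {"A": ("x1", "x2", "x3")}

def undirected_edges(factors):
    """Edges of the undirected graphical model implied by a factorization."""
    return {frozenset(pair) for scope in factors.values()
            for pair in combinations(scope, 2)}

# identical undirected graphs, different factor graphs
assert undirected_edges(pairwise) == undirected_edges(triple)
assert pairwise != triple
```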
Exercise 11.9
[Bro] was very useful in answering how to define the factors of a
PDF for undirected graphical models. In short, anything goes as long as the set
of potential functions \(\phi\) for a particular factorization accounts for
the structure of the graph (e.g. edges being shared between cliques).
According to [KFL01], factors that depend on a single variable
are allowed. This makes sense because the independence of the nodes would
otherwise be lost.
(a) and (d) are the only ones that do not have a cycle i.e. their factor
graphs are trees.
(a)
\[Pr(w_{1 \ldots 6}) =
Pr(w_2) Pr(w_6) Pr(w_1 | w_2) Pr(w_5 | w_2, w_6) Pr(w_4 | w_5) Pr(w_3 | w_4)\]
\[\begin{split}(w_2) -[A]\\
(w_6) -[B]\\
(w_1) -[C]- (w_2)\\
(w_5) -[D]-\\
(w_2) -[D]-\\
(w_6) -[D]-\\
(w_4) -[E]- (w_5)\\
(w_3) -[F]- (w_4)\end{split}\]
(b)
\[Pr(w_{1 \ldots 6}) =
\frac{1}{Z} \phi_1[w_1, w_2, w_4] \phi_2[w_1, w_3, w_4]
\phi_3[w_2, w_4, w_5] \phi_4[w_5, w_6]\]
\[\begin{split}(w_1) -[A]-\\
(w_2) -[A]-\\
(w_4) -[A]-\\
(w_1) -[B]-\\
(w_3) -[B]-\\
(w_4) -[B]-\\
(w_2) -[C]-\\
(w_4) -[C]-\\
(w_5) -[C]-\\
(w_5) -[D]- (w_6)\end{split}\]
(c)
\[Pr(w_{1 \ldots 6}) =
Pr(w_2) Pr(w_6) Pr(w_1 | w_2) Pr(w_5 | w_2, w_6)
Pr(w_4 | w_1, w_5) Pr(w_3 | w_4)\]
\[\begin{split}(w_2) -[A]-\\
(w_6) -[B]-\\
(w_1) -[C]- (w_2)\\
(w_5) -[D]-\\
(w_2) -[D]-\\
(w_6) -[D]-\\
(w_4) -[E]-\\
(w_1) -[E]-\\
(w_5) -[E]-\\
(w_3) -[F]- (w_4)\end{split}\]
(d)
\[Pr(w_{1 \ldots 6}) =
\frac{1}{Z} \phi_1[w_1, w_3, w_4] \phi_2[w_2, w_4, w_5] \phi_3[w_5, w_6]\]
\[\begin{split}(w_1) -[A]-\\
(w_3) -[A]-\\
(w_4) -[A]-\\
(w_2) -[B]-\\
(w_4) -[B]-\\
(w_5) -[B]-\\
(w_5) -[C]- (w_6)\end{split}\]
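The cycle claims can be checked mechanically: a factor graph is a tree exactly
when it is connected and has one fewer edge than it has nodes (variables plus
factors). A small sketch, with the factorizations written as plain
dictionaries (an assumption of this sketch):

```python
from collections import deque

def is_tree(factors):
    """True if the bipartite factor graph is connected and acyclic.

    factors -- dict mapping a factor name to the tuple of variables it touches
    """
    adj = {f: set(scope) for f, scope in factors.items()}
    for f, scope in factors.items():
        for v in scope:
            adj.setdefault(v, set()).add(f)
    n_nodes = len(adj)
    n_edges = sum(len(scope) for scope in factors.values())
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:                                  # breadth-first search for connectivity
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == n_nodes and n_edges == n_nodes - 1

fg_a = {"A": ("w2",), "B": ("w6",), "C": ("w1", "w2"),
        "D": ("w5", "w2", "w6"), "E": ("w4", "w5"), "F": ("w3", "w4")}
fg_c = dict(fg_a, E=("w4", "w1", "w5"))           # (c) adds w_1 to the factor for w_4
print(is_tree(fg_a), is_tree(fg_c))               # True False
```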
Exercise 11.10
Notice that each variable \(w_i\) is connected to \(w_{i + 1}\) and
\(w_{i + 2}\). To reduce this to a chain model in which each variable depends
on a single predecessor, define \(w'_j = \{ w_{2j - 1}, w_{2j} \}\) for
\(j = 1, 2, \ldots, N / 2\) (assuming \(N\) is even).
According to Exercise 11.2, an HMM with
\(N\) variables, each taking \(K\) values, has a time complexity of
\(\mathcal{O}\left( N K^2 \right)\) using the Viterbi algorithm.
The combined chain model has \(N / 2\) variables, each taking
\(K^2\) values, hence the overall time complexity is
\(\mathcal{O}\left( (N / 2) (K^2)^2 \right) = \mathcal{O}\left( N K^4 \right)\).
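A minimal sketch of the merging step, assuming \(N\) is even and that the
second-order transition probabilities are stored in a \(K \times K \times K\)
array (the name `trans` and its index convention are assumptions of this
sketch):

```python
import numpy as np

def merge_pairs(trans):
    """Build the transition table over super-variables w'_j = (w_{2j-1}, w_{2j}).

    trans[a, b, c] -- Pr(w_n = c | w_{n-2} = a, w_{n-1} = b)
    returns a (K^2, K^2) table indexed by (previous pair, next pair)
    """
    K = trans.shape[0]
    merged = np.zeros((K * K, K * K))
    for a in range(K):                  # w_{2j-3}
        for b in range(K):              # w_{2j-2}
            for c in range(K):          # w_{2j-1}
                for d in range(K):      # w_{2j}
                    merged[a * K + b, c * K + d] = trans[a, b, c] * trans[b, c, d]
    return merged
```

Running the standard Viterbi recursion on this \(K^2\)-state chain costs
\(\mathcal{O}\left( (N / 2) (K^2)^2 \right) = \mathcal{O}\left( N K^4 \right)\),
matching the estimate above.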
Exercise 11.11
Instead of organizing each scanline into a chain model, an \(n \times n\)
grid model could be imposed over the pixels of each image.
Exercise 11.12
Run belief propagation on the factor graph [Wei00].
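For reference, a minimal sketch of sum-product loopy belief propagation for a
model with unary and pairwise potentials (the data structures are assumptions
of this sketch; on a graph with loops the messages need not converge, which is
exactly the issue analysed in [Wei00]):

```python
import numpy as np

def loopy_bp(unary, pairwise, iters=50):
    """Sum-product belief propagation with synchronous message updates.

    unary    -- dict {i: (K_i,) nonnegative potential over variable i}
    pairwise -- dict {(i, j): (K_i, K_j) potential}, one entry per edge, i < j
    """
    msgs = {}
    for i, j in pairwise:                         # one message per edge direction
        msgs[(i, j)] = np.ones(len(unary[j]))
        msgs[(j, i)] = np.ones(len(unary[i]))
    for _ in range(iters):
        new = {}
        for s, t in msgs:
            # product of the unary potential at s and all incoming messages except t's
            prod = unary[s].copy()
            for u, v in msgs:
                if v == s and u != t:
                    prod = prod * msgs[(u, v)]
            phi = pairwise[(s, t)] if (s, t) in pairwise else pairwise[(t, s)].T
            m = prod @ phi                        # marginalise out the sender s
            new[(s, t)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in unary:                               # belief = unary times all incoming messages
        b = unary[i].copy()
        for u, v in msgs:
            if v == i:
                b = b * msgs[(u, v)]
        beliefs[i] = b / b.sum()
    return beliefs
```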
References
- Bro
Matthew Brown. Probabilistic graphical models - undirected graphical models. http://webdocs.cs.ualberta.ca/~greiner/C-651/SLIDES/MB01_UndirectedModels1.pdf. Accessed on 2017-07-16.
- KFL01
Frank R. Kschischang, Brendan J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
- Wei00
Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.