This week I will start my second quarter as a PhD student at the University of Washington and I will be studying Graphical Models with Sewoong Oh over in the CS department, a class which I’ve been looking forward to for a number of reasons, chief among them that I like the opportunity to employ (seemingly disparate) pure math techniques to unsuspecting applied problem domains; in this case, solving probabilistic inference problems via graph theory.

This is the first of a series designed to introduce the concept of probabilistic graphical models, motivate their definitions, outline related algorithms, and (time-permitting) show their implementation in Haskell.

Disclaimer: this Part 1 introductory post is rooted in the excessively optimistic assumption that I will have time to write subsequent posts that expand from this introduction.

Prerequisites are in flux. Since I’ve only just started this class it is impossible to know exactly how much of it can be disseminated to a broader audience. At the very least, you will need

Mathematical maturity, at least enough to understand the definition of a graph
Probability at the undergraduate level.

Motivation

Let \(\mathcal{X}\) be some alphabet¹ and consider a probability distribution over the sample space \(\mathcal{X}^n\) with probability mass function \(\mu\). That is, letting \((X_1,\ldots,X_n)\) denote a random vector over this distribution,

\[\mu(x_1, \ldots, x_n) = \mathbb{P}(X_1 = x_1, \ldots, X_n = x_n).\]

In this fully general setting where the \(X_i\)’s can be arbitrarily interdependent, notice that \(\mu\) must be specified by \(O(|\mathcal{X}|^n)\) values. However, if the \(X_i\)’s are independent, there is a tremendous simplification as this distribution factors into \(\mathbb{P}(X_1 = x_1) \cdots \mathbb{P}(X_n = x_n)\), and in this case, \(\mu\) is specified by only \(O(n|\mathcal{X}|)\) values. By exploiting this admittedly best-case independence scenario, our computations regarding \(\mu\) reduce from exponential to linear in the dimension \(n\).

This is the main idea behind our graphical model representations. While it is rare for the \(X_i\) to be completely independent, it is often the case that some of the \(X_i\) are conditionally independent (or at least, the assumption of such conditional independence is a reasonable approximation), which allows a reduction in complexity and thus more efficient and feasible algorithms.

Example

Let’s take a small detour to look an illustrative example. Lately in U.S. domestic politics, partisanship and party alliance seem to increasingly determine an individual’s views on a given issue. This observation suggests the approximation that issues are conditionally independent given the party alliance is known. So suppose we pick an American² at random and consider the probability space induced by three random variables Party \((P)\), Gun Control \((G)\), and Health Care \((H)\). For the sake of simplicity we will treat these as binary random variables, i.e.

\[ Val(P) = \{p^0, p^1\} \quad Val(G) = \{g^0, g^1\} \quad Val(H) = \{h^0, h^1\} \]

where

\begin{align} p^0 &:= \text{republican} \\
p^1 &:= \text{democrat} \\
g^0 &:= \text{does not support stricter gun laws} \\
g^1 &:= \text{supports stricter gun laws} \\
h^0 &:= \text{does not support Medicare for all} \\
h^1 &:= \text{supports Medicare for all} \end{align}

As discussed above, a fully general probablistic model allows for arbitrary \(\mathbb{P}(P,G,H)\) probabilities for the \(2^n\) possible outcomes in this space.³ One such distribution is given in the table below.

\(P\)	\(G\)	\(H\)	\(\mathbb{P}(P, G, H)\)
\(p^0\)	\(g^0\)	\(h^0\)	\(0.248\)
\(p^0\)	\(g^0\)	\(h^1\)	\(0.097\)
\(p^0\)	\(g^1\)	\(h^0\)	\(0.112\)
\(p^0\)	\(g^1\)	\(h^1\)	\(0.043\)
\(p^1\)	\(g^0\)	\(h^0\)	\(0.016\)
\(p^1\)	\(g^0\)	\(h^1\)	\(0.054\)
\(p^1\)	\(g^1\)	\(h^0\)	\(0.099\)
\(p^1\)	\(g^1\)	\(h^1\)	\(0.331\)

While the specific joint probabilities are contrived, the marginal probabilities per party are based on real polling⁴, so these values should fit roughly with our intuition.

Let’s explore the dependence among these variables, first computing the probability that an individual favors Medicare for all: \begin{align} \mathbb{P}(H = h^1) &= \sum_{p, g} \mathbb{P}(H = h^1, P = p, G = g) \\
&= 0.097 + 0.043 + 0.054 + 0.331 \\
&= 0.525 \end{align} Marginally it is more likely that an individual is in favor of Medicare for all. However, suppose we observe that this individual does not support stricter gun laws. How does this change the probability of the former issue? \begin{align} \mathbb{P}(H = h^1 | G = g^0) &= \frac{\mathbb{P}(H = h^1, G = g^0)}{\mathbb{P}(G = g^0)} \\
&= \frac{\sum_{p}\mathbb{P}(G = g^0, H = h^1, P=p)} {\sum_{p,h}\mathbb{P}(G = g^0, P=p, H=h)} \\
&= \frac{0.097 + 0.054}{0.248+0.097+0.016+0.054} \\
&= 0.364 \end{align} The conclusion to draw from this calculation is that \(H\) and \(G\) are quite dependent. Recall that independence holds if and only if observing \(G\) does not change our beliefs about \(H\); in this case we see that observing \(G = g^0\) rather drastically updates our beliefs, and we now find it very unlikely that \(H = h^1\). So we’re not at all justified in an assumption of independence.

But how exactly does the observation \(G\) influence \(H\)? Recall our discussion above about that pesky p-word: partisanship. Let’s run through the same calculations, but this time, suppose we know the individual is a Democrat beforehand.

\begin{align} \mathbb{P}(H = h^1 | P = p^1) &= \frac{\mathbb{P}(H = h^1, P = p^1)} {\mathbb{P}(P = p^1)} \\
&= \frac{0.054 + 0.331} {0.500} \\
&= 0.77 \end{align}

Again, suppose we also observe \(g^0\): \begin{align} \mathbb{P}(H = h^1 | G = g^0, P = p^1) &= \frac{\mathbb{P}(H = h^1, G = g^0, P = p^1)} {\mathbb{P}(G = g^0, P=p^1)} \\
&= \frac{0.054}{0.016 + 0.054} \\
&= 0.77 \end{align}

So if we already know the individual’s party affiliation \(P\), then \(G\) and \(H\) are independent, which we denote by \( G \perp H \mid P \). Thus when observing \(G\) changed our beliefs about \(H\) in the first case, it did so solely by changing our beliefs about \(P\): since it is so unlikely that a Democrat is against stricter gun laws, we think it is more likely that such an individual is a Republican, which then lowers the probability that they support Medicare for all. But interestingly, when the party is already known, there is no other trail of influence for observing \(G\) to change our beliefs about \(H\).

To wrap up this example, we exploit the complexity reduction mentioned above. By definition \(G \perp H \mid P\) is equivalent to \(\mathbb{P}(G, H \mid P) = \mathbb{P}(G \mid P) \mathbb{P}(H \mid P)\) Thus we can write

\begin{align} \mathbb{P}(P, G, H) &= \mathbb{P}(P)\,\mathbb{P}\,(G, H \mid P) \\
&= \mathbb{P}(P) \,\mathbb{P}(G \mid P) \,\mathbb{P}(H \mid P) \\
&\approx O(2^{n-1}). \\
\end{align}

Bayesian Networks

We say that the distribution with mass function \(\mu(x_1,\ldots,x_n)\) is a Bayesian Network (BN) if there exists a directed acyclic graph G with nodes \(\{1,\ldots,n\}\) such that

\[\mu(x_1, \ldots, x_n) = \prod_{k = 1}^{n} \mu(x_k \mid x_{\pi(k)})\]

where \(\pi(k)\) is the set of parents of node \(k\) in G. As implied by the definition, BNs will allow us to encode certain conditional independencies in a distribution in the form of a graph.

First, note that in the general and arbitrarily dependent case, by repeatedly applying the defintion of conditional probability, one can derive⁵

\[\mu(x_1, \ldots, x_n) = \prod_{k = 1}^{n} \mu(x_k \mid x_1,\ldots,x_{k-1}).\]

which would be represented by the DAG below.

Since these arcs represent a flow of influence and dependence among variables, our desire for independence structures is equivalent to a desire for sparser graphs.

In the example above we showed that \(\mu\) factors into

\[\mu(x_p, x_g, x_h) = \mu(x_p)\,\mu(x_g \mid x_p) \,\mu(x_h \mid x_p)\]

which can be represented by the graph below.

It takes some familiarity to read independencies implied by the graph structure.⁶ In this case, each node can influence the others, as expected. However, since \(P\) and \(H\) are only connected via their single parent node \(P\), we can infer directly from the graph that they are conditionally independent given \(P\). In other words, denoting the trail of influence by red arrows, and observed variables by blue nodes, we discovered the following:

that is, \(G\) can influence \(H\) via \(P\), but once \(P\) is observed this trail of influence is blocked, because observing \(G\) will clearly not update our beliefs about \(P\) if we already have observed \(P\).

In general, given two edges between three nodes, the following trails of influence are possible:

Conclusion

We’ve just scratched the surface on the topic of graphical models, but I hope this is enough to at least motivate why we’re interested in such graphs. I’d like to mention that BNs are just one type of graphical model, but equally popular are Markov Random Fields (on undirected graphs) and Factor Graphs (on undirected bipartite graphs), among others. These different types of induced graphs can encode different types of conditional independencies, and thus have their own unique advantages in applications.

In my next post, assuming I find the time to write it, I’d like to explore a specific Bayesian Network called Hidden Markov Model, and show how exploiting the graph structure leads to an efficient message-passing algorithm called belief propagation for finding marginal probabilities.

Footnotes

Typically a finite set, e.g. \(\{0,1\}\). ↩
Actually first remove the Independents and other parties, then pick a Democrat or Republican party member at random. (Sorry, it’s just easier to deal with a binary random variable!) ↩
Technically, because these probabilities must sum to one, the distribution is fully specified by the first \(2^n - 1\) values. ↩
Opinions on gun control and health care, conditional by party, are from polling by Pew Research Center and Kaiser Family Foundation, respectively. ↩
See wikipedia for a derivation. ↩
I’d encourage anyone interested to refer to Daphne Koller’s coursera course and associated textbook. ↩