%%%%%%%%%%%%%%%%%%%%
%I) PRELIMINARIES
%%%%%%%%%%%%%%%%%%%%
\documentclass[11pt,reqno]{amsart}
% Packages
%%%%%%%%%%%%%%%%%%
\usepackage{graphicx}%preferred package for inclusion of graphics
\usepackage{comment}
\usepackage{setspace}
\usepackage{enumerate}%for easy choice of enumerator symbol
\usepackage{enumitem}
\usepackage{tabularx}%for tables with user-defined width
\usepackage{ctable} %for toprule, midrule etc. in tables
\usepackage{multirow} %for more flexibility with tables
\usepackage{textcomp}%for cent-symbol
\usepackage[colorlinks=true, urlcolor=blue]{hyperref}%for inserting hyperlinks
\usepackage{caption} %flexibility with tables
\usepackage{subcaption} %flexibility with tables
\usepackage{url}%to get harvard to work, insert immediately before \title{...}
\usepackage[dcucite]{harvard} %bibliography style, dcu gives commas before year, semicolon between references, and "and" between authors
\usepackage{amssymb} %for the more esoteric math expressions, such as \approxeq
\usepackage{lineno}%for line numbers
\usepackage{lscape}%for inserting landscape format pages
\usepackage{float} % for more flexibility with tables
\usepackage{appendix}%allows for turning appendices on and off
\usepackage{epstopdf} %to allow import of .eps graphics
% Page formatting
%%%%%%%%%%%%%%%%%%%%
\pagestyle{plain} %puts page number center bottom
\setlength{\topmargin}{0in}
\setlength{\textheight}{8.5in}
\setlength{\oddsidemargin}{.0in}
\setlength{\evensidemargin}{.0in}
\setlength{\textwidth}{6.5in}
\setlength{\footskip}{.5in}
% Customized commands
%%%%%%%%%%%%%%%%%%%%%
% Math
\newcommand{\mlt}[1]{\mathbf{#1}} %matrix bold for Latin symbols
\newcommand{\mgr}[1]{\boldsymbol{#1}}%matrix bold for Greek symbols
\newcommand{\kl}{\left(}
\newcommand{\kr}{\right)}
\newcommand{\kll}{\left\{}
\newcommand{\krr}{\right\}}
\newcommand{\kmu}{\mgr{\mu}}
\newcommand{\kpsi}{\mgr{\psi}}
\newcommand{\kphi}{\mgr{\phi}}
\newcommand{\kpi}{\mgr{\pi}}
\newcommand{\kgam}{\mgr{\gamma}}
\newcommand{\ktheta}{\mgr{\theta}}
\newcommand{\kalpha}{\mgr{\alpha}}
\newcommand{\kbeta}{\mgr{\beta}}
\newcommand{\kdelta}{\mgr{\delta}}
\newcommand{\kt}{^{\prime}}
\newcommand{\kdel}{\partial}
\newcommand{\kdot}{\kl . \kr}
\newcommand{\keps}{\epsilon}
\newcommand{\kx}{\mlt{x}}
\newcommand{\kX}{\mlt{X}}
\newcommand{\kV}{\mlt{V}}
\newcommand{\ky}{\mlt{y}}
\newcommand{\ka}{\mlt{a}}
\newcommand{\kb}{\mlt{b}}
\newcommand{\ki}{\mlt{i}}
\newcommand{\klam}{\lambda}
\newcommand{\kp}{\mlt{p}}
\newcommand{\kprob}{\text{prob}}
\newcommand{\kz}{\mlt{z}}
\newcommand{\ksig}{\sigma^2}
\newcommand{\kSig}{\mgr{\Sigma}}
\newcommand{\klog}{\text{log}}
\newcommand{\kols}{\kl \kX\kt\kX\kr^{-1}\kX\kt\ky}
\newcommand{\kSSE}{\kl \ky-\kX\kb\kr\kt\kl\ky-\kX\kb\kr}
%Special font within regular document
\newcommand{\chp}[1]{\textbf{\textsl{#1}}}
\newcommand{\km}[1]{\textsf{\small{#1}}} %special font for my own comments
\newcommand{\mlab}{\textbf{\texttt{Matlab }}}
\newcommand{\ktt}[1]{\textbf{\texttt{#1}}}
%Tables
\newcolumntype{C}{>{\centering\arraybackslash}X} %for centered columns within tabularx,instead of justified (the default)
%Others
\newcommand{\kpm}{PM_{2.5}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%
\citationmode{abbr} %use only "et al" citations
% the following add space before and after equations
\abovedisplayskip=0.7cm
\abovedisplayshortskip=-0.3cm
\belowdisplayskip=0.7cm
\belowdisplayshortskip=0.4cm
%III) TOP MATTER INFORMATION
\renewcommand{\harvardurl}{URL: \url}%to get harvard to work, insert immediately before \title{...}
\title{Finite Mixture Models}
\author{AAEC 6564 \\ Instructor: Klaus Moeltner}
\maketitle %this comes at the end of the top matter to set it.
\begin{flushleft}
\begin{tabbing}
\hspace{1.2in}\= \\\\%sets the tab stop
Textbooks:\> \citeasnoun{koop2003}, Ch.3; \citeasnoun{koopetal2007}, Ch.15\\
\mlab scripts:\> \texttt{mod10\_2CMM\_data, mod10\_2CMMln, mod10\_2CMMchi2}, \\
\> \texttt{mod10\_2CMMmix, mod10\_2CMMplots, mod10\_2CMM\_rearrange},\\
\> \texttt{mod10\_fmrm\_data, mod10\_fmrm, mod10\_fmrm\_data\_v2, mod10\_fmrm\_v2} \\
\mlab functions:\> \texttt{gs\_2cmm, gs\_fmrm}\\
\end{tabbing}
\end{flushleft}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Model Outline}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This setup is similar to the regime switching model (RSM, aka ``Roy model'') discussed previously, in that each observation may follow one of $G$ different regression models. Each regression model (``regime'') has its own set of coefficients (and potentially its own set of explanatory variables) and its own error variance. However, in contrast to the RSM, regime assignment is not observed and must be estimated probabilistically. That is, in addition to estimating $G$ sets of coefficients and variances, we need to estimate $G$ regime probabilities.\\
Let these regime probabilities be denoted as $\pi_g,\medspace g=1\ldots G$, with $\sum_{g=1}^G \pi_g=1$. We will collect these probabilities in vector $\kpi = \begin{bmatrix} \pi_1 & \pi_2 & \ldots & \pi_G \end{bmatrix}\kt$.\\
The likelihood function for observation $i$, \emph{unconditional on regime membership} can then be expressed as a mixture-of-normals, that is:
%
\begin{equation}
\label{lhfi}
\begin{split}
& p\kl y_i | \kx_i, \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kpi\kr = \sum_{g=1}^G \pi_g \kl \kl 2\pi\kr^{-1/2} \kl \ksig_g \kr ^{-1/2} exp\kl -\tfrac{1}{2 \ksig_g} \kl y_i - \kx_i\kt\kbeta_g\kr^2 \kr \kr
\quad \text{with}\\
& \pi_g \in [0,1] \medspace \forall g, \quad \sum_{g=1}^G \pi_g = 1
\end{split}
\end{equation}
%
The product of this over all $i$ produces the likelihood for the entire sample. Computation is facilitated by augmenting the model with \emph{component indicator vector} $\kz_i$ for each $i$, given as
%
\begin{equation}
\label{zi}
\kz_i = \begin{bmatrix} z_{1i} & z_{2i} & \ldots & z_{Gi} \end{bmatrix}\kt,
\end{equation}
%
where $z_{gi}=1$ if $i$ is in the $g^{th}$ regime, and zero otherwise. We can then express the kernel of the \emph{augmented}, that is \emph{membership-conditioned} likelihood compactly as
%
\begin{equation}
\label{auglhf}
\begin{split}
&p\kl \ky | \kX, \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kll \kz_i \krr_{i=1}^n \kr \propto \\
&\prod_{i=1}^n \left[ \sum_{g=1}^G \kll \kl\ksig_g \kr ^{-1/2} exp\kl -\tfrac{1}{2 \ksig_g} \kl y_i - \kx_i\kt\kbeta_g\kr^2 \kr \krr I\kl z_{gi}=1\kr \right]
\end{split}
\end{equation}
%
where $I\kdot$ is the usual indicator function that takes a value of one if the condition it describes holds, and a value of zero otherwise.\\
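As a computational aside, the unconditional mixture likelihood in (\ref{lhfi}) is straightforward to evaluate directly. Here is a minimal sketch (in Python rather than the course's Matlab, with hypothetical names and array shapes):

```python
import numpy as np

def mixture_loglike(y, X, betas, sig2s, pis):
    """Sample log-likelihood of a G-component normal mixture regression.

    y: (n,) responses; X: (n, k) regressors; betas: (G, k) regime
    coefficients; sig2s: (G,) regime variances; pis: (G,) regime
    probabilities summing to one.
    """
    resid = y[:, None] - X @ betas.T                                 # (n, G)
    # n-by-G matrix of weighted component densities pi_g * phi(y_i; x_i'b_g, s2_g)
    dens = pis * (2 * np.pi * sig2s) ** -0.5 * np.exp(-0.5 * resid**2 / sig2s)
    # log of the G-component sum for each i, summed over all observations
    return np.sum(np.log(dens.sum(axis=1)))
```

With $G=1$ and $\pi_1=1$ this collapses to the ordinary normal regression log-likelihood, which provides a quick sanity check.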
The hierarchical prior of $\kz_i$, given $\kpi$, is modeled as a multinomial distribution with $T=1$ trial, that is $\kz_i \sim mn\kl 1,\kpi\kr$, explicitly given as:
%
\begin{equation}
\label{MNprior}
\begin{split}
&p\kl \kz_i | \kpi \kr = \kl \frac{T!}{z_{1i}! z_{2i}! \ldots z_{Gi}!} \kr \pi_1^{z_{1i}} \pi_2^{z_{2i}} \ldots \pi_G^{z_{Gi}} = \prod_{g=1}^G \pi_g^{z_{gi}} \quad \text{with}\\
&\sum_{g=1}^G z_{gi}=T=1
\end{split}
\end{equation}
%
Regime probability vector $\kpi$, in turn, receives a Dirichlet prior with parameter vector $\kalpha$, that is $\kpi \sim D\kl \kalpha \kr$, explicitly given as:
%
\begin{equation}
\label{Dirprior}
\begin{split}
&p\kl \kpi \kr = \kl \frac{\Gamma \kl \sum_{g=1}^G \alpha_g \kr}{\prod_{g=1}^G \Gamma\kl \alpha_g \kr}\kr \prod_{g=1}^G \pi_g^{\alpha_g - 1}, \quad \text{where}\\
&\alpha_g >0, \medspace \forall g
\end{split}
\end{equation}
%
Note that for $G=2$ the Dirichlet reduces to the univariate Beta distribution $B\kl\alpha_1,\alpha_2\kr$, with $\pi \equiv \pi_1$ and $1-\pi = \pi_2$:
%
\begin{equation}
\label{Betaprior}
\begin{split}
&p\kl \pi \kr = \kl \frac{\Gamma \kl \alpha_1 + \alpha_2 \kr}{\Gamma\kl \alpha_1 \kr\Gamma\kl \alpha_2 \kr}\kr \pi^{\alpha_1 - 1}\kl 1-\pi\kr^{\alpha_2-1}, \quad \text{where}\\
&\alpha_g >0, \medspace g=1,2
\end{split}
\end{equation}
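This equivalence is easy to verify numerically. The following Python sketch (an illustration only, not part of the course's Matlab code) compares the first element of two-component Dirichlet draws against Beta draws with the same parameters:

```python
import numpy as np

# With G = 2, Dirichlet(a1, a2) and Beta(a1, a2) describe the same
# distribution for pi_1; compare sample means of the two built-in samplers.
rng = np.random.default_rng(42)
a1, a2 = 3.0, 5.0
dir_draws = rng.dirichlet([a1, a2], size=50_000)[:, 0]
beta_draws = rng.beta(a1, a2, size=50_000)
# both sample means should be close to the common mean a1 / (a1 + a2) = 0.375
```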
%
Combining all these parts, plus the usual multivariate normal prior for the $\kbeta_g$'s with (for simplicity) common prior mean $\kmu_0$ and common prior variance-covariance matrix $\kV_0$, and the standard inverse-gamma prior for the $\ksig_g$'s with (for simplicity) common shape $\nu_0$ and common scale $\tau_0$ leads to the following augmented posterior kernel:
%
\begin{equation}
\label{AugPost}
\begin{split}
&p\kl \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kpi, \kll \kz_i \krr_{i=1}^n | \ky, \kX \kr \propto \\
&\prod_{g=1}^G exp\kl -\tfrac{1}{2} \kl \kbeta_g - \kmu_{0}\kr \kt \kV_0^{-1} \kl \kbeta_g - \kmu_{0}\kr\kr * \\
&\prod_{g=1}^G \kl\ksig_g\kr^{-\kl\nu_0 + 1\kr} exp \kl -\frac{\tau_0}{\ksig_g} \kr *\\
&\prod_{g=1}^G \pi_g^{\alpha_g - 1} *\\
&\prod_{i=1}^n \kl\prod_{g=1}^G \pi_g^{z_{gi}}\kr *\\
&\prod_{i=1}^n \kl \sum_{g=1}^G \kll \kl\ksig_g \kr ^{-1/2} exp\kl -\tfrac{1}{2 \ksig_g} \kl y_i - \kx_i\kt\kbeta_g\kr^2 \kr \krr I\kl z_{gi}=1\kr \kr
\end{split}
\end{equation}
%
In each iteration of the Gibbs Sampler every observation $i$ will be assigned to one of the $G$ regimes. This determines the current sample size for regime $g$, say $n_g$, and the current rows of the data associated with regime $g$; let's call those $\ky_g$ and $\kX_g$, respectively. Then, for each regime $g$ the corresponding vector of regression coefficients, $\kbeta_g$, can be drawn as usual, that is:
%
\begin{equation}
\label{PostBeta}
\begin{split}
&\kbeta_g | \ksig_g, \ky_g, \kX_g \sim n\kl \kmu_{g1}, \kV_{g1} \kr,\quad \text{with}\\
&\kV_{g1} = \kl \kV_0^{-1} + \tfrac{1}{\ksig_g} \kX_g\kt\kX_g\kr^{-1},\quad \text{and}\\
&\kmu_{g1} = \kV_{g1}\kl \kV_0^{-1}\kmu_0 + \tfrac{1}{\ksig_g} \kX_g\kt\ky_g\kr
\end{split}
\end{equation}
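A minimal Python sketch of the updating step in (\ref{PostBeta}), with hypothetical names (the course's actual sampler lives in the Matlab functions \texttt{gs\_2cmm} and \texttt{gs\_fmrm}):

```python
import numpy as np

def draw_beta_g(y_g, X_g, sig2_g, mu0, V0inv, rng):
    """One Gibbs draw of beta_g from its conditional normal posterior.

    y_g, X_g: data currently assigned to regime g; mu0: prior mean;
    V0inv: prior precision (inverse of V0); sig2_g: current variance draw.
    """
    # posterior covariance V_g1 = (V0^-1 + X_g'X_g / sig2_g)^-1
    V_g1 = np.linalg.inv(V0inv + X_g.T @ X_g / sig2_g)
    # posterior mean mu_g1 = V_g1 (V0^-1 mu0 + X_g'y_g / sig2_g)
    mu_g1 = V_g1 @ (V0inv @ mu0 + X_g.T @ y_g / sig2_g)
    return rng.multivariate_normal(mu_g1, V_g1)
```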
%
Note that, implicitly, $\kbeta_g$ is drawn conditional on $\kll \kz_i \krr_{i=1}^n$, since the latter determines the contents of data $\ky_g$ and $\kX_g$. Similarly, for the regime-specific variances we obtain:
%
\begin{equation}
\label{PostSig}
\begin{split}
&\ksig_g | \kbeta_g, \ky_g, \kX_g \sim ig\kl \nu_{g1}, \tau_{g1} \kr,\quad \text{with}\\
&\nu_{g1} = \frac{2 \nu_0 + n_g}{2},\quad \text{and}\\
&\tau_{g1} = \tau_0 + \tfrac{1}{2} \kl\ky_g - \kX_g\kbeta_g\kr\kt\kl \ky_g - \kX_g\kbeta_g \kr
\end{split}
\end{equation}
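The corresponding variance step in (\ref{PostSig}) can be sketched in Python as follows (hypothetical names; the inverse-gamma draw is taken as the reciprocal of a gamma draw with rate $\tau_{g1}$):

```python
import numpy as np

def draw_sig2_g(y_g, X_g, beta_g, nu0, tau0, rng):
    """One Gibbs draw of sigma^2_g from its conditional inverse-gamma posterior."""
    n_g = y_g.shape[0]
    resid = y_g - X_g @ beta_g
    nu_g1 = (2.0 * nu0 + n_g) / 2.0           # posterior shape
    tau_g1 = tau0 + 0.5 * resid @ resid       # posterior scale
    # if x ~ Gamma(shape=nu_g1, rate=tau_g1), then 1/x ~ IG(nu_g1, tau_g1);
    # numpy's gamma is parameterized by scale = 1/rate
    return 1.0 / rng.gamma(nu_g1, 1.0 / tau_g1)
```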
%
The relevant posterior kernel for draws of regime probabilities $\kpi$ is given as
%
\begin{equation}
\label{postPiPrep}
p\kl \kpi| \kll \kz_i \krr_{i=1}^n \kr \propto
\prod_{g=1}^G \pi_g^{\alpha_g - 1} \prod_{i=1}^n \kl\prod_{g=1}^G \pi_g^{z_{gi}}\kr = \prod_{g=1}^G \pi_g^{\kl\alpha_g + n_g -1\kr}
\end{equation}
%
This points again to a Dirichlet distribution with parameters $\alpha_1 + n_1, \alpha_2+n_2, \ldots, \alpha_G+n_G$, i.e.
%
\begin{equation}
\label{postPi}
\kpi | \kll \kz_i \krr_{i=1}^n \sim D\kl \alpha_1 + n_1, \alpha_2+n_2, \ldots, \alpha_G+n_G \kr
\end{equation}
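The Dirichlet step in (\ref{postPi}) only needs the current regime counts $n_1,\ldots,n_G$. A Python sketch (hypothetical names; the $\kz_i$'s are stored as rows of a one-hot matrix):

```python
import numpy as np

def draw_pi(z, alpha, rng):
    """One Gibbs draw of regime probabilities pi from the Dirichlet posterior.

    z: (n, G) one-hot membership indicators; alpha: (G,) Dirichlet prior.
    """
    n_g = z.sum(axis=0)               # current regime counts n_1 .. n_G
    return rng.dirichlet(alpha + n_g)
```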
%
As a final step, we need to draw new assignment vectors $\kz_i$ for each observation at each step of the Gibbs Sampler. The relevant posterior kernel for a given $\kz_i$ is given by:
%
\begin{equation}
\label{postZi}
\begin{split}
&p\kl \kz_i | \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kpi, y_i, \kx_i \kr \propto \\
&\prod_{g=1}^G \pi_g^{z_{gi}} *\sum_{g=1}^G \kll \kl\ksig_g \kr ^{-1/2} exp\kl -\tfrac{1}{2 \ksig_g} \kl y_i - \kx_i\kt\kbeta_g\kr^2 \kr \krr I\kl z_{gi}=1\kr
\end{split}
\end{equation}
%
Bringing back in the normalizing constant for the normal density in the second line of (\ref{postZi}), and replacing the indicator function with an exponent, this can be equivalently written as
%
\begin{equation}
\label{postZi2}
\begin{split}
&p\kl \kz_i | \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kpi, y_i, \kx_i \kr \propto \\
&\prod_{g=1}^G \pi_g^{z_{gi}} *\phi\kl y_i, \kx_i\kt\kbeta_g, \ksig_g\kr^{z_{gi}} = \\
&\prod_{g=1}^G \kl \pi_g\phi\kl y_i, \kx_i\kt\kbeta_g, \ksig_g\kr\kr^{z_{gi}},
\end{split}
\end{equation}
%
where $\phi\kdot$ denotes the normal density. This can be recognized as the kernel of another \emph{multinomial} density with a single trial:
%
\begin{equation}
\label{postZi3}
\begin{split}
&\kz_i | \kll \kbeta_g, \ksig_g \krr_{g=1}^G, \kpi, y_i, \kx_i \sim \\
& mn \kl 1, \frac{\pi_1 \phi\kl 1\kr}{\sum_{g=1}^G\kl \pi_g\phi\kl g\kr\kr},
\frac{\pi_2 \phi\kl 2\kr}{\sum_{g=1}^G\kl \pi_g\phi\kl g\kr\kr},
\ldots,
\frac{\pi_G \phi\kl G\kr}{\sum_{g=1}^G\kl \pi_g\phi\kl g\kr\kr}\kr, \quad \text{where}\\
&\phi\kl g\kr = \phi\kl y_i, \kx_i\kt\kbeta_g, \ksig_g\kr
\end{split}
\end{equation}
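The membership step in (\ref{postZi3}) can be sketched in Python as follows (hypothetical names; each observation gets one single-trial multinomial draw with probabilities proportional to $\pi_g \phi\kl g \kr$):

```python
import numpy as np

def draw_z(y, X, betas, sig2s, pis, rng):
    """One Gibbs draw of all membership indicators z_1 .. z_n.

    Returns an (n, G) one-hot matrix: row i has a single 1 in the regime
    drawn with probability proportional to pi_g * phi(y_i; x_i'beta_g, sig2_g).
    """
    resid = y[:, None] - X @ betas.T                                  # (n, G)
    w = pis * (2 * np.pi * sig2s) ** -0.5 * np.exp(-0.5 * resid**2 / sig2s)
    w /= w.sum(axis=1, keepdims=True)   # normalized membership probabilities
    # one multinomial(1, w_i) draw per observation
    return np.array([rng.multinomial(1, wi) for wi in w])
```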
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Using mixtures-of-normals to approximate unknown densities}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As shown in exercise 15.5 in \citeasnoun{koopetal2007}, finite mixture models are quite robust to mis-specifications of the underlying likelihood. In their example, the true data-generating mechanism varies from a log-normal, to a chi-squared, to a two-regime (``component'') mixture-of-normals (2CMM). In each case, the same 2CMM is used to estimate the model. Comparing the posterior predictive distribution flowing from the model to the true underlying data generating density shows that the 2CMM does a good job recovering the shape of the data, even under mis-specification.\\
Matlab scripts \texttt{mod10\_2CMM...} and function \texttt{gs\_2cmm} replicate KPT's results.\\
Note: To draw $n$ 1 by $G$ vectors from the multinomial with $T$ trials and 1 by $G$ probability vector $\kp$, use Matlab's built-in function \texttt{mnrnd(T,$\kp$,n)}. If $\kp$ is already a $n$ by $G$ matrix, as in our case, use \texttt{mnrnd(T,$\kp$)}.\\
To obtain $n$ columns of draws from the Dirichlet with $G$ by 1 parameter vector $\ka$ use my own function \texttt{dirrnd($\ka$,n)}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{The label-switching issue}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As discussed in \citeasnoun{geweke2007}, the mixture model as outlined above is \emph{invariant to a permutation of the regime labels}: the likelihood function yields the same value irrespective of the order in which we add up the $G$ component terms $\pi_g \phi\kl y_i, \kx_i\kt\kbeta_g,\ksig_g\kr$. As a result, the posterior simulator (our Gibbs Sampler)
``doesn't care'' either which regime is labeled ``1,'' ``2,'' etc. So in one iteration the first set of coefficients (which is supposed to be from regime 1) may actually belong to (``true'') regime 1, while in the next iteration the first set might actually belong to (``true'') regime 2, and so on. The same dilemma holds for regime-specific variances.\\
In consequence, drawing inference from all ``set 1'' coefficients produced by the Gibbs Sampler may be meaningless, as it might be based on a mix of coefficients from all regimes. As mentioned in \citeasnoun{geweke2007}, if there is a natural ordering in the magnitude of (true) parameters across regimes, as in $\kbeta_1 < \kbeta_2$ for all elements of $\kbeta$, or $\ksig_1 < \ksig_2$, then this labeling dilemma vanishes asymptotically, but it may take a very large sample for this to work.\\
As you can see from the preceding exercise, for our large sample of 10,000 the algorithm de facto switched the regime means, variances, and probabilities compared to the way they were labeled in the true data-generating mechanism. To be specific, when we created the data, we let $\pi_1=0.6$ and $\pi_2=0.4$, but the Matlab log indicates the opposite. The same switching occurred for the regime means and variances. But at least \emph{within} each regime, we recover the correct mean and variance. Thus, for this example, the label switch occurred immediately, and there was little or no switching back (else our naive posterior means would be way off).\\
According to \citeasnoun{geweke2007} this entire regime-switching dilemma can be completely ignored if the final construct of interest is invariant to this switching, for example a posterior predictive outcome. In fact, when the mixture model is primarily used as a flexible approximation of an unknown density function, as in the example above, the regimes are by definition devoid of any substantive interpretation, so this label-switching can be ignored. This becomes apparent from our plotted predictive densities, which map out the true underlying sampling density just fine, regardless of regime labeling.\\
As an (important) aside, \citeasnoun{geweke2007} also notes that if the state labels have no substantive interpretation, the priors should also be permutation-invariant (which is trivially satisfied if all regimes receive the same prior, as is the case in the above example).\\
If labeling matters (for example, we hypothesize that the marginal effect of some variable should systematically differ across regimes), and if there is some pre-known natural ordering in one or more of the regime-specific parameters as illustrated above, \citeasnoun{geweke2007} shows that this can be simply addressed by running the Gibbs Sampler as usual (ignoring any labeling), and then re-labeling the parameter draws \emph{ex-post} to honor the inequality constraints. In our example, we stipulate that $\ksig_1=1$ and $\ksig_2=0.5$, so we could, for each draw of the Gibbs Sampler, re-assign all parameters (mean, variance, probability) that correspond to the smaller variance draw to the \emph{first} regime.\\
This is shown in script \texttt{mod10\_2CMMmix\_rearrange}.
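The same ex-post re-labeling idea can be sketched in Python (hypothetical array layout for the stored draws; within each retained draw, regimes are sorted by the variance draw and the same permutation is applied to all regime-specific parameters):

```python
import numpy as np

def relabel_draws(betas, sig2s, pis):
    """Ex-post re-labeling of stored Gibbs draws, regimes sorted by variance.

    betas: (R, G, k); sig2s: (R, G); pis: (R, G) arrays of R retained draws.
    Returns the same arrays with regimes reordered, per draw, so that the
    variance draws are increasing across regimes.
    """
    order = np.argsort(sig2s, axis=1)               # (R, G) permutation per draw
    rows = np.arange(sig2s.shape[0])[:, None]       # row index for fancy indexing
    return betas[rows, order], sig2s[rows, order], pis[rows, order]
```

Because the permutation is applied jointly to means, variances, and probabilities, each re-labeled draw still describes exactly the same mixture.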
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{A full mixtures-of-normals regression model}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The previous example was a simplified version of a mixture model compared to our theoretical derivation, as it only included a single mean and variance per regime. As a next step, we will allow the regime means to be linear functions of data and parameters, which corresponds more directly to the structural model above.\\
Matlab script \texttt{mod10\_fmrm\_data} generates simulated data for such a model, with three regimes. To facilitate regime assignments (and thus reduce the labeling problem) we allow the true coefficients in each regime to be increasing in magnitude over regimes, while keeping the regime variances small and equal. Script \texttt{mod10\_fmrm} implements the sampler. As you can see from the log file, the algorithm generally does a good job recovering the true parameters, though efficiency is less than desirable in several cases. This ``slow mixing'' problem is a well-known issue for finite mixture models \cite{sylvia2001,geweke2007}.\\
We can improve things by forcing the regime coefficients even further apart while keeping variances small, as implemented in \texttt{mod10\_fmrm\_data\_v2} and \texttt{mod10\_fmrm\_v2}. As is evident from the log file, this sampler has much better mixing properties than the original version. Naturally, in a real-world application we may not be able to make a priori assumptions on the relative magnitude of regime-specific parameters, and/or regime-specific parameters may be similar in magnitude. In such cases the label-switching problem may be a real barrier to inference for anything other than predictive constructs.
\bibliography{AAEC6564bib}
\bibliographystyle{dcu}
\end{document}