%%%%%%%%%%%%%%%%%%%%
%I) PRELIMINARIES
%%%%%%%%%%%%%%%%%%%%
\documentclass[11pt,reqno]{amsart}
% Packages
%%%%%%%%%%%%%%%%%%
\usepackage{graphicx}%preferred package for inclusion of graphics
\usepackage{comment}
\usepackage{setspace}
\usepackage{enumerate}%for easy choice of enumerator symbol
\usepackage{tabularx}%for tables width user-defined width
\usepackage{ctable} %for toprule, midrule etc. in tables
\usepackage{multirow} %for more flexibility with tables
\usepackage{textcomp}%for cent-symbol
\usepackage[colorlinks=true, urlcolor=blue]{hyperref}%for inserting hyperlinks
\usepackage{caption} %flexibility with tables
\usepackage{subcaption} %flexibility with tables
\usepackage{url}%to get harvard to work, insert immediately before \title{...}
\usepackage{html}%to get harvard to work, insert immediately before \usepackage{harvard}
\usepackage[dcucite]{harvard} %bibliography style, dcu gives commas before year, semicolon between references, and "and" between authors
\usepackage{amssymb} %for the more esoteric math expressions, such as \approxeq
\usepackage{lineno}%for line numbers
\usepackage{lscape}%for inserting landscape format pages
\usepackage{float} % for more flexibility with tables
\usepackage{appendix}%allows for turning appendices on and off
\usepackage{epstopdf} %to allow import of .eps graphics
% Page formatting
%%%%%%%%%%%%%%%%%%%%
\pagestyle{plain} %puts page number center bottom
\setlength{\topmargin}{0in}
\setlength{\textheight}{8.5in}
\setlength{\oddsidemargin}{.0in}
\setlength{\evensidemargin}{.0in}
\setlength{\textwidth}{6.5in}
\setlength{\footskip}{.5in}
% Customized commands
%%%%%%%%%%%%%%%%%%%%%
% Math
\newcommand{\mlt}[1]{\mathbf{#1}} %matrix bold for Latin symbols
\newcommand{\mgr}[1]{\boldsymbol{#1}}%matrix bold for Greek symbols
\newcommand{\kl}{\left(}
\newcommand{\kr}{\right)}
\newcommand{\kll}{\left\{}
\newcommand{\krr}{\right\}}
\newcommand{\kmu}{\mgr{\mu}}
\newcommand{\kpsi}{\mgr{\psi}}
\newcommand{\kphi}{\mgr{\phi}}
\newcommand{\kgam}{\mgr{\gamma}}
\newcommand{\ktheta}{\mgr{\theta}}
\newcommand{\kbeta}{\mgr{\beta}}
\newcommand{\kdelta}{\mgr{\delta}}
\newcommand{\kt}{^{\prime}}
\newcommand{\kdel}{\partial}
\newcommand{\kdot}{\kl . \kr}
\newcommand{\keps}{\epsilon}
\newcommand{\kx}{\mlt{x}}
\newcommand{\kX}{\mlt{X}}
\newcommand{\kZ}{\mlt{Z}}
\newcommand{\kV}{\mlt{V}}
\newcommand{\kM}{\mlt{M}}
\newcommand{\kI}{\mlt{I}}
\newcommand{\ky}{\mlt{y}}
\newcommand{\kb}{\mlt{b}}
\newcommand{\ki}{\mlt{i}}
\newcommand{\klam}{\lambda}
\newcommand{\kp}{\mlt{p}}
\newcommand{\kprob}{\text{prob}}
\newcommand{\kz}{\mlt{z}}
\newcommand{\ksig}{\sigma^2}
\newcommand{\kSig}{\mgr{\Sigma}}
\newcommand{\klog}{\text{log}}
\newcommand{\kols}{\kl \kX\kt\kX\kr^{-1}\kX\kt\ky}
\newcommand{\kSSE}{\kl \ky-\kX\kb\kr\kt\kl\ky-\kX\kb\kr}
\newcommand{\kzero}{\mlt{0}}
\newcommand{\kgX}{g\kl \kX\kt\kX\kr^{-1}}
%Special font within regular document
\newcommand{\chp}[1]{\textbf{\textsl{#1}}}
\newcommand{\km}[1]{\textsf{\small{#1}}} %special font for my own comments
\newcommand{\mlab}{\textbf{\texttt{Matlab }}}
\newcommand{\ktt}[1]{\textbf{\texttt{#1}}}
%Tables
\newcolumntype{C}{>{\centering\arraybackslash}X} %for centered columns within tabularx,instead of justified (the default)
%Others
\newcommand{\kpm}{PM_{2.5}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%
\citationmode{abbr} %use only "et al" citations
% the following add space before and after equations
\abovedisplayskip=0.7cm
\abovedisplayshortskip=-0.3cm
\belowdisplayskip=0.7cm
\belowdisplayshortskip=0.4cm
%III) TOP MATTER INFORMATION
\renewcommand{\harvardurl}{URL: \url}%to get harvard to work, insert immediately before \title{...}
\title{Bayesian Model Search and Model Averaging}
\author{AAEC 6984 \\ Instructor: Klaus Moeltner}
\maketitle %this comes at the end of the top matter to set it.
\begin{flushleft}
\begin{tabbing}
\hspace{1.8in}\= \\\\%sets the tab stop
Textbooks:\> \citeasnoun{koop2003}, Ch.11; \citeasnoun{koopetal2007}, Ch.16 \\
\mlab scripts:\> \texttt{mod8\_SSVS\_data, mod8\_SSVS, mod8\_SSVS\_modProbs}, \\
\> \texttt{mod8\_SSVS\_fishing\_data, mod8\_SSVS\_fishing, mod8\_SSVS\_fishing\_modProb}, \\
\> \texttt{mod8\_SSVS\_fishing\_BMA}\\
\> \texttt{mod8\_MC3,mod8\_MC3\_convTest, mod8\_MC3\_modProb},\\
\> \texttt{mod8\_MC3\_growth,mod8\_MC3\_growth\_convTest, mod8\_MC3\_growth\_modProb},\\
\mlab functions:\> \texttt{gs\_SSVS, gs\_SSVS\_fishing, graycode, gs\_MC3}\\
\end{tabbing}
\end{flushleft}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We have already seen that the Bayesian estimation framework allows for the pair-wise comparison of different models, even non-nested ones, via the computation of marginal likelihoods and Bayes Factors. \\
This module deals with situations where a whole array of candidate models must be considered, evaluated, and compared. In theory, one could estimate each model separately, compute the marginal likelihood for each case, and derive individual model probabilities as the ratio of each model's marginal likelihood (or of the product of marginal likelihood and model prior) to the sum of these quantities over all models.\\
Specifically, denoting $p\kl \ky |M_m\kr$ as the marginal likelihood for model $m$, $p\kl M_m \kr$ the model prior, and the complete \emph{model space} as $\mathcal{M}=\kll M_1,M_2,\ldots,M_M\krr$, individual model probabilities can be derived as
\begin{equation}
p\kl M_m|\ky \kr = \frac{p\kl \ky |M_m\kr p\kl M_m \kr}{\sum_{j=1}^M p\kl \ky |M_j\kr p\kl M_j \kr}
\end{equation}
%
If model priors are the same for all models, this further simplifies to
%
\begin{equation}
p\kl M_m|\ky \kr = \frac{p\kl \ky |M_m\kr}{\sum_{j=1}^M p\kl \ky |M_j\kr}
\end{equation}
%
A model-averaged posterior distribution for parameters $\ktheta$ can then be obtained via
%
\begin{equation}
p\kl \ktheta |\ky \kr = \sum_{m=1}^M p\kl \ktheta |\ky, M_m \kr p\kl M_m|\ky \kr
\end{equation}
%
In practice (i.e. in the absence of an analytical solution to this expression), this is implemented by first obtaining $R$ draws from $p\kl \ktheta |\ky, M_m \kr$ for each model, and then sampling from these model-specific posteriors with relative frequencies dictated by the computed model weights. All subsequent inference, including posterior predictive densities, is then based on these model-weighted, or model-averaged, draws.\\
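This mixing step is easy to sketch in code. The following Python fragment (the course scripts are in Matlab; all names, weights, and "posterior draws" here are purely hypothetical stand-ins) mixes model-specific posterior draws according to pre-computed model weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: R draws from each of M = 3 model-specific posteriors of a
# scalar parameter theta (simulated normals stand in for real Gibbs output)
R = 10_000
all_draws = np.stack([rng.normal(loc=m, scale=1.0, size=R) for m in (0.0, 0.5, 2.0)])

# Posterior model probabilities p(M_m | y), assumed already computed
weights = np.array([0.6, 0.3, 0.1])

# For each retained draw, pick a model with probability equal to its weight,
# then keep that model's posterior draw
model_idx = rng.choice(len(weights), size=R, p=weights)
theta_bma = all_draws[model_idx, np.arange(R)]

# All subsequent inference uses these model-averaged draws; here the mean
# should land near 0.6*0.0 + 0.3*0.5 + 0.1*2.0 = 0.35
print(theta_bma.mean())
```
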
While this approach is conceptually straightforward, its implementation quickly becomes infeasible as the model space grows. Therefore, Bayesians have developed model-search methods that, for specific applications, can ``travel'' quickly through model space and identify the most promising specifications. In most cases, the model-search component simply becomes an extra step in a Gibbs Sampler.\\
In this module we will discuss the two most common approaches to model search when the task at hand is to identify ``meaningful'' explanatory variables in a regression model: Stochastic Search Variable Selection (\emph{SSVS}) \cite{georgemcculloch1993}, and Markov-Chain Monte-Carlo Model Composition ($MC^3$) \cite{madiganyork1995}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Stochastic Search Variable Selection}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The SSVS algorithm has the following salient features:
\begin{enumerate}
\item Variable inclusion probabilities are directly modeled via a mixture prior for each coefficient
\item Implementation proceeds via a standard Gibbs Sampler without MH steps
\item At each iteration, the full model (with all possible variables) is evaluated
\item Irrelevant variables will end up with a coefficient that has both posterior mean and variance close to zero
\item At each iteration, the algorithm records whether a given coefficient was deemed ``important'' or ``irrelevant''
\item This information can then be used to compute posterior inclusion probabilities for each coefficient and posterior model probabilities
\item The SSVS algorithm does \emph{not} lend itself to a direct derivation of model-averaged posterior distributions.\\
Consider a standard linear regression model with an intercept $\alpha$ and a potentially large number of slope coefficients $\beta_j,\medspace j=1\ldots k-1$. The sample likelihood is given as:
%
\begin{equation}
\label{samplingDistribution}
\begin{split}
&p\kl \ky | \ktheta, \kX \kr = \kl 2\pi\kr^{-n/2} \kl \ksig \kr^{-n/2}
exp\kl -\frac{1}{2\sigma^2}\kl \ky-\kZ\ktheta\kr\kt \kl \ky-\kZ\ktheta\kr \kr,\quad \text{where} \\
&\kZ=\begin{bmatrix} \ki & \kX \end{bmatrix} \\
&\ktheta = \begin{bmatrix} \alpha & \beta_1 & \beta_2 & \ldots & \beta_{k-1} \end{bmatrix}\kt
\end{split}
\end{equation}
%
Assume that economic theory provides little or no guidance as to the inclusion or exclusion of the variables in $\kX$. The total number of possible models, each distinguished by a different combination of regressors, is then $2^{k-1}$. Even with a ``moderate'' number of candidate variables, this model space quickly explodes, making it infeasible to evaluate every single model.\\
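To see how quickly the space grows, and how one might enumerate it when full evaluation \emph{is} feasible, consider the following sketch (the course's \ktt{graycode} routine is in Matlab; this is merely an illustrative Python analogue, not its actual code). A gray-code ordering visits all $2^K$ inclusion vectors such that consecutive models differ by exactly one variable:

```python
def gray_code_models(K):
    """Yield all 2**K inclusion vectors; consecutive vectors differ in one spot."""
    for i in range(2 ** K):
        g = i ^ (i >> 1)                        # integer gray code
        yield [(g >> j) & 1 for j in range(K)]

models = list(gray_code_models(3))
print(len(models))                      # 8 = 2**3 models for K = 3 candidates
print(models[0], models[1], models[2])  # successive models differ in one indicator

# With K = 30 candidate regressors the space already exceeds a billion models
print(2 ** 30)                          # 1073741824
```
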
The SSVS algorithm tackles this problem via a mixture prior for each $\beta_j$. It assigns prior probability $p_0$ that $\beta_j$ comes from a \emph{healthy} normal density with mean zero and ``large'' variance $c^2 t^2$. With probability $1-p_0$, the prior density for $\beta_j$ is ``virtually degenerate'', i.e. a normal density with mean zero and variance $t^2$, where $t^2$ is close to zero. Thus:
%
\begin{equation}
\label{betaPrior}
\begin{split}
&p\kl \beta_j\kr = p_0\, \phi\kl \beta_j;0,c^2 t^2\kr + \kl 1-p_0\kr \phi\kl \beta_j;0,t^2\kr
\end{split}
\end{equation}
%
where $\phi$ denotes the normal pdf. In theory, the variance ``tuners'' $c^2$ and $t^2$ could be specific to each coefficient, perhaps based on ``thresholds of practical significance'' \cite{georgemcculloch1997}. For simplicity, we assign the same values of $c^2$ and $t^2$ to all coefficients. The larger the difference in magnitude between $c$ and $t$, the more sharply the algorithm can discriminate between relevant and irrelevant variables. However, to avoid convergence problems in the sampler, \citeasnoun{georgemcculloch1993} and \citeasnoun{georgemcculloch1997} recommend against a variance ratio greater than 10,000 and against a value of $t$ ``too close to zero''.\\
The model is \emph{augmented} with the introduction of the indicator vector $\kgam =\begin{bmatrix} \gamma_1 & \gamma_2 & \ldots & \gamma_{k-1} \end{bmatrix}\kt$, where each element is a binary 0 / 1 indicator, with ``1'' signaling that the corresponding $\beta_j$ comes from the ``healthy'' prior density. This augmentation facilitates model implementation and allows us to capture inclusion counts for each coefficient as a by-product of the Gibbs Sampler.\\
Thus, we can re-express the prior of $\beta_j$ with the following hierarchical structure:
%
\begin{equation}
\label{betaPriorHierarchical}
\begin{split}
&p\kl \beta_j |\gamma_j\kr = \gamma_j \phi\kl \beta_j;0,c^2 t^2\kr + (1-\gamma_j)\phi\kl \beta_j;0,t^2\kr, \quad \\
&\gamma_j \sim Bin \kl p_0,1\kr
\end{split}
\end{equation}
%
where $Bin \kl p_0,1\kr$ denotes the Binomial distribution with success probability $p_0$ and a single trial (essentially a Bernoulli density).
Choosing a normal prior for the intercept, and the standard inverse-gamma prior for the variance of the regression error with shape $\nu_0$ and scale $\tau_0$ yields the following augmented joint posterior kernel:
%
\begin{equation}
\label{jointPosterior}
\begin{split}
&p\kl \ktheta,\ksig,\kgam |\ky,\kZ \kr \propto p\kl \alpha\kr p\kl \kbeta |\kgam \kr p\kl \kgam \kr p\kl \ksig \kr p\kl \ky | \ktheta,\ksig,\kZ \kr \propto\\
&exp\kl -\frac{1}{2V_\alpha} \kl \alpha - \mu_{\alpha}\kr^2\kr *\\
&\prod_{j=1}^{k-1} \kll\gamma_j exp\kl -\frac{1}{2c^2 t^2} \kl \beta_j - \mu_{\beta_j}\kr^2\kr + \kl 1-\gamma_j\kr exp\kl -\frac{1}{2t^2} \kl \beta_j - \mu_{\beta_j}\kr^2\kr \krr*\\
&\prod_{j=1}^{k-1} p_0^{\gamma_j} \kl 1- p_0\kr^{\kl 1-\gamma_j\kr} *\\
&\kl \ksig \kr ^{\frac{-n-2\nu_0-2}{2}} exp\kl -\frac{1}{\ksig}\kl \tau_0\kr\kr *\\
& exp \kl -\tfrac{1}{2\ksig}\kl \ky-\kZ\ktheta\kr\kt \kl\ky-\kZ\ktheta\kr \kr
\end{split}
\end{equation}
%
Conditional on $\kgam$, the intercept $\alpha$ and the coefficient vector $\kbeta$ can be drawn jointly as vector $\ktheta$ if we collect their variances into a single matrix $\kV_0$, given as:
\begin{equation}
\label{V0}
\begin{split}
&\kV_0 = diag \begin{bmatrix} V_\alpha & V_1 & V_2 & \ldots & V_{k-1} \end{bmatrix},\quad \text{where} \\
&V_j = \gamma_j c^2 t^2 + \kl 1-\gamma_j\kr t^2
\end{split}
\end{equation}
This then leads to the standard Gibbs Sampler for the linear regression model with two important modifications: (i) $\kgam$ is drawn in a third step of the sampler, and (ii) the prior variance $\kV_0$ is adjusted accordingly at every iteration. Specifically, we have:
%
\begin{equation}
\label{condPostTheta}
\begin{split}
&\ktheta|\ksig,\kgam,\ky,\kZ \sim n\kl \kmu_1,\kV_1\kr,\quad \text{with}\\
&\kV_1 = \kl \kV_0^{-1}+ \tfrac{1}{\ksig}\kZ\kt\kZ\kr^{-1},\quad \text{and}\\
&\kmu_1 = \kV_1*\kl \kV_0^{-1}\kmu_0 + \tfrac{1}{\ksig}\kZ\kt\ky\kr
\end{split}
\end{equation}
%
where $\kV_0$ is a function of $\kgam$, as shown in (\ref{V0}). Furthermore:
%
\begin{equation}
\label{condPostSig}
\begin{split}
&\ksig |\ktheta,\ky,\kZ \sim ig\kl \nu_1,\tau_1\kr, \quad \text{with} \\
&\nu_1= \frac {2\nu_0+n}{2},\quad \text{and} \\
&\tau_1 = \tau_0 + \tfrac{1}{2} \kl \ky - \kZ\ktheta\kr\kt \kl \ky-\kZ\ktheta\kr
\end{split}
\end{equation}
%
The third step of the Gibbs Sampler is the core of the SSVS engine: it updates the indicator vector $\kgam$, which, in turn, feeds into an updated prior variance $\kV_0$.
%
Formally, the conditional posterior kernel for $\gamma_j$ is given as
\begin{equation}
\label{CondPostgamma}
\begin{split}
&p\kl \gamma_j | \beta_j \kr \propto \\
&p_0^{\gamma_j} \kl 1- p_0\kr^{\kl 1-\gamma_j\kr} \kll \gamma_j \phi \kl \beta_j;0,c^2 t^2\kr + \kl 1-\gamma_j\kr \phi\kl \beta_j;0,t^2\kr \krr
\end{split}
\end{equation}
Draws can be obtained via the following strategy:
First, we update the probability that a given $\gamma_j$ takes a value of 1, i.e. that the associated $\beta_j$ comes from the ``healthy'' normal density:
%
\begin{equation}
\label{gammaUpdate}
\begin{split}
&p_{1,j}=pr\kl \gamma_j=1 | \beta_j \kr = \frac{pr\kl \gamma_j=1,\beta_j\kr}{p\kl \beta_j\kr}=\\
&\frac{p\kl \beta_j | \gamma_j=1\kr pr\kl \gamma_j=1\kr}{p\kl \beta_j\kr} =\\
&\frac{p_0*\phi\kl \beta_j;0,c^2t^2\kr}{p_0*\phi\kl \beta_j;0,c^2t^2\kr + \kl 1-p_0\kr \phi\kl \beta_j;0,t^2\kr}
\end{split}
\end{equation}
%
We then draw a random uniform $u_j$ and compare it to $p_{1,j}$. We set $\gamma_j=1$ if $p_{1,j}>u_j$, and to zero otherwise. Formally, this step can be expressed as
\begin{equation}
\label{gammaUpdate2}
\gamma_j |\kbeta \sim Bin\kl p_{1,j},1 \kr
\end{equation}
%
with $p_{1,j}$ given in (\ref{gammaUpdate}).\\
Matlab scripts \ktt{mod8\_SSVS\_data, mod8\_SSVS} and function \ktt{gs\_SSVS} show the implementation of this approach. The derivation of posterior inclusion probabilities and posterior model probabilities is illustrated in script \ktt{mod8\_SSVS\_modProb}.
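For readers who prefer code to prose, the three-step sampler in (\ref{condPostTheta})--(\ref{gammaUpdate2}) can be sketched as follows. This is a Python analogue of the Matlab routines, not their actual code; the data, prior values, and tuner settings are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two relevant slopes, two irrelevant ones
n, k1 = 200, 4                                   # k1 = k-1 candidate slopes
X = rng.normal(size=(n, k1))
y = 0.5 + X @ np.array([1.5, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=n)
Z = np.column_stack([np.ones(n), X])
ZtZ, Zty = Z.T @ Z, Z.T @ y

# Priors: small t2, with the variance ratio c2 at the recommended bound
t2, c2, p0 = 0.05**2, 10_000.0, 0.5
nu0, tau0 = 2.0, 1.0
mu0 = np.zeros(k1 + 1)

R, burn = 3000, 500
sig2, gam = 1.0, np.ones(k1)
keep_gam = np.zeros((R, k1))

for r in range(R + burn):
    # Step 1: theta | sig2, gam -- the prior variance V0 depends on gam
    V0 = np.concatenate([[100.0], np.where(gam == 1, c2 * t2, t2)])
    V1 = np.linalg.inv(np.diag(1.0 / V0) + ZtZ / sig2)
    mu1 = V1 @ (mu0 / V0 + Zty / sig2)
    theta = rng.multivariate_normal(mu1, V1)

    # Step 2: sig2 | theta -- inverse gamma with shape nu1, scale tau1
    e = y - Z @ theta
    sig2 = 1.0 / rng.gamma((2 * nu0 + n) / 2, 1.0 / (tau0 + 0.5 * e @ e))

    # Step 3: gam_j | beta_j -- Bernoulli with updated probability p1_j
    beta = theta[1:]
    dens_hi = np.exp(-0.5 * beta**2 / (c2 * t2)) / np.sqrt(c2 * t2)
    dens_lo = np.exp(-0.5 * beta**2 / t2) / np.sqrt(t2)
    p1 = p0 * dens_hi / (p0 * dens_hi + (1 - p0) * dens_lo)
    gam = (rng.uniform(size=k1) < p1).astype(float)

    if r >= burn:
        keep_gam[r - burn] = gam

incl_prob = keep_gam.mean(axis=0)   # posterior inclusion probabilities
print(incl_prob)                    # high for the first two slopes, low otherwise
```

Note how the relevant coefficients end up with inclusion probabilities near one, while the irrelevant ones are almost always assigned to the ``virtually degenerate'' component.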
\end{enumerate}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Markov-Chain Monte Carlo Model Composition ($MC^3$)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This is an alternative approach to model search and selection, based on comparing models with different parameter spaces. It has the following salient features:
\begin{enumerate}
\item Models are defined as different combinations of included variables.
\item Each model receives its own \emph{model prior}.
\item Implementation proceeds via a standard Gibbs Sampler \emph{with} an MH step for model selection.
\item At each iteration, a candidate model is drawn from the \emph{neighborhood} of the current model. It is accepted or rejected with a specific probability (as is standard in an MH algorithm).
\item The algorithm is designed to ``visit'' models with many relevant variables more often.
\item Posterior model probabilities can be computed empirically and analytically. The resulting comparison is used as one of the diagnostic tools to assess convergence.
\item Model-averaged results can be obtained directly from the output of the Gibbs Sampler.
\end{enumerate}
As before, the constant term will be included in every model. This poses a bit of a problem as it is difficult to assign a prior to it without using the data. For example, the algorithm may choose a model that has no explanatory variables. In that case the intercept, call it $\alpha$, becomes the prior expectation of the dependent variable. What should we use for that value?\\
As a result, two preparatory steps are taken. First, the intercept is given an improper prior (one that does not integrate to one), i.e.
%
\begin{equation}
p\kl \alpha \kr \propto 1
\end{equation}
%
However, this would break the conjugacy of the prior for the remaining coefficients $\kbeta$ and the error variance $\ksig$, which is needed to make the MH step work. This can be overcome via the second preparatory intervention: de-meaning the regressors (subtracting the sample mean from each candidate explanatory variable). The ``variable'' associated with the intercept, i.e. the column of ones, is then orthogonal to the remaining regressor matrix, i.e.
%
\begin{equation}
\ki\kt\kX = \kzero
\end{equation}
%
Note that de-meaning does not change the interpretation of the slope coefficients $\kbeta$.\\
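Both claims, the orthogonality and the invariance of the slopes, are easy to verify numerically. A short Python check (hypothetical data; the course code is Matlab):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
ones = np.ones(n)
X = rng.normal(loc=3.0, size=(n, 2))
Xd = X - X.mean(axis=0)               # de-mean each candidate regressor

print(np.allclose(ones @ Xd, 0.0))    # True: i'X = 0 after de-meaning

# OLS fits with raw and de-meaned regressors: slopes are identical,
# and with de-meaned regressors the intercept estimate is simply y-bar
y = 1.0 + X @ np.array([0.5, -0.2]) + rng.normal(size=n)
b_raw = np.linalg.lstsq(np.column_stack([ones, X]), y, rcond=None)[0]
b_dem = np.linalg.lstsq(np.column_stack([ones, Xd]), y, rcond=None)[0]
print(np.allclose(b_raw[1:], b_dem[1:]))   # True
print(np.isclose(b_dem[0], y.mean()))      # True
```

The last line foreshadows the posterior result derived below: with de-meaned regressors, the intercept centers on $\bar{y}$.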
Letting $\ktheta=\begin{bmatrix} \alpha & \kbeta\kt \end{bmatrix}\kt$ and $\kZ = \begin{bmatrix} \ki & \kX \end{bmatrix}$, we can now write
%
\begin{equation}
\begin{split}
&\kl \ky - \kZ\ktheta\kr\kt \kl \ky - \kZ\ktheta\kr = \kl \ky - \ki\alpha -\kX\kbeta\kr\kt \kl \ky - \ki\alpha -\kX\kbeta\kr=\\
&\kl \ky -\kX\kbeta\kr\kt \kl \ky -\kX\kbeta\kr - 2\alpha\ki\kt\ky + 2\alpha\ki\kt\kX\kbeta+\ki\kt\ki\alpha^2=\\
&\kl \ky -\kX\kbeta\kr\kt \kl \ky -\kX\kbeta\kr - 2\alpha\ki\kt\ky +\ki\kt\ki\alpha^2
\end{split}
\end{equation}
%
The second-to-last term in the second line drops out due to the imposed orthogonality via de-meaning. This is crucial in de-linking the posterior of the intercept from the rest of the model.\\
We follow standard convention (see e.g. \citeasnoun{fernandezetal2001}) and choose an improper prior for the error variance and a conjugate $g$-prior for the $k-1$ by $1$ coefficient vector $\kbeta$. Such a $g$-prior greatly reduces the number of parameters that need to be determined by the researcher a priori, facilitates the interpretation of posterior results, and assures speedy model evaluation as part of the search process.\footnote{Specifically, the conjugate $g$-prior, along with our other prior choices, assures an analytical expression for the marginal likelihood. This, in turn, is an integral component of the model selection step of the posterior simulator, as discussed below in more detail.} Thus, we have
%
\begin{equation}
\begin{split}
&p\kl \ksig \kr \propto \frac{1}{\ksig}\\
&p\kl \kbeta |\ksig\kr = n\kl \kmu_0, g\ksig \kl\kX\kt\kX\kr ^{-1}\kr
\end{split}
\end{equation}
%
where $n\kdot$ denotes the $\kl k-1\kr$-variate normal density. We set $\kmu_0 = \kzero$.\\
Let $\kgam$ be a $k-1$ by 1 vector of binary indicators that signal if a given covariate $\kx_j$ is included ($\gamma_j=1$) or excluded ($\gamma_j=0$) for a given model. Thus, we can express conditionality on a specific model, i.e. a specific mix of regressors, as conditionality on $\kgam$.\\
Building on the results of the conjugate normal regression model (and letting $\kV_0 = \kgX$), the conditional posterior for $\kbeta$ is again a ``well-behaved'' normal density, i.e.
%
\begin{equation}
\begin{split}
&\kbeta |\ksig,\kgam,\ky,\kX \sim n\kl \kmu_1, \kV_1\kr \quad \text{with}\\
&\kV_1 = \ksig\frac{g}{1+g}\kl \kX\kt\kX\kr^{-1}\\
&\kmu_1 = \frac{g}{1+g}\kl \kX\kt\kX\kr^{-1}\kX\kt\ky
\end{split}
\end{equation}
%
Similarly, the conditional posterior distribution of $\alpha$ can be immediately derived as
%
\begin{equation}
\begin{split}
&p\kl \alpha |\ksig,\ky\kr \propto \\
&exp\kl -\frac{1}{2\ksig}\kl n\alpha^2 - 2\alpha \sum_{i=1}^{n}y_i\kr\kr=\\
&exp\kl -\tfrac{1}{2}\kl\tfrac{\ksig}{n}\kr^{-1}\alpha^2 + \alpha \kl\tfrac{\ksig}{n}\kr^{-1}\bar{y}\kr
\end{split}
\end{equation}
%
We recognize this as the kernel of the normal density with mean $\bar{y}$ and variance $\tfrac{\ksig}{n}$.\\
The \emph{marginal} posterior for the error variance follows again an inverse-gamma distribution. Specifically,
%
\begin{equation}
\begin{split}
&\ksig |\kgam,\ky,\kX \sim ig\kl \nu_1, \tau_1\kr \quad \text{with}\\
&\nu_1 = \frac{n-1}{2}\\
&\tau_1 = \tfrac{1}{2}\kl \frac{g}{1+g}\ky\kt\kM_X\ky+
\frac{1}{1+g}\kl \ky-\ki\bar{y}\kr\kt \kl \ky-\ki\bar{y}\kr - \frac{ng}{1+g}\bar{y}^2\kr \quad \text{where}\\
&\kM_X = \kI-\kX\kl \kX\kt\kX\kr^{-1}\kX\kt
\end{split}
\end{equation}
%
The marginal, model-conditioned likelihood is proportional to the following expression:
%
\begin{equation}
\begin{split}
&p\kl \ky |\kgam, \kX\kr \propto \kl\frac{1}{1+g}\kr^{k/2} \tau_1^{-\nu_1} =\\
&\kl\frac{1}{1+g}\kr^{k/2} \kl\tfrac{1}{2}\kl \frac{g}{1+g}\ky\kt\kM_X\ky+
\frac{1}{1+g}\kl \ky-\ki\bar{y}\kr\kt \kl \ky-\ki\bar{y}\kr -\frac{ng}{1+g}\bar{y}^2\kr\kr^{-\kl\frac{n-1}{2}\kr}
\end{split}
\end{equation}
%
where $k$ is the dimension of $\kbeta$ under model $\kgam$.
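Because this expression is available in closed form, evaluating a model is cheap, which is exactly what makes the search feasible. A Python sketch of the computation (de-meaned regressors assumed; the data and the choice $g=n$ below are purely illustrative):

```python
import numpy as np

def log_ml(y, X_g, g):
    """Log marginal likelihood, up to a model-independent constant, for the
    g-prior setup; X_g holds the de-meaned regressors included in the model."""
    n, k = len(y), X_g.shape[1]          # k = dimension of beta under this model
    yd = y - y.mean()
    if k == 0:
        tau1 = 0.5 * (yd @ yd)           # null model: no fitted component
    else:
        My = y - X_g @ np.linalg.solve(X_g.T @ X_g, X_g.T @ y)   # M_X y
        tau1 = 0.5 * (g / (1 + g) * (y @ My) + (yd @ yd) / (1 + g)
                      - n * g / (1 + g) * y.mean() ** 2)
    return -0.5 * k * np.log(1 + g) - 0.5 * (n - 1) * np.log(tau1)

# Hypothetical check: the model holding the truly relevant regressor should
# dominate both the model with the irrelevant one and the null model
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
X -= X.mean(axis=0)
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)
g = float(n)
print(log_ml(y, X[:, :1], g) > log_ml(y, X[:, 1:], g))   # True
print(log_ml(y, X[:, :1], g) > log_ml(y, X[:, :0], g))   # True
```
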
%
The key step in the $MC^3$ algorithm is the model selection, implemented through draws of $\kgam$. This requires a Metropolis-Hastings (MH) step with acceptance probability
%
\begin{equation}
\label{MHalpha}
\begin{split}
a=min\kl 1, \frac{q\kl \kgam_0|\kgam_c\kr p\kl \ky |\kgam_c, \kX\kr p\kl \kgam_c \kr}{q\kl \kgam_c|\kgam_0\kr p\kl \ky |\kgam_0, \kX\kr p\kl \kgam_0 \kr}\kr
\end{split}
\end{equation}
%
where, as before, ``0'' denotes the old or current model, and ``c'' denotes the new or candidate model. The function $q\kl \kgam_i|\kgam_j\kr$ is the candidate generating density for model $\kgam_i$, given model $\kgam_j$.\\
In our case, this boils down to the discrete probability of drawing model $\kgam_i$ from the neighborhood of current model $\kgam_j$. Thus, the intuition for this proposal is similar to that behind the random-walk MH algorithm encountered in the previous chapter. In our simple case of regressor selection, the neighborhood comprises the current model plus the $k-1$ models that add or delete a single variable, for a total of $k$ candidate models at every iteration. This implies that $q\kl \kgam_i|\kgam_j\kr = 1/k$ for any $i,j$, and the $q\kdot$ terms drop out of the ratio in (\ref{MHalpha}).\\
If, in addition, the model priors $p\kl \kgam_i \kr$ are equal for all models, the MH acceptance formula reduces to a simple Bayes Factor of model $c$ versus model 0:
%
\begin{equation}
\begin{split}
a=min\kl 1, \frac{p\kl \ky |\kgam_c, \kX\kr}{p\kl \ky |\kgam_0, \kX\kr}\kr
\end{split}
\end{equation}
%
Matlab script \ktt{mod8\_MC3} with function \ktt{gs\_MC3} illustrates this procedure. Since $g$ is the only tuning parameter in the algorithm, care must be taken in its choice. \citeasnoun{fernandezetal2001} discuss several options and find that setting $g=n$ performs well in most general settings. We follow this suggestion in our code.\\
The programming code includes two other important features: (i) after each draw of $\kgam$, the regressor matrix must be adjusted accordingly (columns of $\kX$ added or deleted), and (ii) all collected draws of $\kbeta$ should be $\kl k-1\kr \times 1$, with the actually drawn $\beta$'s placed in the correct positions and zeros everywhere else. The resulting sequence of draws will then reflect how often a given coefficient was set to zero (i.e. how often a given regressor was omitted) by the algorithm.\\
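Putting the pieces together, a stripped-down Python analogue of the $MC^3$ sampler (the course implementation is the Matlab function \ktt{gs\_MC3}; the data and settings below are hypothetical) might look as follows. Note the neighborhood proposal, the Bayes-Factor acceptance step under equal model priors, and the zero-padded storage of the $\kbeta$ draws:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 5 candidate regressors, only the first two matter
n, K = 200, 5
X = rng.normal(size=(n, K))
X -= X.mean(axis=0)                            # de-mean so that i'X = 0
y = 1.0 + X @ np.array([1.0, -0.8, 0, 0, 0]) + rng.normal(size=n)
g, yd, ybar = float(n), y - y.mean(), y.mean()

def tau1_of(Xg):
    """Posterior scale tau_1 for the model with included regressors Xg."""
    My = y - Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
    return 0.5 * (g/(1+g) * (y @ My) + (yd @ yd)/(1+g) - n * g/(1+g) * ybar**2)

def log_ml(gam):
    """Log marginal likelihood (up to a constant) of the model defined by gam."""
    inc = gam.astype(bool)
    if not inc.any():
        return -0.5 * (n - 1) * np.log(0.5 * (yd @ yd))
    return -0.5 * inc.sum() * np.log(1+g) - 0.5 * (n-1) * np.log(tau1_of(X[:, inc]))

R, burn = 4000, 500
gam = np.ones(K)                               # start from the full model
lml = log_ml(gam)
keep_gam, keep_beta = np.zeros((R, K)), np.zeros((R, K))

for r in range(R + burn):
    # Propose uniformly from the neighborhood: flip one indicator, or stay put
    cand = gam.copy()
    j = rng.integers(K + 1)
    if j < K:
        cand[j] = 1 - cand[j]
    lml_c = log_ml(cand)
    # Equal model priors: accept with the Bayes Factor of candidate vs. current
    if np.log(rng.uniform()) < lml_c - lml:
        gam, lml = cand, lml_c

    if r >= burn:
        keep_gam[r - burn] = gam
        inc = gam.astype(bool)
        if inc.any():
            # Draw (sig2, beta) under the accepted model; zero-pad exclusions
            Xg = X[:, inc]
            XtXi = np.linalg.inv(Xg.T @ Xg)
            bhat = XtXi @ (Xg.T @ y)
            sig2 = 1.0 / rng.gamma((n - 1) / 2, 1.0 / tau1_of(Xg))
            keep_beta[r - burn, inc] = rng.multivariate_normal(
                g/(1+g) * bhat, sig2 * g/(1+g) * XtXi)

incl_prob = keep_gam.mean(axis=0)
print(incl_prob)                 # near 1 for the first two slopes, small otherwise
print(keep_beta.mean(axis=0))    # model-averaged posterior means
```

The zero-padded rows of \texttt{keep\_beta} are what makes the resulting draws directly model-averaged, as discussed next.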
Conveniently, and in contrast to the SSVS method, the posterior draws of model parameters are already ``model-averaged'', so model-averaged inference simply proceeds by evaluating the entire set of posterior draws as usual.\\
The draws of $\kgam$ can be used to (i) examine the posterior inclusion probabilities for each coefficient, (ii) identify the most frequently visited models, and (iii) perform a convergence check as suggested by \citeasnoun{fernandezetal2001} by comparing empirical model probabilities to analytical probabilities (based on the known form of the marginal likelihood). Specifically, the correlation coefficient between these two sets of probabilities for, say, the top 1000 visited models should be close to 1 if the algorithm has converged. Naturally, our parameter-specific convergence tools (IEF, CD) still apply.\\
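The convergence check itself can be demonstrated in isolation. The Python sketch below uses made-up log marginal likelihoods as stand-ins for the analytical expressions, runs an $MC^3$-style chain over a tiny model space, and correlates empirical visit frequencies with the known model probabilities:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)

# Stand-in log marginal likelihoods for all 2**3 models of a 3-variable space
models = [np.array(m) for m in product((0, 1), repeat=3)]
log_w = {tuple(m): lw for m, lw in zip(models, rng.normal(scale=1.5, size=8))}
w = np.exp(np.array([log_w[tuple(m)] for m in models]))
p_analytical = w / w.sum()                     # analytical model probabilities

# MC3-style chain: flip one of 3 indicators, or stay (neighborhood of size 4)
index = {tuple(m): i for i, m in enumerate(models)}
gam, counts, R = models[0].copy(), np.zeros(8), 100_000
for r in range(R):
    cand = gam.copy()
    j = rng.integers(4)
    if j < 3:
        cand[j] = 1 - cand[j]
    if np.log(rng.uniform()) < log_w[tuple(cand)] - log_w[tuple(gam)]:
        gam = cand
    counts[index[tuple(gam)]] += 1

p_empirical = counts / R
corr = np.corrcoef(p_empirical, p_analytical)[0, 1]
print(corr)                        # close to 1 once the chain has converged
```
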
Matlab script \ktt{mod8\_MC3\_modProb} shows the derivation of empirical model probabilities. Script \ktt{mod8\_MC3\_convTest} shows the derivation of \emph{analytical} model probabilities and the implementation of this convergence check.
Matlab scripts \ktt{mod8\_MC3\_growth}, \ktt{mod8\_MC3\_growth\_modProb}, and \ktt{mod8\_MC3\_growth\_convTest} apply this framework to the macroeconomic growth regression discussed in \citeasnoun{fernandezetal2001a}.
\bibliography{AAEC6984bib}
\bibliographystyle{dcu}
\end{document}