# Basic Bayesian techniques
## Bayes' theorem
Bayes' theorem is a fundamental result in probability theory and its use is pervasive in Gravity.jl. It allows us to compute the probability of an event $A$ given the occurrence of another event $B$. It is written as
\[P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \; .\]
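As a quick numerical check, the theorem can be applied directly. The sketch below (plain Python, with made-up probabilities purely for illustration) computes $P(A \mid B)$ from a prior $P(A)$ and the two conditional probabilities $P(B \mid A)$ and $P(B \mid \neg A)$, obtaining $P(B)$ from the law of total probability:

```python
# Illustrative numbers only (not from any real analysis).
p_A = 0.01             # prior P(A)
p_B_given_A = 0.95     # likelihood P(B | A)
p_B_given_notA = 0.05  # P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|¬A) P(¬A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)
```

Even with a likelihood of 0.95, the small prior keeps the posterior probability of $A$ modest, a classic illustration of how the prior enters the result.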
In the context of gravitational lensing, we are interested in the probability of a set of parameters $x$ given the data $D$. We can write this probability as
\[P(x \mid D) = \frac{P(D \mid x) P(x)}{P(D)} \; .\]
The quantity $P(x \mid D)$ is called the posterior distribution of $x$ given $D$, and is generally the final result we are interested in. It depends on $P(D \mid x)$, i.e. the likelihood of the data given the parameters, and on $P(x)$, the so-called prior distribution of the parameters. The likelihood generally encapsulates both the physical model (in our case, a multi-plane lensing system) and the statistical properties of the measurements. The prior, instead, is a description of our beliefs about the parameters $x$ before even looking at the data.
The quantity $P(D)$ is called the evidence, and can be expressed as a marginalized likelihood:
\[P(D) = \int P(D \mid x') P(x') \, \mathrm{d} x' \; ,\]
where the integral is carried out over the relevant domain of the parameters $x$. The evidence is a normalization factor that ensures that the posterior is a proper probability distribution. It is usually very difficult to compute analytically, and it is often ignored in practice. It is, however, the single most important quantity for model comparison: it plays a key role in assessing the goodness of a model within the Bayesian framework. For this reason, Gravity.jl provides various tools to evaluate the evidence.
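To make the definition concrete, the following sketch (plain Python, not the Gravity.jl API; all numbers are made up) evaluates the evidence for a one-dimensional toy model with a Gaussian likelihood $P(D \mid x) = \mathcal{N}(d; x, \sigma^2)$ and a Gaussian prior $P(x) = \mathcal{N}(x; \mu_0, \tau^2)$ by direct quadrature, and checks it against the closed-form result $P(D) = \mathcal{N}(d; \mu_0, \sigma^2 + \tau^2)$ valid for this conjugate pair:

```python
import math

def gauss(y, mu, s):
    # Normal density N(y; mu, s^2)
    return math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

d, sigma = 1.3, 0.5   # datum and measurement error (made-up values)
mu0, tau = 0.0, 2.0   # prior mean and width (made-up values)

# P(D) = ∫ P(D|x') P(x') dx', trapezoidal rule on a grid wide enough
# to capture both the likelihood and the prior
xs = [-10 + 20 * i / 4000 for i in range(4001)]
dx = xs[1] - xs[0]
integrand = [gauss(d, x, sigma) * gauss(x, mu0, tau) for x in xs]
evidence = sum(integrand) * dx - 0.5 * (integrand[0] + integrand[-1]) * dx

# Closed form for this conjugate pair: P(D) = N(d; mu0, sigma^2 + tau^2)
analytic = gauss(d, mu0, math.sqrt(sigma**2 + tau**2))
print(evidence, analytic)
```

In realistic, higher-dimensional lens models this brute-force quadrature is of course infeasible, which is precisely why dedicated evidence-estimation tools are needed.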
## A note on the Bayesian approach to marginalized parameters
When performing a Bayesian analysis of a complex system, we are often uninterested in a set of nuisance parameters that are nevertheless needed to model the problem. A relevant example in our context is the case of source parameters in the modeling of a gravitational lens system.
In general, we can consider the likelihood associated with a gravitational lens system as formed by (at least) two sets of parameters: source parameters $S$ (in the simplest case, the positions of all sources; in more complicated cases, also their luminosities and shapes) and lens parameters $L$. In more complex cases we might have additional parameters, for example those associated with the cosmological model, but for simplicity we will ignore these cases here.
Let us call $D$ the data obtained (for example $D = \{ \hat\theta_n \}$ if we limit our analysis to the image positions). We write the likelihood, i.e. the conditional probability of the data given the parameters, as $P(D \mid S, L)$. This quantity is then used in Bayes' theorem, so that we have
\[P(L \mid D) = \int P(S, L \mid D) \, \mathrm{d}S = \frac{P(L) \int \mathrm{d}S \, P(S) P(D \mid S, L)} {\int \mathrm{d}L' \, P(L') \int \mathrm{d}S' \, P(D \mid S', L') P(S')} \; .\]
As a result, if we write the conditional distribution $P(D \mid L)$ as a marginalization over $S$ of the full likelihood,
\[P(D \mid L) = \int P(D \mid S, L) P(S) \, \mathrm{d}S \; ,\]
we see that we can recover the usual form of Bayes' theorem
\[P(L \mid D) = \frac{P(D \mid L) P(L)}{\int P(D \mid L') P(L') \, \mathrm{d}L'} \; .\]
In a sense, this marginalization corresponds to the computation of a partial evidence over the source position. The same result can be obtained by considering the definition of the conditional probability:
\[P(D \mid L) = \frac{P(D, L)}{P(L)} = \frac{\int P(D, L \mid S) P(S) \, \mathrm{d}S}{P(L)} = \int \frac{P(D, L \mid S)}{P(L)} P(S) \, \mathrm{d}S = \int P(D \mid S, L) P(S) \, \mathrm{d}S \; .\]
Therefore, it is sensible to compute this marginalized conditional distribution. This can be done by adopting the technique described in the following subsection.
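As an illustration of this marginalization, the toy sketch below (plain Python, not the Gravity.jl API; the "lens mapping" $\theta = S + L$ and all numbers are invented for the example) integrates out a scalar source parameter $S$ with a Gaussian prior to obtain $P(D \mid L)$ by quadrature, and compares the result with the closed form available for this linear-Gaussian toy:

```python
import math

def gauss(y, mu, s):
    # Normal density N(y; mu, s^2)
    return math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

d, sigma = 2.0, 0.3                # observed image position and its error
s_prior_mu, s_prior_sd = 0.0, 1.0  # Gaussian prior on the source S

def likelihood(d, S, L):
    # P(D | S, L): Gaussian measurement of the toy image position S + L
    return gauss(d, S + L, sigma)

def marginal_likelihood(L, n=2000, lo=-6.0, hi=6.0):
    # P(D | L) = ∫ P(D | S, L) P(S) dS  (trapezoidal rule over S)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        S = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * likelihood(d, S, L) * gauss(S, s_prior_mu, s_prior_sd)
    return total * h

# For this linear-Gaussian toy the integral is known in closed form:
# P(D | L) = N(d; L + mu_S, sigma^2 + sd_S^2)
L = 1.5
analytic = gauss(d, L + s_prior_mu, math.sqrt(sigma**2 + s_prior_sd**2))
print(marginal_likelihood(L), analytic)
```

The function `marginal_likelihood` plays the role of the "partial evidence" $P(D \mid L)$ discussed above: once it is available, the analysis proceeds with the lens parameters only.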
### Conjugate priors
It is sensible to assume that the measurements of the parameters characterizing our point-images, such as their position, their luminosity, or their shape, follow simple probability distributions. For example, position measurements can be taken to be distributed as a bivariate Gaussian. In these situations, with a suitable choice of the prior (the so-called conjugate prior), we can make sure that the posterior belongs to the same family as the prior. This greatly simplifies the calculations and allows us to compute analytically the required evidence, as explained below.
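For the one-dimensional Gaussian case, the conjugate update can be written down explicitly. The sketch below (plain Python, not the Gravity.jl API) combines a Gaussian likelihood $\mathcal{N}(d; x, \sigma^2)$ with its conjugate Gaussian prior $\mathcal{N}(x; \mu_0, \tau^2)$: the posterior is again Gaussian, with mean and variance given by the standard precision-weighted formulas, and the evidence is available in closed form:

```python
import math

def conjugate_update(d, sigma, mu0, tau):
    # Gaussian likelihood N(d; x, sigma^2) with conjugate Gaussian prior
    # N(x; mu0, tau^2): the posterior is N(x; mu, var) with
    prec = 1.0 / sigma**2 + 1.0 / tau**2      # posterior precision
    var = 1.0 / prec                          # posterior variance
    mu = var * (d / sigma**2 + mu0 / tau**2)  # precision-weighted mean
    # Evidence in closed form: P(D) = N(d; mu0, sigma^2 + tau^2)
    s = math.sqrt(sigma**2 + tau**2)
    evidence = math.exp(-0.5 * ((d - mu0) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return mu, var, evidence

# Made-up numbers purely for illustration
mu, var, ev = conjugate_update(d=1.0, sigma=0.5, mu0=0.0, tau=2.0)
print(mu, var, ev)
```

Because every step is a closed-form expression, no numerical integration is needed: this is exactly the simplification that conjugate priors buy when marginalizing over source parameters.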