Source-plane likelihood

Positions

The easiest way to evaluate the likelihood above is to avoid, as much as possible, the computation of the predicted images $\{ \theta_i \}$: this task requires inverting the non-linear lens mapping, which is non-trivial and rather time-consuming.

The approach uses a simple idea. Consider the lens equation

\[\beta = f(\theta) \; .\]

If $\beta$ is a regular value, this non-linear equation admits local inverses that we call $f^{-1}_i$: that is, each predicted image $\theta_i$ of the source at $\beta$ will be given by

\[\theta_i = f^{-1}_i(\beta) \; .\]

We can perform a first-order Taylor expansion of these equations around $f(\hat\theta_i) \equiv \hat\beta_i$ to obtain

\[\theta_i = f^{-1}_i(\beta) \approx f^{-1}_i(\hat\beta_i) + \left . \frac{\partial f^{-1}_i}{\partial \beta} \right|_{\hat\beta_i} (\beta - \hat\beta_i) = \hat\theta_i + \left . \frac{\partial f^{-1}_i}{\partial \beta} \right|_{\hat\beta_i} (\beta - \hat\beta_i) \; .\]

From the inverse function theorem, we can write the Jacobian of the inverse $f_i^{-1}$ evaluated at $\hat\beta_i$ as the inverse of the Jacobian of $f$ evaluated at the corresponding point, $\hat\theta_i$. We find therefore

\[\hat\theta_i - \theta_i = A_i^{-1} (\hat\beta_i - \beta) \; ,\]

where we have called

\[A_i = \left. \frac{\partial f}{\partial \theta} \right|_{\hat\theta_i} \; .\]

Using this approximation in the log-likelihood above we find

\[\log P(\{ \hat\theta_i \} \mid \beta) = \sum_{i=1}^I \biggl[ \frac{1}{2} \log{\left| \frac{\mathbf\Theta_i}{2\pi} \right|} - \frac{1}{2} (\hat\beta_i - \beta)^T B_i (\hat\beta_i - \beta) \biggr] \; ,\]

where $B_i \equiv A_i^{-T} \mathbf\Theta_i A_i^{-1}$ is a symmetric matrix representing the $i$-th precision $\mathbf\Theta_i$ mapped into the source plane.

Taken as a function of $\beta$, this expression can be written as a bi-variate Gaussian:

\[\log P(\{ \hat\theta_i \} \mid \beta) = \mathrm{const} - \frac{1}{2} (\beta - \bar\beta)^T B (\beta - \bar\beta) \; ,\]

where the mean $\bar\beta$ and the precision $B$ are given by

\[B = \sum_{i=1}^I A_i^{-T} \mathbf\Theta_i A_i^{-1} = \sum_{i=1}^I B_i \; ,\]

\[\bar\beta = B^{-1} \sum_{i=1}^I A_i^{-T} \mathbf\Theta_i A_i^{-1} \hat\beta_i = B^{-1} \sum_{i=1}^I B_i \hat\beta_i \; .\]

As a result, we can immediately write the maximum-likelihood solution (and its precision matrix) as $\bar\beta$ (and $B$).
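
As a minimal numerical sketch of this combination (purely illustrative and not part of the code itself: the function and variable names are ours, and we assume that the $2 \times 2$ Jacobians $A_i$, the image-plane precisions $\mathbf\Theta_i$, and the mapped positions $\hat\beta_i = f(\hat\theta_i)$ are already available), one could write:

```python
import numpy as np

def source_plane_gaussian(A_list, Theta_list, beta_hat_list):
    """Combine the image positions into the source-plane Gaussian (beta_bar, B)."""
    B = np.zeros((2, 2))
    weighted_sum = np.zeros(2)
    for A, Theta, beta_hat in zip(A_list, Theta_list, beta_hat_list):
        A_inv = np.linalg.inv(A)          # inverse lens-mapping Jacobian at theta_hat_i
        B_i = A_inv.T @ Theta @ A_inv     # image-plane precision mapped to the source plane
        B += B_i
        weighted_sum += B_i @ beta_hat
    beta_bar = np.linalg.solve(B, weighted_sum)   # B^{-1} sum_i B_i beta_hat_i
    return beta_bar, B
```

Here `np.linalg.solve` replaces an explicit inverse of $B$, a standard numerical precaution.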

In a frequentist approach, we could just substitute the maximum-likelihood solution $\beta = \bar\beta$ into the likelihood and compute its maximum-likelihood value: this, in turn, could be used in an outer optimization loop, where we would change the lensing parameters until we obtain a maximum of this source-plane likelihood.

If, instead, we adopt a Bayesian approach, we should proceed as explained above and compute the marginalized likelihood over the source positions. Since the conjugate prior in our case is also a bi-variate normal distribution, we assume that $\beta$ is distributed as

\[P(\beta \mid \hat\beta_0, B_0) = \sqrt{\biggl| \frac{B_0}{2\pi} \biggr|} \exp\left[ -\frac{1}{2} (\beta - \hat\beta_0)^T B_0 (\beta - \hat\beta_0) \right] \; .\]

Here $\hat\beta_0$ and $B_0$ are meta-parameters that describe the prior on the source position: $\hat\beta_0$ is the center of the distribution, and $B_0$ is its precision. Typically, one will start with a very loose prior, and therefore $B_0$ might be chosen as a (very) small multiple of the identity matrix.

In this approach, we apply Bayes' theorem and write the posterior distribution for $\beta$ as

\[P(\beta \mid \{ \hat\theta_i \}) = \frac{P(\{ \hat\theta_i \} \mid \beta) P(\beta \mid \hat\beta_0, B_0)} {\int P(\{ \hat\theta_i \} \mid \beta') P(\beta' \mid \hat\beta_0, B_0) \, \mathrm{d}\beta'} \; .\]

A simple calculation then shows that the numerator of the posterior, i.e. the expression $P(\{ \hat\theta_i \} \mid \beta) P(\beta \mid \hat\beta_0, B_0)$, taken as a function of $\beta$, is proportional to a multivariate normal distribution with updated meta-parameters

\[B = B_0 + \sum_{i=1}^I B_i = \sum_{i=0}^I B_i \; , \]

\[\bar\beta = B^{-1} \biggl( B_0 \hat\beta_0 + \sum_{i=1}^I B_i \hat\beta_i \biggr) = B^{-1} \sum_{i=0}^I B_i \hat\beta_i \; .\]

Note how these results are formally identical to the ones found above, with the (slight) difference that the sums now start at $i = 0$ and therefore include the prior meta-parameters.
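
In code, this bookkeeping is equally simple. Continuing the illustrative sketch above (it reuses `source_plane_gaussian` and the measurement lists from the previous block, while `B0` and `beta_hat_0` stand for hypothetical prior meta-parameters), the prior enters as a fictitious zeroth measurement with unit Jacobian:

```python
import numpy as np

# Hypothetical prior meta-parameters; the measurement lists are those of the previous sketch.
beta_bar, B = source_plane_gaussian(
    [np.eye(2)] + A_list,              # A_0 = identity for the prior "measurement"
    [B0] + Theta_list,                 # Theta_0 = B_0, the prior precision
    [beta_hat_0] + beta_hat_list,      # beta_hat_0, the prior centre
)
```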

The normalizing constant, i.e. the denominator in the posterior distribution, is generally called the evidence or the marginalized likelihood. Its logarithm can be written as

\[2 \log E = \sum_{i=0}^I \biggl[ \log{\bigg| \frac{\mathbf\Theta_i}{2\pi}\biggr| } - \hat\beta_i^T B_i \hat\beta_i \biggr] - \biggl[ \log \biggl| \frac{B}{2\pi} \biggr| - \bar\beta^T B \bar\beta \biggr] \; ,\]

or, equivalently,

\[2 \log E = \sum_{i=0}^I \biggl[ \log{\bigg| \frac{\mathbf\Theta_i}{2\pi}\biggr| } - (\hat\beta_i - \bar\beta)^T B_i (\hat\beta_i - \bar\beta) \biggr] - \log{\biggl| \frac{B}{2\pi} \biggr|} \; .\]

Note that in these expressions we have defined $\mathbf\Theta_0 \equiv B_0$, so as to include the prior normalizing constant in the same way as the other measurements.

The evidence $E$, in our Bayesian approach, replaces the likelihood, as it is effectively a likelihood marginalized over the source position (or, more generally, the source parameters).

In many cases it is convenient to consider also an evidence based on a flat prior. In our notation, this corresponds to taking the limit

\[E_0 = \lim_{\mathbf\Theta_0 \rightarrow 0} E \, \biggl| \frac{\mathbf\Theta_0}{2\pi} \biggr|^{-1/2} \; ,\]

i.e. to the use of an improper prior $P(\beta) = 1$. The expressions obtained for $\log E_0$ are formally identical to the expressions above for $\log E$, with the sums starting at $i = 1$. More explicitly:

\[2 \log E_0 = \sum_{i=1}^I \biggl[ \log{\bigg| \frac{\mathbf\Theta_i}{2\pi}\biggr| } - \hat\beta_i^T A_i^{-T} \mathbf\Theta_i A_i^{-1} \hat\beta_i \biggr] - \log{\biggl| \frac{B}{2\pi} \biggr|} + \bar\beta^T B \bar\beta \; ,\]

or, equivalently,

\[2 \log E_0 = \sum_{i=1}^I \biggl[ \log{\bigg| \frac{\mathbf\Theta_i}{2\pi}\biggr| } - (\hat\beta_i - \bar\beta)^T A_i^{-T} \mathbf\Theta_i A_i^{-1} (\hat\beta_i - \bar\beta) \biggr] - \log{\biggl| \frac{B}{2\pi} \biggr|} \; .\]

Naively, we could imagine that this expression is equivalent to substituting the maximum-likelihood solution $\beta = \bar\beta$ into the likelihood. This, however, is not the case: the last term, $\log|B/2\pi|$, is a non-trivial addition, since it depends on the Jacobian matrices of the lens mapping at the observed image positions, $\{ A_i \}$, and therefore on the chosen lens model. The use of this marginalized likelihood is controlled in the code by the flag bayesianfactor.
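
As a hedged sketch of how the flat-prior evidence could be evaluated numerically (again with illustrative names and the same assumed inputs as in the previous sketches), the second form of $2 \log E_0$ above translates directly into:

```python
import numpy as np

def log_evidence_positions_flat(A_list, Theta_list, beta_hat_list):
    """log E_0 for the image positions under an improper flat prior on beta."""
    B_list, B, weighted = [], np.zeros((2, 2)), np.zeros(2)
    for A, Theta, beta_hat in zip(A_list, Theta_list, beta_hat_list):
        A_inv = np.linalg.inv(A)
        B_i = A_inv.T @ Theta @ A_inv
        B_list.append(B_i)
        B += B_i
        weighted += B_i @ beta_hat
    beta_bar = np.linalg.solve(B, weighted)
    two_log_E0 = -np.log(np.linalg.det(B / (2.0 * np.pi)))
    for Theta, B_i, beta_hat in zip(Theta_list, B_list, beta_hat_list):
        d = beta_hat - beta_bar
        two_log_E0 += np.log(np.linalg.det(Theta / (2.0 * np.pi))) - d @ B_i @ d
    return 0.5 * two_log_E0
```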

Magnitudes

Luminous sources are characterized by their position and luminosity. Here, we prefer to describe them in terms of magnitudes, as these are more directly related to practical measurements, and because of the possible use of a specific prior (see below).

As before, we initially assume that the unknown source magnitude $M$ has a normal (Gaussian) prior:

\[P(M \mid \hat M_0, \lambda_0) = \sqrt{\frac{\lambda_0}{2\pi}} \exp \biggl[ -\frac{1}{2} \lambda_0 (M - \hat M_0)^2 \biggr] \; .\]

With a calculation similar to the one discussed above, we easily find that the posterior probability distribution for the source magnitude $M$ is a Gaussian with precision (inverse variance)

\[\lambda = \sum_{i=0}^I \lambda_i\]

and mean

\[\bar M = \lambda^{-1} \biggl[ \lambda_0 \hat M_0 + \sum_{i=1}^I \lambda_i (\hat m_i + \mathit{LM}_i) \biggr] = \lambda^{-1} \sum_{i=0}^I \lambda_i \hat M_i \; ,\]

where we have called $\hat M_i \equiv \hat m_i + \mathit{LM}_i$ for $i \geq 1$, while $\hat M_0$ is the prior mean. The corresponding evidence is

\[2 \log E = \sum_{i=0}^I \biggl[ \log \frac{\lambda_i}{2\pi} - \lambda_i \hat M_i^2 \biggr] - \biggl[ \log \frac{\lambda}{2\pi} - \lambda \bar M^2 \biggr] \; ,\]

or, equivalently,

\[2 \log E = \sum_{i=0}^I \biggl[ \log \frac{\lambda_i}{2\pi} - \lambda_i \bigl( \hat M_i - \bar M \bigr)^2 \biggr] - \log \frac{\lambda}{2\pi} \; .\]

As before, it can be sensible to consider the case of an improper flat prior and consider the limit $\lambda_0 \rightarrow 0$. As before, this will produce an evidence $E_0$ which is formally identical to the expression for $E$, with the sum starting at $i = 1$.
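
A hedged sketch of this flat-prior magnitude case (names and argument conventions are illustrative only; `m_hat` stands for the observed magnitudes $\hat m_i$, `LM` for the model terms $\mathit{LM}_i$, and `lam` for the precisions $\lambda_i$):

```python
import numpy as np

def magnitude_evidence_flat(m_hat, LM, lam):
    """Posterior mean magnitude and log E_0 under an improper flat prior."""
    M_hat = np.asarray(m_hat, dtype=float) + np.asarray(LM, dtype=float)   # hat M_i
    lam = np.asarray(lam, dtype=float)
    lam_tot = lam.sum()                                  # total precision lambda
    M_bar = (lam * M_hat).sum() / lam_tot                # posterior mean
    two_log_E0 = (np.log(lam / (2 * np.pi)) - lam * (M_hat - M_bar) ** 2).sum() \
        - np.log(lam_tot / (2 * np.pi))
    return M_bar, 0.5 * two_log_E0
```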

For magnitudes, it might also be interesting to consider the improper prior

\[P(M \mid \alpha) \propto \mathrm{e}^{\alpha M} \; .\]

This model is justified by the number counts of distant sources, which under mild hypotheses are distributed according to an exponential law with $\alpha = 0.6 \ln 10 \approx 1.38$. In this case we find

\[\lambda = \sum_{i=1}^I \lambda_i \; , \qquad \bar M = \lambda^{-1} \biggl[ \alpha + \sum_{i=1}^I \lambda_i \hat M_i \biggr] \; ,\]

while the evidence under the new prior is still

\[2 \log E_\alpha = \sum_{i=1}^I \biggl[ \log \frac{\lambda_i}{2\pi} - \lambda_i \hat M_i^2 \biggr] - \biggl[ \log \frac{\lambda}{2\pi} - \lambda \bar M^2 \biggr] \; .\]

Note how $\alpha = 0$ returns the usual expressions for a flat improper prior.
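
A short illustrative variant under this exponential number-count prior (again with made-up names; with respect to the flat-prior sketch above, only the $\alpha$ shift in the mean changes):

```python
import numpy as np

def magnitude_mean_counts_prior(m_hat, LM, lam, alpha=0.6 * np.log(10.0)):
    """Posterior mean magnitude under the exponential number-count prior."""
    M_hat = np.asarray(m_hat, dtype=float) + np.asarray(LM, dtype=float)
    lam = np.asarray(lam, dtype=float)
    lam_tot = lam.sum()
    M_bar = (alpha + (lam * M_hat).sum()) / lam_tot      # alpha = 0 recovers the flat prior
    return M_bar, lam_tot
```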

Quadrupole measurements

Let us call $\mathbf{S}$ the inverse of the source quadrupole. Since $\langle \hat Q_i \rangle = \nu_i \mathbf{W}_i^{-1}$, and since quadrupole moments transform as rank-two tensors, we have

\[\mathbf{W}_i = \nu_i A_i^{T} \mathbf{S} A_i \; .\]

If we insert this equation in the likelihood for the quadrupole moments we find

\[\log P \bigl(\{ \hat Q_i \} \mid \mathbf{S}, \{ \nu_i \} \bigr) = \sum_{i=1}^I \biggl[ \frac{\nu_i}{2} \log \bigl| \nu_i A_i^{T} \mathbf{S} A_i \bigr| - \log \bigl[ 4 \pi \Gamma(\nu_i - 1) \bigr] + \frac{\nu_i - 3}{2} \log \bigl| \hat Q_i \bigr| - \frac{\nu_i}{2} \mathrm{tr} (A^T_i \mathbf{S} A_i \hat Q_i) \biggr] \; .\]

In order to proceed, we now consider the prior distribution for the inverse source quadrupole $\mathbf{S}$. For this quantity, too, we adopt a Wishart prior, with meta-parameters $\hat O_0$ and $\nu_0$:

\[\log P(\mathbf{S} \mid \hat O_0, \nu_0) = \frac{\nu_0}{2} \log \bigl| \nu_0 \hat O_0 \bigr| - \log \bigl[ 4 \pi \Gamma(\nu_0 - 1) \bigr] + \frac{\nu_0 - 3}{2} \log | \mathbf{S} | - \frac{\nu_0}{2} \mathrm{tr}(\hat O_0 \mathbf{S}) \; .\]

Note that our choice of prior for $\mathbf{S}$ implies that the source quadrupole itself has an inverse-Wishart prior. The product of the prior and the likelihood, taken as a function of $\mathbf{S}$, still has the same form as the prior. To show this, one can use the cyclic property of the trace and the expression for the determinant of a product of matrices. The posterior will have updated meta-parameters

\[\nu = \nu_0 + \sum_{i=1}^I \nu_i = \sum_{i=0}^I \nu_i\]

and

\[\bar O = \frac{1}{\nu} \biggl[ \nu_0 \hat O_0 + \sum_{i=1}^I \nu_i A_i \hat Q_i A_i^T \biggr] = \frac{1}{\nu} \sum_{i=0}^I \nu_i \hat O_i \; ,\]

where we have called $\hat O_i \equiv A_i \hat Q_i A_i^T$. The expression above shows that, essentially, the parameter $\bar O$, representing an estimate for the source quadrupole, is a weighted average of the measured quadrupole moments projected into the source plane, with weights $\nu_i$.
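
The following sketch (illustrative names; the prior term is omitted for brevity, i.e. $\nu_0 = 0$) computes this weighted average of the projected quadrupoles:

```python
import numpy as np

def source_quadrupole_estimate(A_list, Q_hat_list, nu_list):
    """Weighted source-plane quadrupole estimate O_bar and total weight nu (nu_0 = 0)."""
    nu = float(sum(nu_list))
    O_bar = np.zeros((2, 2))
    for A, Q_hat, nu_i in zip(A_list, Q_hat_list, nu_list):
        O_bar += nu_i * (A @ Q_hat @ A.T)   # O_hat_i = A_i Q_hat_i A_i^T
    # A proper prior would simply add a nu_0 * O_hat_0 term here and nu_0 to nu.
    return O_bar / nu, nu
```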

The evidence associated to the updated distribution is

\[\log E = \sum_{i=0}^I \frac{\nu_i}{2} \log \bigl| \nu_i \hat O_i \bigr| - \frac{\nu}{2} \log \bigl| \nu \bar O \bigr| - \sum_{i=0}^I \log \bigl[ 4 \pi \Gamma(\nu_i - 1) \bigr] + \log \bigl[ 4 \pi \Gamma(\nu - 1) \bigr] - \frac{3}{2} \sum_{i=1}^I \log \bigl| \hat Q_i \bigr| \; .\]

This expression, whose symmetry is evident, depends on the lensing model only through the first two terms, since both are functions of the set of lensing Jacobians $\{ A_i \}$. The other terms, instead, depend only on the measurements. Note also how the largest evidence is obtained when all projected quadrupole moments $\{ \hat O_i \}$ agree, as in this case there are no cancellation effects in the expression for $\bar O$.

As before, we can consider an uninformative improper prior characterized by $\nu_0 = 0$. In that case the expressions above remain essentially the same, with all sums starting from $i = 1$.
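
A hedged sketch of the corresponding log-evidence for $\nu_0 = 0$, following the expression above term by term (the names are illustrative, and `gammaln` is the log-Gamma function from scipy):

```python
import numpy as np
from scipy.special import gammaln

def quadrupole_log_evidence(A_list, Q_hat_list, nu_list):
    """log E for the quadrupole measurements with nu_0 = 0 (improper prior)."""
    nu = float(sum(nu_list))
    O_list = [A @ Q @ A.T for A, Q in zip(A_list, Q_hat_list)]
    O_bar = sum(n * O for n, O in zip(nu_list, O_list)) / nu
    log_E = -0.5 * nu * np.log(np.linalg.det(nu * O_bar)) \
        + np.log(4 * np.pi) + gammaln(nu - 1)
    for Q, n, O in zip(Q_hat_list, nu_list, O_list):
        log_E += 0.5 * n * np.log(np.linalg.det(n * O)) \
            - np.log(4 * np.pi) - gammaln(n - 1) \
            - 1.5 * np.log(np.linalg.det(Q))
    return log_E
```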

Time-delays

As mentioned already, the equations that describe the time-delays are formally identical to the ones that describe the magnitudes. Both are scalar quantities that depend on a single scalar source parameter (the unlensed magnitude $M$ or the reference time $T$) through additive terms (the lensing moduli $\mathit{LM}_i$ or the time-delays $T_i$). Moreover, for the measurements of both quantities we assume simple normal distributions.

Therefore, all equations found in the previous section apply, with the appropriate variable changes, to time-delay measurements. In particular, with a flat prior for $T$ we find

\[\tau = \sum_{i=1}^I \tau_i \; , \qquad \bar T = \tau^{-1} \sum_{i=1}^I \tau_i (\hat t_i - T_i) \; ,\]

and evidence

\[2 \log E_0 = \sum_{i=1}^I \biggl[ \log \frac{\tau_i}{2\pi} - \tau_i \bigl( \hat t_i - T_i - \bar T \bigr)^2 \biggr] - \log \frac{\tau}{2\pi} \; .\]
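
Finally, a hedged sketch of the time-delay case, mirroring the magnitude sketch above (illustrative names only; `t_hat` stands for the measured arrival times $\hat t_i$, `T_model` for the model time-delays $T_i$, and `tau` for the precisions $\tau_i$):

```python
import numpy as np

def time_delay_evidence_flat(t_hat, T_model, tau):
    """Posterior mean reference time and log E_0 for time delays (flat prior on T)."""
    T_hat = np.asarray(t_hat, dtype=float) - np.asarray(T_model, dtype=float)   # hat t_i - T_i
    tau = np.asarray(tau, dtype=float)
    tau_tot = tau.sum()
    T_bar = (tau * T_hat).sum() / tau_tot
    two_log_E0 = (np.log(tau / (2 * np.pi)) - tau * (T_hat - T_bar) ** 2).sum() \
        - np.log(tau_tot / (2 * np.pi))
    return T_bar, 0.5 * two_log_E0
```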